rdflib
Streaming parsers
This is a very much incomplete branch for reworking the interface between graphs and parsers.
By introducing a new `Sink` object, it becomes possible to write streaming RDF processors that process the triples "as they come in". Since you do not store them all in a graph, you can work on files much larger than what fits in memory.
As usual, a pull-request to trigger Travis.
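To illustrate the sink idea described above, here is a minimal sketch in plain Python. It is not the code from this branch and does not use RDFLib's API: `CountingSink`, its `triple()` method, and `stream_ntriples()` are hypothetical names chosen for illustration, and the line reader only handles simple N-Triples input rather than the full grammar. The point is only that each triple is handed to the sink and then forgotten, so memory use does not grow with the size of the file.

```python
import io
from collections import Counter


class CountingSink:
    """Hypothetical sink: receives one triple at a time and keeps only a tally."""

    def __init__(self):
        self.total = 0
        self.predicates = Counter()

    def triple(self, s, p, o):
        # Nothing is stored per triple, only running counts.
        self.total += 1
        self.predicates[p] += 1


def stream_ntriples(lines, sink):
    """Toy line-based reader for simple N-Triples; not a conforming parser."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not line.endswith("."):
            raise ValueError(f"expected a terminating '.': {line!r}")
        # Drop the final '.', then split into subject, predicate, and the rest
        # (the object may be a literal containing spaces).
        s, p, o = line[:-1].rstrip().split(None, 2)
        sink.triple(s, p, o)


data = io.StringIO(
    '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .\n'
    '# a comment\n'
    '<http://example.org/a> <http://example.org/name> "Alice A." .\n'
    '<http://example.org/b> <http://example.org/knows> <http://example.org/a> .\n'
)
sink = CountingSink()
stream_ntriples(data, sink)
print(sink.total, dict(sink.predicates))
```

Because `stream_ntriples()` iterates over any iterable of lines, the same call would work with an open file handle, which is read lazily line by line.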
:+1:
I'd benefit from this feature, and so would (adding up the views) at least 983 + 467 + 5704 + 90 = 7244 other people. I'm going to take a look and see if I can get it to a state where the tests pass and there are no merge conflicts, although I'm not familiar with the codebase and the (existing) code around it is a horrible mess in at least a couple of ways that immediately struck me:
- there are `Parser` and `InputSource` interfaces, but how exactly they're meant to behave is unclear and all their methods are simply documented `TODO:`
- there's both an `NTParser` and an `NTriplesParser`, without any explanation of the difference between them
I'll see if I can figure it all out, but no promises.
Good lord, 470 errors and 154 failures after merging this into today's code. I might give up on this exercise and leave it to somebody who understands both the codebase and RDF itself better than I do, but I'll keep poking a little first...
Step 1 should be to rebase this on the current master; it has changed a bit since I did this work.
The `NTriplesParser` was, once upon a time, a standalone project without RDFLib; `NTriples` is the wrapper that makes it fit the RDFLib parser interface. I think I removed it in some commit here somewhere?
@nicholascar this is another thing I would consider for a 6.0.0 release!
I am not sure if the work here is even sensible as a starting point any more, but making a unified "sink" object across parsers seems like a good idea.
@gromgull yes I’ve seen this work and agree: unified would be good! I’ve tagged it for 6.0.0 now so it’s on the radar.
Might be one of those good architectural tidy-ups once things like ditching Py2 and perhaps the graph IDs parts have been actioned.