rdflib

Streaming parsers

Open gromgull opened this issue 10 years ago • 7 comments

This is a very much incomplete branch for reworking the interface between graphs and parsers.

By introducing a new Sink object, it becomes possible to write streaming RDF processors that process the triples "as they come in". Since you do not store them all in a graph, you can work on files much larger than what fits in memory.
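To make the idea concrete, here is a minimal sketch of the pattern, not the code in this branch. The names `PredicateCountingSink` and `stream_simple_ntriples` and the `triple(s, p, o)` callback are assumptions made up for illustration, and the "parser" only handles a simplified one-triple-per-line subset of N-Triples. It uses no RDFLib calls at all.

```python
import io
from collections import Counter


class PredicateCountingSink:
    """Hypothetical sink: receives triples one at a time and keeps only counts."""

    def __init__(self):
        self.counts = Counter()

    def triple(self, s, p, o):
        # Nothing is stored except the running tally, so memory use does not
        # grow with the number of triples seen.
        self.counts[p] += 1


def stream_simple_ntriples(lines, sink):
    """Hand each triple to the sink as soon as its line has been read.

    Assumption: one triple per line, terms separated by whitespace, line
    terminated by '.'. This is NOT a complete N-Triples parser.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not line.endswith("."):
            raise ValueError("expected line to end with '.': %r" % line)
        # maxsplit=2 keeps an object literal containing spaces in one piece
        s, p, o = line[:-1].strip().split(None, 2)
        sink.triple(s, p, o)


# Any iterable of lines works, e.g. an open file read lazily line by line.
data = io.StringIO(
    "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .\n"
    "<http://example.org/b> <http://example.org/knows> <http://example.org/c> .\n"
    '<http://example.org/a> <http://example.org/name> "Alice" .\n'
)
sink = PredicateCountingSink()
stream_simple_ntriples(data, sink)
```

Because the parser only ever talks to the `triple` callback, a different sink (one that filters, rewrites, or adds to a graph) could be swapped in without touching the parsing code.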

As usual, a pull-request to trigger Travis.

gromgull avatar Jul 14 '14 20:07 gromgull

:+1:

joernhees avatar Jul 17 '14 15:07 joernhees

I'd benefit from this feature, and so would (adding up the views) at least 983 + 467 + 5704 + 90 = 7244 other people. I'm going to take a look and see if I can get it to a state where the tests pass and there are no merge conflicts, although I'm not familiar with the codebase and the (existing) code around it is a horrible mess in at least a couple of ways that immediately struck me:

  • there are Parser and InputSource interfaces, but how exactly they're meant to behave is unclear and all their methods are simply documented TODO:
  • there's both an NTParser and an NTriplesParser, without any explanation of the difference between them

I'll see if I can figure it all out, but no promises.

ExplodingCabbage avatar Feb 06 '16 19:02 ExplodingCabbage

Good lord, 470 errors and 154 failures after merging this into today's code. I might give up on this exercise and leave it to somebody who understands both the codebase and RDF itself better than I do, but I'll keep poking a little first...

ExplodingCabbage avatar Feb 06 '16 19:02 ExplodingCabbage

Step 1 should be to rebase this on the current master; it has changed a bit since I did this work.

gromgull avatar Feb 07 '16 08:02 gromgull

The NTriplesParser was (once upon a time) a standalone project, without RDFLib; NTriples is the wrapper that makes it fit the RDFLib parser interface. I think I removed it in some commit here somewhere?

gromgull avatar Feb 07 '16 08:02 gromgull

@nicholascar this is another thing I would consider for a 6.0.0 release!

I am not sure if the work here is even sensible as a starting point any more - but making a unified "sink" object across parsers seems like a good idea.

gromgull avatar Mar 28 '20 07:03 gromgull

@gromgull yes I’ve seen this work and agree: unified would be good! I’ve tagged it for 6.0.0 now so it’s on the radar.

Might be one of those good architectural tidy-ups to do once things like ditching Py2 and perhaps the graph IDs parts have been actioned.

nicholascar avatar Mar 28 '20 09:03 nicholascar