Improved AsyncParser API
Version
4.6.0-SNAPSHOT
Feature
Issues:
- Closing the iterators returned by AsyncParser does not abort the parsing process. In fact, repeatedly abandoning iterators causes parsing threads to silently pile up.
- AsyncParser's default chunk size of 100K tuples introduces a long delay, which makes it unsuitable for content probing.
- The EltStreamRDF class is private. As mentioned in JENA-2309, its events would be useful in a Hadoop/Spark setting to scan for prefixes and stop the parser as soon as only data is seen.
PR https://github.com/apache/jena/pull/1478 adds the following improvements:
- Changed AsyncParser API to return IteratorCloseables whose close() method actually cancels parsing.
- Added a public EvtStreamRDF interface for the parsing events. The existing private EltStreamRDF class remains as the internal data object. The naming is up for discussion :)
- Added a Builder that gives control over chunk and queue sizes: AsyncParserNew.Builder.of(in, Lang.TRIG, null).setChunkSize(100).asyncParseQuads(); The builder also has a new asyncParseIterator method which returns IteratorCloseable<EvtStreamRDF>.
- If a parser fails, then all remaining parsers are still started with a destination in 'interrupted state' in order for them to close their resources.
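To illustrate the intended close() semantics independently of Jena, here is a minimal JDK-only sketch. It is not the code from the PR: the class CancellableAsyncIterator, its constructor parameters, and isProducerAlive() are hypothetical names made up for this example, and integers stand in for triples/quads. A producer thread fills a bounded queue in chunks, and close() interrupts that thread so it cannot linger after the consumer abandons the iterator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for an async parser iterator; not Jena code. */
public class CancellableAsyncIterator implements Iterator<Integer>, AutoCloseable {
    // End-of-stream marker, recognised by identity (never by equals()).
    private static final List<Integer> END = new ArrayList<>();

    private final BlockingQueue<List<Integer>> queue;
    private final Thread producer;
    // Consumer-side state; this sketch assumes a single consumer thread.
    private Iterator<Integer> current = Collections.emptyIterator();
    private boolean finished = false;

    public CancellableAsyncIterator(int totalItems, int chunkSize, int queueSize) {
        this.queue = new ArrayBlockingQueue<>(queueSize);
        this.producer = new Thread(() -> {
            try {
                List<Integer> chunk = new ArrayList<>(chunkSize);
                for (int i = 0; i < totalItems; i++) {
                    chunk.add(i); // stands in for "one parsed tuple"
                    if (chunk.size() == chunkSize) {
                        queue.put(chunk); // blocks while the queue is full
                        chunk = new ArrayList<>(chunkSize);
                    }
                }
                if (!chunk.isEmpty()) {
                    queue.put(chunk);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                // close() interrupted us: stop producing and let the thread end.
            }
        }, "producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        while (!finished && !current.hasNext()) {
            try {
                List<Integer> chunk = queue.take();
                if (chunk == END) {
                    finished = true;
                } else {
                    current = chunk.iterator();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                finished = true;
            }
        }
        return current.hasNext();
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    /** Cancels the producer instead of merely dropping the iterator. */
    @Override
    public void close() {
        finished = true;
        current = Collections.emptyIterator();
        producer.interrupt();
        try {
            producer.join(1000); // wait briefly for the producer to exit
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public boolean isProducerAlive() {
        return producer.isAlive();
    }

    public static void main(String[] args) {
        // A small chunk size makes the first item available quickly (content probing).
        CancellableAsyncIterator it = new CancellableAsyncIterator(1_000_000, 100, 10);
        System.out.println("first item: " + it.next());
        it.close();
        System.out.println("producer alive after close: " + it.isProducerAlive());
    }
}
```

In this sketch the chunk size bounds how many items must be produced before the consumer sees the first one, which is the same trade-off behind making the chunk size configurable for content probing.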
Are you interested in contributing a solution yourself?
Yes