Improved AsyncParser API

Open Aklakan opened this issue 3 years ago • 0 comments

Version

4.6.0-SNAPSHOT

Feature

Issues:

  • Closing the iterators returned by AsyncParser does not abort the parsing process. In fact, repeatedly abandoning iterators causes parsing threads to pile up silently.
  • AsyncParser's default chunk size of 100K tuples introduces a long delay, which makes it unsuitable for content probing.
  • The EltStreamRDF class is private. As mentioned in JENA-2309, its events would be useful in a Hadoop/Spark setting to scan for prefixes and to stop the parser once only data remains.
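To make the prefix-probing use case concrete, here is a minimal sketch in plain Java. It does not use Jena: `PrefixProbe`, `Event` and `scanPrefixes` are hypothetical names made up for illustration, and `Event` only stands in for whatever event type the parser would expose. The sketch also assumes, as a simplification, that all prefix declarations come before the first data item.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of prefix probing; Event is a stand-in, not a Jena type. */
class PrefixProbe {
    /** A parse event: either a prefix declaration or a data item (triple/quad). */
    static final class Event {
        final String prefix; // null for data events
        final String iri;    // namespace IRI for prefix events, null otherwise
        final String data;   // null for prefix events

        private Event(String prefix, String iri, String data) {
            this.prefix = prefix;
            this.iri = iri;
            this.data = data;
        }

        static Event prefix(String prefix, String iri) { return new Event(prefix, iri, null); }
        static Event data(String data) { return new Event(null, null, data); }
        boolean isPrefix() { return prefix != null; }
    }

    /** Collects the leading prefix declarations and stops at the first data event. */
    static Map<String, String> scanPrefixes(Iterator<Event> events) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        while (events.hasNext()) {
            Event e = events.next();
            if (!e.isPrefix()) {
                break; // assumption: only data follows, so stop reading here
            }
            prefixes.put(e.prefix, e.iri);
        }
        return prefixes;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            Event.prefix("ex", "http://example.org/"),
            Event.prefix("foaf", "http://xmlns.com/foaf/0.1/"),
            Event.data("ex:a foaf:knows ex:b"),
            Event.data("ex:b foaf:knows ex:c"));
        System.out.println(scanPrefixes(events.iterator()));
    }
}
```

With a small chunk size the first events would arrive quickly, and with an iterator whose close() cancels parsing the caller could release the parser thread right after the scan returns.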

PR https://github.com/apache/jena/pull/1478 adds the following improvements:

  • Changed AsyncParser API to return IteratorCloseables whose close() method actually cancels parsing.
  • Added a public EvtStreamRDF interface for the parsing events. The existing private EltStreamRDF class remains as the internal data object. The naming is up for discussion :)
  • Added a Builder that gives control over chunk and queue sizes: AsyncParserNew.Builder.of(in, Lang.TRIG, null).setChunkSize(100).asyncParseQuads();. The builder also has a new asyncParseIterator method which returns IteratorCloseable<EvtStreamRDF>.
  • If a parser fails, all remaining parsers are still started, but with a destination in an 'interrupted' state, so that they can close their resources.
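The idea behind a close() that actually cancels the producer can be sketched with the JDK alone. The class below is not the code from the PR and not Jena's AsyncParser; `CancellableAsyncIterator` and `isProducerAlive` are hypothetical names, and chunking is left out, so only the queue size is configurable. It assumes a single consumer thread and a source that never yields null.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.stream.Stream;

/** Hypothetical sketch: an iterator fed by a background thread that close() stops. */
class CancellableAsyncIterator<T> implements Iterator<T>, AutoCloseable {
    private static final Object END = new Object(); // end-of-stream marker

    private final BlockingQueue<Object> queue;
    private final Thread producer;
    private Object next;             // look-ahead element, null if none is buffered
    private volatile boolean closed; // set by close()

    CancellableAsyncIterator(Iterator<T> source, int queueSize) {
        BlockingQueue<Object> q = new ArrayBlockingQueue<>(queueSize);
        this.queue = q;
        this.producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    q.put(source.next()); // blocks while the queue is full
                }
                q.put(END);
            } catch (InterruptedException e) {
                // close() interrupted us: stop producing and let the thread end
            }
        }, "async-producer");
        this.producer.setDaemon(true);
        this.producer.start();
    }

    @Override
    public boolean hasNext() {
        if (closed) {
            return false;
        }
        if (next == null) {
            try {
                next = queue.take(); // wait for the producer
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = END;
            }
        }
        return next != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = (T) next;
        next = null;
        return result;
    }

    /** Interrupts the producer and waits until its thread has finished. */
    @Override
    public void close() {
        closed = true;
        producer.interrupt();
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    boolean isProducerAlive() { return producer.isAlive(); }

    public static void main(String[] args) {
        Iterator<Integer> endless = Stream.iterate(0, i -> i + 1).iterator();
        try (CancellableAsyncIterator<Integer> it = new CancellableAsyncIterator<>(endless, 2)) {
            for (int i = 0; i < 3 && it.hasNext(); i++) {
                System.out.println(it.next());
            }
        } // close() runs here, so the producer thread does not linger
    }
}
```

Without the interrupt in close(), the producer in this sketch would stay blocked on the full queue forever, which is the thread pile-up described in the first issue above.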

Are you interested in contributing a solution yourself?

Yes

Aklakan, Aug 12 '22 15:08