
Switch default XML parser to Saxon?

Open • jan-niestadt opened this issue 3 years ago • 1 comment

By default, BlackLab uses the XML library VTD-XML to process documents while indexing. VTD-XML only supports XPath 1.0.

@eduarddrenth made it possible to use Saxon, a more feature-rich (supports XPath 3) and potentially faster alternative, but it does use more memory while indexing. This may not be a problem in most cases, however.

We should consider changing the default to Saxon, while keeping VTD-XML available for those who want it. If we decide to do this, we should be careful about breaking backwards compatibility.

One solution would be to version .blf.yaml files. For example, if the file starts with

version: 2

# What element starts a new document?
documentPath: //document

...

it automatically defaults to Saxon instead of VTD-XML. We should clearly document the change as well, of course.
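To make the versioning idea concrete, here is a minimal sketch of how the default could be picked from the config text. It is not BlackLab code: the function names and the "vtd"/"saxon" return values are hypothetical, the top-level `processor` key is assumed here as an explicit override, and a real implementation would use the existing YAML parser instead of a regex.

```python
import re

# Hypothetical identifiers for the two XML processors discussed in this issue.
VTD = "vtd"
SAXON = "saxon"


def top_level_value(config_text, key):
    """Return the raw value of an unindented `key: value` line, or None."""
    # Indented (nested) keys are skipped because match() anchors at column 0.
    pattern = re.compile(r"^" + re.escape(key) + r"\s*:\s*(.*?)\s*(?:#.*)?$")
    for line in config_text.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1) or None
    return None


def default_processor(config_text):
    """Pick the XML processor for a .blf.yaml-style config (sketch only)."""
    # Assumption: an explicit top-level `processor` key always wins,
    # so users who want VTD-XML can keep it in a version 2 file.
    explicit = top_level_value(config_text, "processor")
    if explicit is not None:
        return explicit.lower()

    # Proposed rule: `version: 2` (or higher) switches the default to Saxon.
    version = top_level_value(config_text, "version")
    if version is not None and version.isdigit() and int(version) >= 2:
        return SAXON

    # No version key means an existing config, so the old default is kept.
    return VTD


if __name__ == "__main__":
    versioned = "version: 2\n\n# What element starts a new document?\ndocumentPath: //document\n"
    legacy = "documentPath: //document\n"
    print(default_processor(versioned))
    print(default_processor(legacy))
```

The point of the fallback branch is backwards compatibility: files without a `version` key behave exactly as they do today, and only files that opt in get the new default.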

Some older (and, dare I say, janky) features could be deprecated if Saxon's better XPath support obviates the need for them.

jan-niestadt, Jul 23 '22 15:07

Multiple values are now supported, see #393 and #394. Using processing steps on annotations or standoffAnnotations produces an error. Those steps can likely be done in XPath 3 instead, so they wouldn't need a special feature anymore. We still need to test this more before thinking about switching the default parser, though.

jan-niestadt, Feb 21 '23 14:02