BlackLab

Allow all valid XML element names (indexing and querying)

Open · jan-niestadt opened this issue 2 years ago · 1 comment

Dash and period in field, annotation and tag names were not allowed (partially solved now, see below).

E.g. this query did not parse:

<named-entity/> containing "dog"

but this one does:

<namedentity> containing "dog"

We should make sure that all valid XML tag names can be used both in indexing and in CQL queries.
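To make "valid XML tag name" concrete, here is a minimal sketch of a checker based on the Name production of the XML 1.0 specification. It is only an illustration in Python; the helper name is_valid_xml_name is hypothetical and not part of BlackLab, and the colon is accepted because the base XML grammar permits it (namespace-aware processing treats it specially).

```python
import re

# Character ranges from the XML 1.0 (Fifth Edition) "Name" production.
# NameStartChar: characters allowed at the start of a name.
NAME_START = (
    ":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar: everything in NameStartChar plus dash, dot, digits and a few others.
NAME_CHAR = NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

XML_NAME = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_valid_xml_name(name: str) -> bool:
    """Return True if `name` matches the XML 1.0 Name production (hypothetical helper)."""
    return XML_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["namedentity", "named-entity", "named.entity", "-entity", "1tag"]:
        print(candidate, is_valid_xml_name(candidate))
```

Dash and dot are allowed anywhere except as the first character, which is why both named-entity and named.entity are valid element names.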

jan-niestadt, Nov 11 '22 13:11

Dash is now partially allowed, see 6e1f1a51f and 099feb20d. Most importantly, querying XML tags with dashes shouldn't cause issues anymore.

A related issue is that the dot should also be allowed, e.g. <named.entity type='person' /> (it is valid in XML element names). This is trickier, because global constraints like A:[] 'and' B:[] :: A.word = B.word exist, and there the names on either side of the dot must be parsed as separate tokens with a . operator in between. So the parser needs a separate TAGNAME token, which probably requires JavaCC lexical states (see https://javacc.github.io/javacc/faq.html#question-3.14). Doable, but not a priority right now.
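The idea of lexical states can be sketched outside JavaCC. The toy tokenizer below is an assumption-laden illustration in Python, not BlackLab's actual grammar: the state names (DEFAULT, TAG), the token kinds and the tokenize function are all hypothetical, and tag names are limited to an ASCII subset. After an opening < the lexer switches to a state in which dots and dashes belong to the name; everywhere else the dot is its own token.

```python
import re

# Token rules per lexical state (illustrative only, not BlackLab's grammar).
RULES = {
    "DEFAULT": [
        ("WS", r"\s+"),
        ("CONSTRAINT", r"::"),
        ("LT", r"</?"),
        ("GT", r"/?>"),
        ("STRING", r"\"[^\"]*\"|'[^']*'"),
        ("NAME", r"[A-Za-z_][A-Za-z0-9_]*"),   # no dot or dash here
        ("DOT", r"\."),
        ("SYMBOL", r"[\[\]:=&|!()]"),
    ],
    "TAG": [
        # Directly after '<' or '</': dots and dashes are part of the name.
        ("TAGNAME", r"[A-Za-z_][A-Za-z0-9_.\-]*"),
    ],
}
COMPILED = {
    state: [(kind, re.compile(pattern)) for kind, pattern in rules]
    for state, rules in RULES.items()
}


def tokenize(query):
    """Split a query into (kind, text) pairs, switching state after '<'."""
    state, pos, tokens = "DEFAULT", 0, []
    while pos < len(query):
        for kind, regex in COMPILED[state]:
            match = regex.match(query, pos)
            if match:
                break
        else:
            raise ValueError(f"unexpected input at position {pos}: {query[pos]!r}")
        if kind != "WS":
            tokens.append((kind, match.group()))
        pos = match.end()
        if kind == "LT":
            state = "TAG"        # the next token must be a tag name
        elif kind == "TAGNAME":
            state = "DEFAULT"    # back to normal tokenization


    return tokens


if __name__ == "__main__":
    print(tokenize("<named.entity type='person' />"))
    print(tokenize("A:[] 'and' B:[] :: A.word = B.word"))
```

With this split, named.entity comes out as a single TAGNAME token, while A.word in the global constraint is still tokenized as NAME, DOT, NAME.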

Another issue is that dashes are still sanitized in input format configs. Removing this sanitization rule could cause compatibility issues, so we'll do that when we introduce a new version of the input format config. The warning reflects this now.

jan-niestadt, Nov 14 '22 10:11