BlackLab
Allow all valid XML element names (indexing and querying)
Dashes and periods in field, annotation and tag names were not allowed (partially solved now, see below).
E.g. this query did not parse:
<named-entity/> containing "dog"
but this one does:
<namedentity> containing "dog"
We should make sure that all valid XML tag names can be used in indexing and CQL queries as well.
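For reference, "valid XML tag name" can be pinned down with the Name production from the XML 1.0 specification. The following is a minimal sketch in Python, not BlackLab code: the function name is_valid_xml_name is made up for illustration, and the character ranges are transcribed from the NameStartChar and NameChar productions.

```python
import re

# Character ranges from the XML 1.0 NameStartChar production.
NAME_START = (
    r":A-Z_a-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    r"\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds dash, period, digits and a few combining characters.
NAME_CHAR = NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

XML_NAME = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_valid_xml_name(name: str) -> bool:
    """Return True if name matches the XML 1.0 Name production (hypothetical helper)."""
    return XML_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["namedentity", "named-entity", "named.entity", "-entity", "1st"]:
        print(candidate, is_valid_xml_name(candidate))
```

Dash and period are allowed anywhere except as the first character, so both named-entity and named.entity are valid names that indexing and querying should eventually accept.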
Dash is now partially allowed, see 6e1f1a51f and 099feb20d. Most importantly, querying XML tags with dashes shouldn't cause issues anymore.
A related issue is that dot should also be allowed, e.g. <named.entity type='person' /> (it is valid in XML element names). This is trickier because of global constraints like A:[] 'and' B:[] :: A.word = B.word, where A.word needs to be parsed as separate tokens with a . operator in between. So there should be a separate TAGNAME token in the parser. You would probably need JavaCC lexical states to deal with this: https://javacc.github.io/javacc/faq.html#question-3.14 Doable, but not a priority right now.
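To illustrate the lexical-state idea, here is a rough sketch of a hand-written tokenizer in Python. It is not BlackLab's parser and not JavaCC syntax: the token names other than TAGNAME, the state names DEFAULT and TAG, and the patterns are all assumptions made for this example. It also assumes every < starts a tag, and it only accepts ASCII tag names.

```python
import re

# Illustrative token sets only; the real grammar has many more tokens.
STATES = {
    "DEFAULT": [
        ("WS", r"\s+"),
        ("STRING", r"\"[^\"]*\"|'[^']*'"),
        ("NAME", r"[A-Za-z_][A-Za-z0-9_]*"),  # no dot or dash here
        ("DOT", r"\."),
        ("LT", r"<"),
        ("SYMBOL", r"::|[^\s]"),
    ],
    "TAG": [
        ("SLASH", r"/"),
        ("TAGNAME", r"[A-Za-z_:][A-Za-z0-9_:.\-]*"),  # dot and dash allowed
    ],
}
COMPILED = {
    state: [(kind, re.compile(pattern)) for kind, pattern in rules]
    for state, rules in STATES.items()
}


def tokenize(query):
    """Split query into (kind, text) pairs, switching state after '<'."""
    state, pos, tokens = "DEFAULT", 0, []
    while pos < len(query):
        for kind, pattern in COMPILED[state]:
            match = pattern.match(query, pos)
            if match:
                break
        else:
            raise ValueError(f"unexpected character at position {pos}: {query[pos]!r}")
        pos = match.end()
        if kind != "WS":
            tokens.append((kind, match.group()))
        if kind == "LT":
            state = "TAG"       # the next name is a tag name
        elif kind == "TAGNAME":
            state = "DEFAULT"   # back to normal tokenization
    return tokens


if __name__ == "__main__":
    print(tokenize("<named.entity type='person' />"))
    print(tokenize("A:[] 'and' B:[] :: A.word = B.word"))
```

In the default state, A.word is split into a name, a dot and another name, while directly after < the whole of named.entity is read as one TAGNAME token. A JavaCC grammar would express the same switch declaratively with lexical states rather than with an explicit state variable.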
Another issue is that dash is still sanitized in input format configs. Removing this sanitization rule could cause compatibility issues. We'll do this when we introduce a new version of the input format config. The warning reflects this now.