Full-Text Indexing: Mixed Content
Presently, only text nodes and attribute values end up in the BaseX indexes. Whenever a path expression points to a text node (or an element that only has text nodes as children), it can be rewritten for index access, no matter what the full path looks like. This design decision turned out to be powerful for exact searches and for full-text queries on arbitrary text nodes, but it is too inflexible for mixed-content data.
A few years ago, we added features to restrict indexing to the text nodes of specific element names. We could enhance this approach for full-text queries:
- Index the string value of specific elements, which are specified via FTINCLUDE, and
- rewrite only those paths for index access that do not address descendants of the indexed elements.
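For reference, a minimal sketch of how FTINCLUDE can be set today when a database is created from XQuery. The database name 'tei-demo', the path 'doc.xml' and the sample content are made up for illustration. This only shows the existing option, which restricts indexing to the text nodes of the named elements; indexing their string values is the proposal above, not current behavior.

```xquery
(: Sketch only: 'tei-demo' and 'doc.xml' are hypothetical names. :)
db:create(
  'tei-demo',
  <div>
    <head>No. 2, September 2006</head>
    <p>It was clearly popular.</p>
  </div>,
  'doc.xml',
  (: enable the full-text index and restrict it to head and p :)
  map { 'ftindex': true(), 'ftinclude': 'head,p' }
)
```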
As an example, a user might want to query the head and p elements of a TEI document:
<div>
<head>No. 2, September 2006</head>
<p>It was clearly popular, for it appears in Peter Stent’s
advertisements of 1654 and 1662, and is still listed in his successor
John Overton’s catalogue of 1673,<note>Alexander Globe, <title
level="m">Peter Stent, London Printseller, c.</title> 1642-65
(Vancouver, 1985), p. 123 (no.*448).</note> yet only the unique
impression in the British Museum's Department of Prints and Drawings
survives - testimony to the great rarity of such popular material.</p>
</div>
The following queries could then be evaluated via the index:
/div[head contains text '2006']
//p[. contains text 'popular']
Queries such as the following ones would not be rewritten for index access anymore:
//p[text() contains text 'popular']
Why not rewrite //p[text() contains text 'popular'] as //p[text()[. contains text 'popular']]? Would it then use the index?
Or, maybe better, text()[. contains text 'popular']/..[self::p]?
If we can assess at compile time that all p elements in a database are leaf elements (i.e., have a single text node as child), we could indeed rewrite //p[text() contains text 'popular'] for index access, too.
Otherwise, if p elements have child elements, we don’t know which substring of the indexed string value occurs in a given text node. The following two expressions yield different results:
<p>popular<suffix>s</suffix></p> contains text 'popular',
<p>popular<suffix>s</suffix></p>/text() contains text 'popular'
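To make the difference concrete, here is a small sketch that evaluates both expressions side by side and also prints the tokens of the element's string value. It assumes default full-text options (in particular, no stemming).

```xquery
let $p := <p>popular<suffix>s</suffix></p>
return (
  (: the string value is "populars", which yields a single token :)
  ft:tokenize(string($p)),
  (: the element is atomized to "populars": no token equals "popular" :)
  $p contains text 'popular',
  (: the only text child of p is "popular" :)
  $p/text() contains text 'popular'
)
(: → "populars", false, true :)
```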
Without regard to practicality of indexing (because I have no idea!),
//p[normalize-space(.) contains text 'popular']
is what I'm usually after -- where is this phrase in the document? There can be a lot of inline markup and for "where's the phrase?" purposes I want to know the nearest common ancestor of all the text nodes in the phrase.
For finding the nearest common ancestor elements, it is still advisable to search at the text node level:
let $xml := document {
<p>
There’s a <b>popular</b> saying …
</p>
}
return $xml//p//text()[. contains text 'popular']/ancestor::*[1] (: → <b>...</b> :)
If nodes are atomized, things get complicated because the found tokens may be spread across different node levels. The token in the following query is assembled from the child text nodes of p and b:
let $xml := document {
<p>There’s a <b>p</b>opular saying …</p>
}
return $xml//p[. contains text 'popular'] (: → <p>...</p> :)
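As a contrast, a sketch that runs the text-node-level search from the earlier example against the same document. No single text node contains the whole token, so only the atomized p element matches.

```xquery
let $xml := document {
  <p>There’s a <b>p</b>opular saying …</p>
}
return (
  (: text nodes are "There’s a ", "p" and "opular saying …" :)
  count($xml//p//text()[. contains text 'popular']),
  (: the string value of p is "There’s a popular saying …" :)
  count($xml//p[. contains text 'popular'])
)
(: → 0, 1 :)
```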
About normalize-space(.): full-text tokenization already includes this step (so you can replace normalize-space(.) by .), and it additionally removes diacritics, normalizes case, etc. The behavior can be made explicit by calling ft:tokenize.
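A minimal sketch for comparing the two on a made-up input string. With default options, the tokens returned by ft:tokenize should come back lower-cased and without diacritics, as described above, whereas normalize-space only collapses whitespace.

```xquery
(: $raw is an arbitrary sample string, chosen for illustration :)
let $raw := '  Crème   BRÛLÉE '
return (
  (: tokens as used by 'contains text' with default options :)
  ft:tokenize($raw),
  (: whitespace is collapsed; case and diacritics are left untouched :)
  normalize-space($raw)
)
```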
I managed to express the use case in a muddled way; apologies!
let $xml := document {
  <bucket>
    <title>Complex Reference</title>
    <p>There's a <i>complex <link>reference</link></i> to this document.</p>
  </bucket>
}
return $xml//*[. contains text 'complex reference']
This is the kind of search I want to run against a relatively large amount of content (e.g., a national legal code) where the specific element is not known in advance. In principle it could be any of a number of elements; in practice it is a variety of elements expressing different semantics. In some cases you want the titles and in other cases the references, but the first step is to find everywhere the phrase occurs. The goal is to get the closest containing element that holds all the text nodes of the searched phrase.
The query above returns every element for which the predicate is true, which is what it's supposed to do:
<bucket>
<title>Complex Reference</title>
<p>There's a <i>complex <link>reference</link>
</i> to this document.</p>
</bucket>
<title>Complex Reference</title>
<p>There's a <i>complex <link>reference</link>
</i> to this document.</p>
<i>complex <link>reference</link>
</i>
But ideally there'd be a way to do the "closest ancestor" version with the case where it's a multi-word phrase with components in different text nodes. My (probably naive) thought is that maybe there could be an index of string properties of elements, which would allow returning the closest containing element of the full-text match.
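Until something like that exists, one possible workaround (a sketch, not an index-backed solution) is to collect all matching elements and keep only those that have no matching descendant element. This tests every element against the phrase, so it will probably be slow on large databases.

```xquery
let $xml := document {
  <bucket>
    <title>Complex Reference</title>
    <p>There's a <i>complex <link>reference</link></i> to this document.</p>
  </bucket>
}
(: all elements whose string value contains the phrase :)
let $hits := $xml//*[. contains text 'complex reference']
(: keep only the innermost ones: no descendant element is itself a hit :)
return $hits[not(descendant::* intersect $hits)]
(: → the <title> and <i> elements :)
```

If the phrase is split across sibling elements, the result is their closest common ancestor, because that is the innermost element whose string value still contains the whole phrase.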
Postponed to a later version.