Terms within multiple contexts may not be found
With @sydb, we're trying to debug a problem in the DHQ build, where there are many labelled contexts, including contexts for many different languages. When indexing takes place, the "English" context is caught repeatedly and stored multiple times in the stem-file items, but the lower-level contexts (e.g. div[@id='abstract']) do not get noticed, so searching in them fails.
When the language context is redefined as text()[lang('en')], the immediate problem goes away -- the abstract context works -- but the language context itself no longer works. That is because the XPath used to find that context no longer matches anything, so it is simply not getting in the way.
The root question is: what should happen in the case of nested contexts (i.e. a term appears in a structure which has contexts defined at multiple levels)? It should "just work", but I don't know whether we have given much attention to testing this, so that will be the first thing to do. Another question is why we get multiple copies of the same language context in the stem file; surely those should be unique values.
This needs to be fixed for 1.4.x, but it also needs to be added to tests in 2.0 and fixed there if necessary.
@martindholmes and @sydb, I'm not sure I'm understanding the problem fully -- is it an issue with contexts, with the processing of languages, or with the two in combination? A sample with a config might be useful here, just to make it a bit clearer what ought to be tested.
That said, to answer your question, @martindholmes:
> what should happen in the case of nested contexts (a term appears in a structure which has defined contexts at multiple levels)
I believe (and thought) that contexts would always be inherited -- e.g. if you have this:
```html
<div>
  <p>This article contends and argues...</p>
</div>
<div class="text">
  <div class="abstract">In this article, we argue</div>
  <div class="body">
    <p>This article contends...</p>
  </div>
</div>
```
and you have these contexts defined as text (div[@class='text']), abstract (div[@class='abstract']), and
body (div[@class='body']), then I would think that you should get the following:
| term | context | hits |
|---|---|---|
| article | [none] | 3 |
| article | text | 2 |
| article | abstract | 1 |
| article | body | 1 |
| argue | text | 1 |
| argue | abstract | 1 |
| argue | body | 0 |
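To make the expected semantics concrete, here is a minimal Python sketch (purely illustrative, not staticSearch code) of the inheritance rule behind that table: a token counts toward every labelled context on an ancestor-or-self element, so nested contexts accumulate rather than shadow each other.

```python
from collections import Counter

# Each node is (context_label_or_None, text, children); the labels stand in
# for the <div class="..."> contexts in the HTML example above.
tree = (None, "", [
    (None, "This article contends and argues", []),
    ("text", "", [
        ("abstract", "In this article, we argue", []),
        ("body", "This article contends", []),
    ]),
])

def count_hits(node, active, hits):
    label, text, children = node
    contexts = active + [label] if label else active
    for token in text.lower().split():
        for stem in ("article", "argue"):
            if token.startswith(stem):
                hits[(stem, "[none]")] += 1   # unfiltered hit
                for ctx in contexts:          # every inherited context
                    hits[(stem, ctx)] += 1
    for child in children:
        count_hits(child, contexts, hits)
    return hits

hits = count_hits(tree, [], Counter())
# Reproduces the table: hits[("article", "[none]")] == 3,
# hits[("article", "text")] == 2, hits[("argue", "body")] == 0, etc.
```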
The way these contexts are passed through tokenization has been much improved in 2.0, so hopefully this is already fixed there (and if not, it may be an easier fix in 2.0?).
Until I've added a nested context test, and another one with the added complication of using the lang() function, I won't know for sure what's happening. I'm hoping to do that tomorrow, and I'll do it both in dev and in 1.4-release, so we can be clear about the issue.
Initial work on the test to confirm the problem is happening is in branch issue-345-contexts, which branches from release-1.4.
It turns out that our tests already include a simple instance of a nested context: in the "clues.html" file, there is a <blockquote> (context: "Quotations") which contains a <p class="citation"> (context: "Citations"). Searching for the word "Amending", which appears in the latter and therefore also in the former, we correctly get a single hit whether either or both of those contexts are selected. So the problem here is not with nesting itself, unless there is something in the code (@joeytakeda, do you know of anything?) which would trigger a problem when the same context is nested multiple times within itself.
So my suspicion is that the problem here is related to the lang() function, which tests true repeatedly down the tree -- and in fact it will test true for the very element which is also supposed to test true for the other context. In other words, a single element will match both contexts. My next strategy is to reproduce that scenario in the test branch: where a single element matches multiple contexts (as opposed to an ancestor matching one and a descendant matching the other), do both contexts get added, or only one?
Incidentally, @sydb, I think an alternative and perhaps better approach might be to replace the lang()-based contexts with a set of description filters titled something like "Documents containing any of these languages", so that you're not trying to constrain the search to specific bits of documents, but instead to documents which have content in the target languages (you would add a meta tag for this filter for each language contained in the document). Most of the time, when you're searching for terms in a specific language, they're going to be unique to that language anyway, so the result would be the same, but the mechanics would be far simpler.
That would also enable someone to simply list all the documents containing a particular language, without having to search for any term at all, which is very useful in itself.
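If I'm remembering the staticSearch description-filter convention correctly (do check me on this -- the class name is an assumption on my part), that would mean adding meta tags along these lines to the head of each document:

```html
<!-- One tag per language present in the document; "staticSearch.desc"
     is my recollection of the filter class name, so verify against the docs. -->
<meta name="Documents containing any of these languages" class="staticSearch.desc" content="English"/>
<meta name="Documents containing any of these languages" class="staticSearch.desc" content="French"/>
```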
From what I can understand (and apologies, I feel like I may be missing something here), I think this is the major bug of 1.4 that we resolved through the new tokenization process. This is, I think, precisely the use case that we sought to fix, which is explained in much better detail in #219 (see https://github.com/projectEndings/staticSearch/issues/219#issuecomment-1049336069).
I won't duplicate that information here, but the long and the short of it is that in 1.*, the tokenization code was not handling multiple matches for a single context (or for any templates produced by configuration elements of the same kind), since it would recreate attributes rather than tunnelling values. So I think this is already resolved in 2.0, and it can't really be back-ported to 1.* since it requires the revamped tokenization logic.
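The difference between the two behaviours can be sketched in a few lines of Python (pure illustration, not the XSLT; the function names and dict shape are mine): recreating the attribute means the last match wins, while tunnelling accumulates all matches into one value.

```python
def apply_contexts_recreate(node_attrs, matched_ids):
    # 1.x-style: each matching template writes the attribute afresh,
    # so only the last matching context survives on the element.
    for ctx_id in matched_ids:
        node_attrs["ss-ctx-id"] = ctx_id
    return node_attrs

def apply_contexts_tunnel(node_attrs, matched_ids):
    # 2.0-style: matches accumulate (like a tunnelled parameter) and are
    # joined into one space-separated attribute value.
    ids = node_attrs.get("ss-ctx-id", "").split()
    ids.extend(matched_ids)
    node_attrs["ss-ctx-id"] = " ".join(ids)
    return node_attrs

apply_contexts_recreate({}, ["ssCtx3", "ssCtx4"])  # → {'ss-ctx-id': 'ssCtx4'}
apply_contexts_tunnel({}, ["ssCtx3", "ssCtx4"])    # → {'ss-ctx-id': 'ssCtx3 ssCtx4'}
```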
@joeytakeda I'd forgotten most of that issue, but re-reading it now, it suggests (correct me if I'm wrong) that I should be able to reproduce the problem in 1.4 by creating two contexts which match the same element; one of them would be lost, so searching in that context would fail. If I can do that simply, then I can try the same thing in 2.0, and that should demonstrate that the problem is solved by the new tokenization approach.
That does sort of work -- when I create a third context <context match="blockquote/p[contains-token(@class,'citation')]" label="Blockquote citations"/>, the stem is indeed only associated with two contexts, but they are the two which match precisely on the p element; the ancestor context, the containing blockquote, is the one which is lost.
@joeytakeda Interestingly, my testing suggests that the problem is actually worse in dev. I added the new context (in branch issue-345-dev):
```xml
<context match="blockquote/p[contains-token(@class,'citation')]" label="Blockquote citations"/>
```
Now I would expect that searching for "Amending" with any of the three matching contexts selected would find the hit:
```xml
<context match="blockquote" label="Quotations"/>                                                <!-- ancestor element match -->
<context match="p[contains-token(@class,'citation')]" label="Citations"/>                       <!-- exact match on element -->
<context match="blockquote/p[contains-token(@class,'citation')]" label="Blockquote citations"/> <!-- exact match, but more precise -->
```
In fact, only one context (the first) works. But the issue isn't with context detection itself; it's with the structuring of the JSON in the stem file:
```json
{
  "stem": "amend",
  "instances": [
    {
      "docUri": "clues.html",
      "score": 2,
      "contexts": [
        {
          "form": "Amending",
          "weight": "1",
          "pos": 66,
          "context": "<mark>Amending</mark> pub sign, add in Cook’s vessel (7,5...",
          "fid": "q1",
          "in": [
            "ssCtx1",
            "ssCtx4 ssCtx3"
          ]
        },
        {
          "form": "“amending",
          "weight": "1",
          "pos": 90,
          "context": "...to decode an anagram is provided by the word <mark>“amending</mark>.” The complete clue suggests perhaps the addition of...",
          "fid": "p2"
        }
      ]
    }
  ]
}
```
where ssCtx4 and ssCtx3 form part of a single string. I think that should be an easy fix -- or at least I hope so -- and if it is, then the problem will definitively be solved in version 2.
@joeytakeda The problem could be worked around by doing tokenization around line 433 of json.xsl, or by changing line 87 from this:
```xml
<xsl:accumulator-rule match="*[@ss-ctx-id]" select="($value, @ss-ctx-id)" phase="start"/>
```
to this:
```xml
<xsl:accumulator-rule match="*[@ss-ctx-id]" select="($value, tokenize(@ss-ctx-id, '\s+'))" phase="start"/>
```
(This causes failures in the test suite, but that's probably because of other expected fallout from the change to the config file.)
I'm also unsure whether we would need to apply a similar remedy to line 88 of the file, because I'm not 100% sure what that's doing.
Thanks @martindholmes — that's a good catch and seems right to me.
All tests had passed in dev, so unless there are other changes, I think the problem probably is due to line 88, which does indeed need to be adjusted. It's slightly more complicated, though, because what this rule is doing is popping off the last accumulated value (i.e. it adds a value when the accumulator encounters the opening tag, but removes it when it hits the closing tag).
So this will need to be adjusted a bit to account for multiple values -- what needs to happen, I believe, is that we count how many contexts are declared on that element and then remove that many from the end. So I suppose something like this would do it:
```xml
<xsl:accumulator-rule match="*[@ss-ctx-id]"
  select="$value[position() lt (1 + last() - count(tokenize(@ss-ctx-id)))]" phase="end"/>
```
Or something like this, which I think is equivalent (and perhaps a bit more readable?):
```xml
<xsl:accumulator-rule match="*[@ss-ctx-id]" phase="end"
  select="subsequence($value, 1, count($value) - count(tokenize(@ss-ctx-id)))"/>
```
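To check my reasoning about the paired rules, here is a pure-Python simulation (not the XSLT; function names are mine) of the accumulator as a stack: the start-phase rule pushes every tokenized id from @ss-ctx-id, so the end-phase rule must pop the same number, not just one.

```python
def start_rule(value, ss_ctx_id):
    # Start phase: push all ids declared on this element.
    return value + ss_ctx_id.split()

def end_rule(value, ss_ctx_id):
    # End phase: pop exactly as many ids as this element pushed.
    n = len(ss_ctx_id.split())
    return value[:len(value) - n]

stack = []
stack = start_rule(stack, "ssCtx1")         # entering <blockquote>
stack = start_rule(stack, "ssCtx4 ssCtx3")  # entering <p class="citation">
assert stack == ["ssCtx1", "ssCtx4", "ssCtx3"]
stack = end_rule(stack, "ssCtx4 ssCtx3")    # leaving the <p>
stack = end_rule(stack, "ssCtx1")           # leaving the <blockquote>
assert stack == []                          # balanced, as it must be
```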
Did you test in my branch from dev, issue-345-dev? I was seeing test fails.
I've now just committed the @phase='end' adjustment (016464d), and all of the tests are passing for me locally.
Tests passing for me too. I'll add a specific test addressing this issue.
I believe this is only open because I was intending to add a specific test for this. Given that this is not urgent, I'm going to push this to 2.1 in the hope that we can get 2.0 out soon.