rated-ranking-evaluator
rated-ranking-evaluator copied to clipboard
Escaping colons for Solr - fields erroneously detected
We keep running into failures where Solr thinks we're specifying a field. While we could escape our data to avoid this, it's not trivial and the likely case is that this is happening accidentally for other users as well.
Would it be desirable to escape colons for Solr by default? Would it be more desirable to add a flag to specify that colons should be escaped for Solr?
Happy to put the work in, but not sure what would be the most desirable behaviour.
Hi @AlexAddisonLN thanks for entering this. I'm not sure I got your case. Could you please expand a bit?
Where a query has a colon in it, Solr is detecting it as a field and returning an error. The ratings file entry would look like this:
{
"name": "delivered_document_search",
"description": "Wonderings: a title",
"queries": [
{
"template": "query_template.json",
"placeholders": {
"$query": "Wonderings: a title"
}
}
],
"relevant_documents": {
"3": [
"id1",
"id2",
"id3",
"id4"
]
}
},
and the error looks like this:
[ERROR] Caught Solr exception :: Error from server at http://solr-location:8983/solr: undefined field Wonderings
In our case searches with colons are most likely because that's a document title in our data. It's not clear to me that we should necessarily remove punctuation altogether, as we may at some point have to deal with citations, which can have a wide range of punctuation in them.
While we can do some work to escape punctuation in our rated queries, I was wondering if it would be appropriate to do this escaping in RRE instead, perhaps controlled by a command-line flag or a line in the pom file.
Great, so I agree with you, we need to escape such content.
Thanks again
In that case, want me to work on it and submit a pull request when I think it's ready? I've been warned that it may not be possible or practical to fix it. I've also been told the same issue applies for ElasticSearch, although I don't have any personal experience of that.
I'm also concerned that addressing it may be a breaking change, unless we put it behind a flag of some form.
I'd be in favor of supporting escaping queries in RRE via configuration but default to having this functionality turned off. Lucene supports special characters for specifying various clauses and I think it'll cause headaches for users with such queries if we escape by default.
I see - escaping every string that is injected into the template is not necessarily correct, as the user has freedom to put any string for any purpose in there. For simple purposes, like ours, escaping it is the right thing to do, but someone out there is guaranteed to have all of their stuff broken by this if it's turned on by default, and to make matters worse, it will be much harder for them to debug, because it's the complex cases that would break.
I can see three ways that we could present this change:
- enable by a global flag
- explicit escape flag next to the placeholder (making the JSON more complicated?)
- some separate mechanism for simple queries, leaving placeholders as they are for the more complex cases
I think my favourite would be a global flag to enable placeholder escaping, either in the pom.xml or the ratings file.
Had some more discussion on this and we think a flag for Engine-query mode vs User-query mode would be a good way to define this. The Engine-query mode (default) would not alter the query in any way. The latter User-query mode would escape punctuation. This is pretty much the same as Alex's proposal but makes a formal definition on the use case.