openrefine-wikibase
Queries preprocessing
I wonder whether a little automatic preprocessing could dramatically increase the reconciliation rate. It looks like a single blank space after an apostrophe completely changes the result.
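As an illustration of the kind of preprocessing meant here, a minimal sketch in Python could look like the following. The function name `preprocess_query` is hypothetical and not part of openrefine-wikibase; it only collapses repeated whitespace and removes blanks that directly follow an apostrophe.

```python
import re

# Hypothetical helper, not part of openrefine-wikibase.
_WHITESPACE = re.compile(r"\s+")
_SPACE_AFTER_APOSTROPHE = re.compile(r"'\s+")


def preprocess_query(query: str) -> str:
    """Collapse runs of whitespace and drop blanks right after an apostrophe."""
    # Trim the ends and reduce any run of whitespace to a single space.
    collapsed = _WHITESPACE.sub(" ", query.strip())
    # "d' Anvers" becomes "d'Anvers".
    return _SPACE_AFTER_APOSTROPHE.sub("'", collapsed)


if __name__ == "__main__":
    print(preprocess_query("hôtel de ville d' Anvers"))
```

This is deliberately naive: it would also glue a trailing possessive apostrophe to the next word (for example in English plurals), so it may need to be restricted to specific languages or patterns.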
@ettorerizza This is because of Wikidata's stemmed analyzers for French (see https://phabricator.wikimedia.org/T180169). Before you jump to any conclusions, I would suggest asking Stas on the Wikidata mailing list about the word_break_helper, specifically for this case.
The other thing you will have to be aware of is that the apostrophe is sometimes \u2019 or \u2018:
{
"batchcomplete": "",
"query": {
"wbsearch": [
{
"ns": 0,
"title": "Q23402",
"pageid": 26797,
"displaytext": "D\u2019Orsay"
},
<snip>
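Since a label can be stored with \u2019 while the query is typed with a plain ASCII apostrophe (or the other way round), one possible client-side workaround is to generate both spellings of a query and try each. The sketch below is only an assumption about how that could be done; `apostrophe_variants` is a hypothetical name, and nothing here reflects how the search index itself treats these characters.

```python
# Map the typographic single quotes mentioned above to the ASCII apostrophe.
_TO_ASCII = str.maketrans({"\u2018": "'", "\u2019": "'"})


def apostrophe_variants(query: str) -> list[str]:
    """Return the query spelled with ASCII and with typographic apostrophes."""
    ascii_form = query.translate(_TO_ASCII)
    curly_form = ascii_form.replace("'", "\u2019")
    # dict.fromkeys removes duplicates while keeping the original order,
    # so a query without any apostrophe yields a single variant.
    return list(dict.fromkeys([ascii_form, curly_form]))


if __name__ == "__main__":
    for variant in apostrophe_variants("D\u2019Orsay"):
        print(variant)
```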
I have discussed that with Stas and we just need to collect examples of queries where the search index does not work for us so that they know what to tweak. We can also do some quick fixes on our own of course, but it's better if the changes are made upstream directly.
@wetneb Finally, I think the problem comes from the "search for match" feature. Here is an example of a project containing the term "hôtel de ville d'Anvers" with different spellings. Reconciliation works for all of them. But if I click on "Actions/Discard reconciliation judgments", I can no longer find matches for many of them. It looks like "search for match" is sensitive to case, spaces and apostrophes.
Do you use this API for this feature?
https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=h%C3%B4tel%20de%20ville%20anvers&utf8=&format=json
Yes, search for match uses the same API as the auto-complete dialog of Wikidata's search box, whereas reconciliation uses both that and full-text search (so it is slower: 2 search queries for each reconciliation query). It would be possible to use full-text search for the auto-complete dialog as well, but it would be slower and would add a lot of clutter in many cases.
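To make the difference between the two lookups concrete, here is a rough sketch that only builds the two request URLs. The full-text parameters are taken from the URL quoted above; the auto-complete request is assumed to be the `wbsearchentities` action with a `language` parameter, which is an assumption on my part and not confirmed in this thread. The function names are hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def fulltext_search_url(query: str) -> str:
    """Full-text search, mirroring the parameters of the URL quoted above."""
    params = {"action": "query", "list": "search",
              "srsearch": query, "utf8": "", "format": "json"}
    return API_ENDPOINT + "?" + urlencode(params)


def autocomplete_url(query: str, language: str = "fr") -> str:
    """Auto-complete style lookup (assumed here to be wbsearchentities)."""
    params = {"action": "wbsearchentities", "search": query,
              "language": language, "format": "json"}
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(fulltext_search_url("hôtel de ville anvers"))
    print(autocomplete_url("hôtel de ville anvers"))
```

Comparing the responses of both URLs for the same spelling variants is one way to collect the failing examples mentioned earlier.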
In short: I agree that Wikidata's auto-complete dialog is waaay too sensitive to spelling issues / whitespace / diacritics, and so on - but that's something that should be fixed in Wikidata itself.
Out of scope: those are improvements for Wikidata's own search features.