openrefine-wikibase

Queries preprocessing

Open ettorerizza opened this issue 6 years ago • 5 comments

I wonder whether a little automatic preprocessing could dramatically increase the reconciliation rate. It looks like a single blank space after an apostrophe completely changes the result.

screencast

ettorerizza, Jun 10 '18 15:06

@ettorerizza This is because of Wikidata's stemmed analyzers for French (https://phabricator.wikimedia.org/T180169). Before you jump to any conclusions, I would suggest asking Stas on the Wikidata mailing list about the word_break_helper specifically in this case.

thadguidry, Jun 10 '18 15:06

The other thing you will have to be aware of is that the apostrophe is sometimes \u2019 or \u2018 (typographic quotation marks) rather than a plain ASCII apostrophe:

{
    "batchcomplete": "",
    "query": {
        "wbsearch": [
            {
                "ns": 0,
                "title": "Q23402",
                "pageid": 26797,
                "displaytext": "D\u2019Orsay"
            },
<snip>
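
As a rough illustration of the kind of client-side preprocessing discussed here, the following is a minimal sketch and not something openrefine-wikibase is known to do. The function name `normalize_query` and the set of apostrophe variants are assumptions made for the example. It maps typographic apostrophes to a plain one, removes a stray space after an apostrophe, and collapses repeated whitespace.

```python
import re
import unicodedata

# Hypothetical list of characters to treat as an apostrophe:
# right/left single quotation marks, modifier letter apostrophe,
# backtick and acute accent.
APOSTROPHE_VARIANTS = "\u2019\u2018\u02bc\u0060\u00b4"


def normalize_query(query: str) -> str:
    """Return a query string with unified apostrophes and whitespace."""
    # Compose characters so that accented letters have a single form.
    text = unicodedata.normalize("NFC", query)
    # Replace every apostrophe variant with a plain ASCII apostrophe.
    text = re.sub(f"[{APOSTROPHE_VARIANTS}]", "'", text)
    # Remove whitespace after an apostrophe that sits between two
    # word characters: "d' Anvers" becomes "d'Anvers".
    text = re.sub(r"(?<=\w)'\s+(?=\w)", "'", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_query("hôtel de ville d\u2019 Anvers"))
```

Whether such normalization actually improves matching depends on how the Wikidata search index analyzes the query, so it would need to be checked against real examples.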

thadguidry, Jun 10 '18 15:06

I have discussed this with Stas: we just need to collect examples of queries where the search index does not work for us, so that they know what to tweak. We can of course also apply some quick fixes on our own, but it is better if the changes are made upstream directly.

wetneb, Jun 11 '18 08:06

@wetneb In the end, I think the problem comes from the "search for match" feature. Here is an example of a project containing the term "hôtel de ville d'Anvers" with different spellings. Reconciliation works for all of them. But if I click on "Actions/Discard reconciliation judgments", I can no longer find matches for many of them. It looks like "search for match" is sensitive to case, spaces and apostrophes.

clipboard.openrefine.tar.gz

Do you use this API for this feature?

https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=h%C3%B4tel%20de%20ville%20anvers&utf8=&format=json

ettorerizza, Jun 11 '18 08:06

Yes, "search for match" uses the same API as the auto-complete dialog of Wikidata's search box, whereas reconciliation uses both that and full-text search (so it is slower: two search queries for each reconciliation query). It would be possible to use full-text search for the auto-complete dialog as well, but it would be slower and would add a lot of clutter in many cases.

In short: I agree that Wikidata's auto-complete dialog is waaay too sensitive to spelling issues / whitespace / diacritics, and so on - but that's something that should be fixed in Wikidata itself.
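
To make the difference between the two searches concrete, here is a small sketch that only builds the request URLs and does not send them. It assumes that the auto-complete dialog corresponds to the `wbsearchentities` API module, which is not named in this thread. The full-text URL mirrors the `list=search` example quoted above. The helper names and the default language are made up for the example.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def autocomplete_url(query: str, language: str = "fr") -> str:
    """URL for the entity search assumed to back the auto-complete dialog."""
    params = {
        "action": "wbsearchentities",
        "search": query,
        "language": language,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def fulltext_url(query: str) -> str:
    """URL for the full-text search (list=search), as in the example above."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


query = "hôtel de ville anvers"
print(autocomplete_url(query))
print(fulltext_url(query))
```

Opening both URLs for the same spelling variants is one way to collect the failing examples that need to be reported upstream.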

wetneb, Jun 12 '18 09:06

Out of scope: those are improvements for Wikidata's own search features.

wetneb, Nov 10 '22 19:11