openrefine-wikibase

String search beginning with "Wikidata:" results in strange behavior

Open diegodlh opened this issue 3 years ago • 6 comments

I'm trying to reconcile some entities whose titles begin with "Wikidata:". See, for example, Q18507561, Q27042516, Q21503284, Q52824698.

I'm sending a POST request to the reconciliation API (both the https://wikidata.reconci.link/en/api and my local instances show the same behavior) with this data:

queries={"q0":{"query":<query_string>,"type":"Q386724","type_strict":"should","properties":[]}}

where <query_string> is, for example, "Wikidata: A New Platform for Collaborative Data Collection" (Q27042516).
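For anyone wanting to reproduce this, the request body can be built roughly as follows. This is only a sketch: the helper name `build_reconciliation_body` is made up for illustration, and it assumes the endpoint accepts a form-encoded `queries` field as described above. It only builds the body and does not send anything.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_reconciliation_body(query_string, type_id="Q386724"):
    """Return the form-encoded POST body for a single reconciliation query.

    Hypothetical helper; the payload shape is copied from the report above.
    """
    queries = {
        "q0": {
            "query": query_string,
            "type": type_id,
            "type_strict": "should",
            "properties": [],
        }
    }
    # The whole batch is sent as one JSON string in the "queries" form field.
    return urlencode({"queries": json.dumps(queries)})


body = build_reconciliation_body(
    "Wikidata: A New Platform for Collaborative Data Collection"
)

# Round-trip check: decode the body back into the original query batch.
decoded = json.loads(parse_qs(body)["queries"][0])
```

The resulting `body` can then be POSTed to the reconciliation endpoint with any HTTP client, using the `application/x-www-form-urlencoded` content type.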

This request returns an empty result array:

{
  "q0": {
    "result": []
  }
}

I repeat the request removing the colon after "Wikidata" in the query string (replacing the colon with another character seems to work as well). This time the request returns the expected ID (Q27042516).

Surprisingly, if I repeat the original request now, this time it does return the expected ID too. This seems to be a caching issue. Closing the local instance and starting it again with docker-compose build and docker-compose up does not seem to restart the cache (I'm not familiar with Docker, so I'm not sure what I'm supposed to do to restart it).

I'm not sure if this is specific to query strings beginning with "Wikidata:", but I tried query string "YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia" and had the expected result at first try (i.e., no need to try the colon-free query string).

diegodlh, May 11 '21 19:05

Interesting observation.

Do you think it could be related to Mediawiki namespaces?

I.e. File:, User:, Template:?

antoine2711, May 12 '21 16:05

You are right! That's the problem indeed. It looks to me like a MediaWiki-API bug. I've reported it here.

In short, in the query+search request here, the srnamespace parameter seems to be overridden by srsearch values beginning with "<Namespace>:", where <Namespace> is a valid Wikidata namespace (e.g., Wikidata, User, etc.). As a result, the API returns results from non-Main namespaces (i.e., page titles that are not QIDs).
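To make the request concrete, here is a rough sketch of the kind of query+search call involved. The parameter names (`action`, `list`, `srsearch`, `srnamespace`, `srlimit`, `format`) are standard MediaWiki API parameters, but the helper `build_search_params` and the exact set of parameters are my assumptions, not necessarily what `_srsearch` sends.

```python
from urllib.parse import urlencode


def build_search_params(search_text, namespace=0, limit=10):
    """Build parameters for a MediaWiki full-text search (hypothetical helper).

    namespace=0 is the Main namespace, where Wikidata items (QIDs) live.
    """
    return {
        "action": "query",
        "list": "search",
        "srsearch": search_text,
        "srnamespace": str(namespace),
        "srlimit": str(limit),
        "format": "json",
    }


params = build_search_params(
    "Wikidata: A New Platform for Collaborative Data Collection"
)
# Only builds the URL; nothing is requested here.
url = "https://www.wikidata.org/w/api.php?" + urlencode(params)
```

Even though `srnamespace` is explicitly 0 here, the "Wikidata:" prefix in `srsearch` appears to take precedence, which is the behavior described above.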

These non-QID strings are then used to fetch entities with wbgetentities, which results in an error.

I guess we could include some QID validation step, either at the end of the _srsearch function, or somewhere in the get_items function, but I wonder if all Wikibase instances follow the same /Q[0-9]+/ QID pattern. What do you think?
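Something like the following is what I have in mind for the validation step. It is a minimal sketch: `filter_entity_ids` and its `id_prefix` argument are hypothetical names, and it assumes IDs are the prefix followed by digits only, which may not hold for every Wikibase instance.

```python
import re


def filter_entity_ids(titles, id_prefix="Q"):
    """Keep only titles that look like entity IDs, e.g. "Q27042516".

    Hypothetical helper; id_prefix is an assumed configuration value.
    """
    pattern = re.compile(re.escape(id_prefix) + r"[0-9]+")
    # fullmatch rejects titles such as "Wikidata:Something" or "Q1x".
    return [title for title in titles if pattern.fullmatch(title)]


titles = ["Q27042516", "Wikidata:WikiProject Source MetaData", "Q1x"]
valid_ids = filter_entity_ids(titles)
```

Only the titles that survive the filter would then be passed on to wbgetentities.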

Surprisingly, if I repeat the original request now, this time it does return the expected ID too. This seems to be a caching issue.

The reason why a non-empty array result is returned after trying a query without the colon is probably because a correct result is retrieved from the local cache at this step.

diegodlh, May 13 '21 20:05

@diegodlh : this is more and more interesting.

My first impression here, to fix this issue, would be to prefix the search string with the Main Space of Wikidata (and maybe check first if it's not there already…). That way, we would fix our problem here.

Next thing that this makes me think about is: could I search for properties with this?! ;-) But this is a whole concept in itself, pretty OT of this issue.

For the suggestion of QID validation, I would think this is good, but yes, Wikibase for Wikimedia Commons uses Mids, and I think WD also has Eids and Lids (E for EntitySchema, M for MediaInfo, etc.). Here's the list for WD: Namespaces.

So I guess the check should take that into account, if we could have this tool query those elements.

Are you looking to code these things?

Regards, Antoine

antoine2711, May 13 '21 20:05

prefix the search string with the Main Space of Wikidata (and maybe check first if it's not there already…).

What would that be? I tried "Main:", but it didn't work.

In my code I finally removed the colon from query strings beginning with "some_string:" as a workaround, although adding a space at the beginning of the query string (i.e., " Wikidata: ..." instead of "Wikidata: ...") works as well.
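Roughly, the two workarounds look like this. The function names are made up for illustration, and the sketch assumes the prefix before the colon is a single word (so namespaces containing spaces are not handled).

```python
import re

# A single word followed by a colon at the very start of the query.
_LEADING_PREFIX = re.compile(r"^(\w+):")


def strip_leading_colon(query):
    """Drop the colon after a leading "some_string:" prefix, if present."""
    return _LEADING_PREFIX.sub(r"\1", query, count=1)


def pad_with_space(query):
    """Alternative workaround: prepend a space so no namespace is matched."""
    return " " + query


original = "Wikidata: A New Platform for Collaborative Data Collection"
without_colon = strip_leading_colon(original)
padded = pad_with_space(original)
```

Padding with a space keeps the query text otherwise intact, whereas stripping the colon changes the string that is actually searched for.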

could I search for properties with this?!

I think so, for example here. But you could also set the srnamespace to 120 (which refers to the "Property" namespace in Wikidata).

For the suggestion of QID validation, I would think this is good, but yes, Wikibase for Wikimedia Commons uses Mids

I guess we may have something like a wikibase_id_prefix in the config which defaults to Q for Wikidata.

and I think WD also have Eid, and Lid (E for EntitySchema, M for MediaInfo, etc.)

But those belong to different namespaces, so once a namespace is selected, one should only get QIDs, EIDs, LIDs, etc.

Are you looking to code these things?

Would you agree to wait and see what they say at the MediaWiki-API bug first?

diegodlh, May 13 '21 22:05

prefix the search string with the Main Space of Wikidata (and maybe check first if it's not there already…).

What would that be? I tried "Main:", but it didn't work.

In my code I finally removed the colon from query strings beginning with "some_string:" as a workaround, although adding a space at the beginning of the query string (i.e., " Wikidata: ..." instead of "Wikidata: ...") works as well.

Can you try just adding a colon with no text before it? I.e., :Wikidata: A New Platform for Collaborative Data Collection

could I search for properties with this?!

I think so, for example here. But you could also set the srnamespace to 120 (which refers to the "Property" namespace in Wikidata).

If I had more time, I would do more than just try that now… ;-)

For the suggestion of QID validation, I would think this is good, but yes, Wikibase for Wikimedia Commons uses Mids

I guess we may have something like a wikibase_id_prefix in the config which defaults to Q for Wikidata.

and I think WD also have Eid, and Lid (E for EntitySchema, M for MediaInfo, etc.)

But those belong to different namespaces, so once a namespace is selected, one should only get QIDs, EIDs, LIDs, etc.

Yah.

Are you looking to code these things?

Would you agree to wait and see what they say at the MediaWiki-API bug first?

I'm in no hurry for that. I don't have much time to code. But I do know a few things I would change or add. ;-)

Regards, Antoine

antoine2711, May 13 '21 23:05

Can you try just adding a colon with no text before it? I.e., :Wikidata: A New Platform for Collaborative Data Collection

Yes, I tried that and it also works. But I think it works not because it is using the "" namespace, but because ":Wikidata" doesn't match any namespace, so the srnamespace parameter doesn't get overridden. The same thing happens with " Wikidata", which doesn't match any namespace either. You can easily try any combination here.

diegodlh, May 14 '21 00:05

This is a problem on the Wikibase side, not in this repository.

wetneb, Nov 10 '22 19:11