
Related work: Wikidata "subsetting" collaborations

Open danbri opened this issue 2 years ago • 3 comments

Hi @neelguha! this is neat. It is close to the concerns of some in the Wikidata community around "subsetting", so I've linked it from the Tools section in https://www.wikidata.org/wiki/Wikidata:WikiProject_Schemas/Subsetting#Tools_and_Data

One of the reasons folk are interested in Wikidata subsets is that the full dataset can be too large to work with comfortably, so pulling out just the bits most relevant to some application is appealing. There's also a desire to encourage offsite usage of the data so that the load on query.wikidata.org remains manageable as the project and datasets grow. In both cases, tools like yours seem relevant, although the problem of characterising what goes in the subset can be tricky.

danbri avatar Sep 21 '21 17:09 danbri

Hi @danbri! Thanks for the note. It's lovely to hear that you're a fan, and thanks for sharing it on the wikidata page.

We actually created this while working on Bootleg, for much the same reasons you've cited. We frequently wanted to pull triples for a certain entity, or find all entities that had a certain property (e.g. the alias "Lincoln"). Using query.wikidata.org for this quickly became inefficient.
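The "find all entities with a given alias" use case can be sketched directly against the raw Wikidata JSON dump (one JSON entity per line inside a giant array), without going through query.wikidata.org. This is a minimal illustration of the idea, not simple-wikidata-db's actual API; the record fields (`labels`, `aliases`, per-language `value` objects) follow the real dump schema, but the helper name and sample data are mine:

```python
import json

def entities_with_alias(lines, alias, lang="en"):
    """Yield IDs of entities whose label or aliases in `lang` match `alias`."""
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # the full dump is wrapped in one giant JSON array
        entity = json.loads(line)
        names = [a["value"] for a in entity.get("aliases", {}).get(lang, [])]
        label = entity.get("labels", {}).get(lang, {}).get("value")
        if label:
            names.append(label)
        if alias in names:
            yield entity["id"]

# Tiny inline sample mimicking two dump records:
sample = [
    '{"id": "Q91", "labels": {"en": {"value": "Abraham Lincoln"}},'
    ' "aliases": {"en": [{"value": "Lincoln"}]}}',
    '{"id": "Q42", "labels": {"en": {"value": "Douglas Adams"}}, "aliases": {}}',
]
print(list(entities_with_alias(sample, "Lincoln")))  # ['Q91']
```

In practice you'd stream the (compressed) dump rather than hold it in memory, which is essentially what preprocessing tools like this one do once, so later lookups don't rescan the whole file.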

Please let me know if there are any updates or features you think could make this more useful to a larger community!

neelguha avatar Sep 21 '21 18:09 neelguha

There's talk of having more meetings around subsetting; I'll let you know if anything comes of it.

Oh, if you dig around in https://github.com/google/schemarama/tree/main/kgx you'll find SPARQL queries that pull out some lifescience-related pieces of Wikidata (intern work that I should finish releasing as open source!). The goal we were pursuing there was to extract from Wikidata only those entities/relationships corresponding to Figure 1 in https://elifesciences.org/articles/52614 . This shouldn't be rocket science but turns out to be fiddly: the data dumps are huge and unwieldy, as you note, and the official SPARQL endpoint is heavily loaded. Some related work, https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits/, might help there, but it's definitely still fiddly!
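The shape of those subsetting queries can be sketched as a small query builder: pick a class, list the properties you care about, and emit a SPARQL query for all instances. This is a hypothetical helper for illustration, not code from schemarama/kgx; the identifiers used (Q8054 "protein", P352 "UniProt protein ID", P703 "found in taxon") are real Wikidata IDs in the life-science area:

```python
def subset_query(class_qid, prop_ids, limit=1000):
    """Build a SPARQL query pulling selected properties for all
    instances (wdt:P31) of a given Wikidata class."""
    clauses = "\n  ".join(
        f"OPTIONAL {{ ?item wdt:{p} ?{p.lower()} . }}" for p in prop_ids
    )
    select_vars = " ".join(f"?{p.lower()}" for p in prop_ids)
    return (
        f"SELECT ?item {select_vars} WHERE {{\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        f"  {clauses}\n"
        f"}} LIMIT {limit}"
    )

# e.g. proteins with their UniProt IDs and host taxa:
query = subset_query("Q8054", ["P352", "P703"])
print(query)
```

Running such queries against query.wikidata.org for a large class is exactly where the endpoint's load and timeout limits bite, hence the appeal of the self-hosted query service in the addshore post above.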

danbri avatar Sep 23 '21 15:09 danbri

Interesting! I'll poke around -- thanks for the pointer!

neelguha avatar Sep 23 '21 15:09 neelguha