Load SDC dump to Wikidata endpoint
Have you considered supporting SDC (Structured Data on Commons) yet? It is the Wikibase instance for metadata about images in Wikimedia Commons, and it uses Wikidata items and properties as its vocabulary. In practice, it expands Wikidata with more images and depiction information.
The support might be as easy as loading the SDC dump into the Wikidata endpoint. Alternatively, there could be a separate SDC endpoint, but it would also need to contain (a subset of) Wikidata.
The RDF dumps are available here: https://dumps.wikimedia.org/other/wikibase/commonswiki/
More on SDC: https://commons.wikimedia.org/wiki/Commons:Structured_data
EDIT: Documentation of the triples in the dump: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#MediaInfo and https://www.mediawiki.org/wiki/Extension:WikibaseMediaInfo/RDF_mapping
@tuukka Thanks for the suggestion. I am downloading it right now (that takes a few hours) and will build a QLever instance for it (that will take a few more hours). Looking forward to what's in there, especially since it appears to be quite big (37 GB bz2-compressed).
Do you know why the WCQS does not have unauthenticated access like the WDQS does?
And can you provide one or two useful example queries?
> Do you know why the WCQS does not have unauthenticated access like the WDQS does?
As I understand it: performance reasons - WMF is unwilling to provide more endpoints while there is no solution to the performance needs of the regular Wikidata Query Service either.
> And can you provide one or two useful example queries?
Here are some WCQS example queries from the community: https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/queries/examples
To start with a comparison, in Wikidata, you get image(s) of Douglas Adams like this:
```sparql
SELECT ?image {
  wd:Q42 wdt:P18 ?image . # Douglas Adams
}
```
The result is e.g. image http://commons.wikimedia.org/wiki/Special:FilePath/Douglas%20adams%20portrait%20cropped.jpg
In SDC, you can get the image above and all other images depicting Douglas Adams like this:
```sparql
SELECT ?file ?image {
  ?file wdt:P180 wd:Q42 . # depicts: Douglas Adams
  ?file schema:url ?image .
}
```
And the result is e.g. file https://commons.wikimedia.org/entity/M10031710 with image http://commons.wikimedia.org/wiki/Special:FilePath/Douglas%20adams%20portrait%20cropped.jpg, i.e. the same image URL as above.
Combining ontology information from Wikidata, you can query e.g. all quality images depicting any hummingbird species: [original source]
```sparql
SELECT ?file ?image {
  ?species wdt:P171/wdt:P171* wd:Q43624 . # parent taxon: hummingbird
  ?file wdt:P180 ?species .               # depicts
  ?file wdt:P6731 wd:Q63348069 .          # Commons quality assessment: Commons quality image
  ?file schema:url ?image .
}
```
The instance is up and running now (it took < 2 h to build it). Here are links to your two example queries:
https://qlever.cs.uni-freiburg.de/wikimedia-commons/4TOZwl
https://qlever.cs.uni-freiburg.de/wikimedia-commons/MyAdzj
Wow, thank you! I didn't realise you have already implemented federated queries with the `SERVICE` keyword too.
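For concreteness, such a federated query might look roughly like this (a minimal sketch; the exact Wikidata endpoint URL in the SERVICE clause is an assumption, modeled on the /api/wikimedia-commons URL mentioned later in this thread):

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
SELECT ?file ?image WHERE {
  # evaluated on the Commons endpoint: files and their image URLs
  ?file wdt:P180 ?species .
  ?file schema:url ?image .
  # evaluated on the Wikidata endpoint: restrict ?species to hummingbirds
  SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata> {
    ?species wdt:P171/wdt:P171* wd:Q43624 .
  }
}
```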
I've now let some people at the Wikimedia Hackathon know about this (unfortunately I couldn't attend myself); this can be very valuable to everyone building tools for SDC.
The holy grail application of this would be faceted search. Do you have any tips regarding that? I found `ql:has-predicate`; is that what we should build on top of? And you wouldn't happen to have a UI similar to this already? :grin: https://github.com/joseignm/GraFa
What I have so far:
- Property counts: https://qlever.cs.uni-freiburg.de/wikidata/XBe4M8
- Object counts: https://qlever.cs.uni-freiburg.de/wikidata/zu5gUm
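A property-count query of this kind, built on `ql:has-predicate`, might look roughly as follows (a sketch, not necessarily the exact query behind the links above; the ql: prefix is the one used in the QLever documentation):

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX ql: <http://qlever.cs.uni-freiburg.de/builtin-functions/>
SELECT ?predicate (COUNT(?item) AS ?count) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q838948 .  # works of art ...
  ?item wdt:P18 ?image .                # ... that have an image
  ?item ql:has-predicate ?predicate .   # which predicates occur on them?
}
GROUP BY ?predicate
ORDER BY DESC(?count)
```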
Isn't the context-sensitive autocompletion of the QLever UI doing this (and much more)?
For example, if you go to https://qlever.cs.uni-freiburg.de/wikidata you can:
- Type `S` and hit `Return` to get the `SELECT * WHERE { ... }` query template.
- Type a variable name, for example `subject`.
- Type any prefix of `instance of` (or any other alias of wdt:P31) and select `wdt:P31/wdt:P279*` from the list of suggestions.
- Type the prefix of any class (for example, `per` for Person) and select from the list of suggestions.
- Execute the query.
You can incrementally construct arbitrary queries that way.
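Following those steps produces a query along these lines (a sketch; wd:Q5, person, stands in for whichever class is selected in the last step):

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT * WHERE {
  ?subject wdt:P31/wdt:P279* wd:Q5 .  # instance of person, or of a subclass
}
```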
PS: You can also take your query and extend it with a prefix filter, like so (prefix filters are very efficient in QLever):
https://qlever.cs.uni-freiburg.de/wikidata/nN9IDv
```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?object (SAMPLE(?object_label) AS ?label) (COUNT(?object) AS ?count) WHERE {
  ?item wdt:P18 ?image .
  ?item wdt:P31/wdt:P279* wd:Q838948 .
  ?item wdt:P180 ?object .
  ?object rdfs:label ?object_label .
  FILTER (LANG(?object_label) = "en") .
  FILTER REGEX(STR(?object_label), "^per")
}
GROUP BY ?object ?object_label
ORDER BY DESC(?count)
```
> Isn't the context-sensitive autocompletion of the QLever UI doing this (and much more)?
We mostly have in mind end users who don't understand SPARQL :grin: But yes, the functionality is more or less there in the current editor; that's how I figured it should be feasible. Although I don't see the counts for the autocompletion candidates in the UI :thinking:
> PS: You can also take your query and extend it with a prefix filter
Good point. I think I need to add a `LIMIT` so as not to have too many options client-side, and instead use a server-side prefix filter like that.
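For example, the query above with the number of suggestions capped server-side (a sketch):

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?object (SAMPLE(?object_label) AS ?label) (COUNT(?object) AS ?count) WHERE {
  ?item wdt:P18 ?image .
  ?item wdt:P31/wdt:P279* wd:Q838948 .
  ?item wdt:P180 ?object .
  ?object rdfs:label ?object_label .
  FILTER (LANG(?object_label) = "en")
  FILTER REGEX(STR(?object_label), "^per")
}
GROUP BY ?object ?object_label
ORDER BY DESC(?count)
LIMIT 20  # send only the top 20 suggestions to the client
```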
I think I'll first add simple, non-faceted depictions to Wikidocumentaries though, e.g. here (earlier based only on Wikidata and text search of Commons): https://wikidocumentaries-demo.wmcloud.org/wikipedia/en/Hummingbird?language=en
Can you explain the use case for the faceted search a bit more? What is it that users ultimately want when, for example, they type `per` in order to find human (Q5), or `hum` to find hummingbird (Q43624)?
Is the goal just to find the right QID? You can also do that with the search box at the top right of https://www.wikidata.org , right? In our Information Retrieval lecture, we have an exercise where the goal is to build a version of this with fuzzy search (that is, you can make mistakes). Here is a demo: https://qlever.cs.uni-freiburg.de/wikidata-entity-search (for example, type `huming`).
If that is not the primary goal, what are the subsequent steps?
In addition to making better query builders, I have in mind using faceted search as a powerful tool for exploring big collections (museums, archives, shops) with potentially spotty metadata.
My example queries above come from the hackathon participants' test case of exploring works of art that have photos available. The property counts show there are 800k such works in Wikidata alone, so I can't go through them one by one. But the counts also give me the idea that I could filter by e.g. collection, location, author, material, or what is depicted. Or also by e.g. color, but then I wouldn't get that many results. Say I want to filter by what is depicted: next, I can see that I could add a filter to see e.g. 14k portraits or 4k horses. I can continue until I (hopefully) find what I want, perhaps after backtracking a few times.
There will be some complications: in the hummingbird case, for example, I want to filter by a property path instead of a direct property (see the sketch below). But e.g. the Wikidata Query Builder seems to have some knowledge of typical property paths to use ("Include related values in the search").
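Such a path-based facet might look roughly like this (a sketch combining the depicts property with the transitive parent-taxon chain from the hummingbird query above):

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
SELECT ?file ?image WHERE {
  # match files that depict a hummingbird directly, or any species
  # whose parent-taxon chain leads to hummingbird
  ?file wdt:P180/wdt:P171* wd:Q43624 .
  ?file schema:url ?image .
}
```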
Interesting, thanks.
As a matter of fact, our older UIs were all faceted-search UIs, for example: https://broccoli.cs.uni-freiburg.de . You can start with a class (for example, `Person` or `Written Work`) and then refine from there using the facets (select a subclass, select an instance, add a property, refine via co-occurring words). Is that something of the kind you imagine?
Such UIs are easier to use, but limited in the kind of queries you can ask of the data. That's why we eventually developed QLever. UI-wise, the idea was that it can be useful in two ways:
- You can use it to incrementally construct arbitrary SPARQL queries, as done in the QLever UI. That is very powerful, but asking too much of some users.
- You can, with little effort, build a special-purpose UI on top of the API. This requires that the suggestions can themselves be computed very efficiently via SPARQL queries, which is the case for QLever (other SPARQL engines are not good at this kind of query).
I knew you were working on the cutting edge, but Broccoli must've been 10 years ahead of its time!
Regarding your second point, do you happen to have an example/code of such a special-purpose UI?
We now have a first, very limited but testable implementation of faceted browsing in Wikidocumentaries; see "Depictions from Wikimedia Commons" e.g. here: https://wikidocumentaries-demo.wmcloud.org/wikipedia/en/Birds?language=en
Our code is available here, and any feedback is welcome, especially on how to improve the SPARQL queries: https://github.com/Wikidocumentaries/wikidocumentaries-ui/blob/master/src/components/topic_page/DepictingImages.vue
Quick question: Queries like https://qlever.cs.uni-freiburg.de/wikimedia-commons/MdUKbU are coming from you, right?
I am asking because for some reason the contained `SERVICE` queries all take 5 + epsilon seconds and we don't yet know why (they should be much faster, since the respective queries to the Wikidata instance are fast).
If you want to ask more such queries, I would for now (until we resolve that problem) simply build a joint index for Wikidata and Wikimedia Commons.
And out of curiosity: who is generating the traffic, is it you via tests or is it actual users?
Sorry for the delay, for some reason I didn't get a notification about your message.
> Quick question: Queries like https://qlever.cs.uni-freiburg.de/wikimedia-commons/MdUKbU are coming from you, right?
Right.
> I am asking because for some reason the contained `SERVICE` queries all take 5 + epsilon seconds and we don't yet know why (they should be much faster, since the respective queries to the Wikidata instance are fast).
Good to know. Something that might matter is that I currently make multiple parallel requests; perhaps they interact badly?
> If you want to ask more such queries, I would for now (until we resolve that problem) simply build a joint index for Wikidata and Wikimedia Commons.
I wanted to ask about that anyway, since the need for SERVICE makes some query features more difficult to write and some perhaps impossible: if I want to ask for the ql:has-predicate of something inside a SERVICE clause, can it take into account the restrictions caused by triples outside of the SERVICE clause?
Also, if preferable and the necessary scripts / instructions are available, I may be able to set up an instance on a Wikimedia Cloud VPS.
> And out of curiosity: who is generating the traffic, is it you via tests or is it actual users?
Probably both, and also bots. If you have User-Agent logs, you should be able to tell apart Googlebot, my dev environment (on Linux Firefox), and actual users.
Also, if you can log the Origin header of the requests, you will see which queries come from the deployed version at https://wikidocumentaries-demo.wmcloud.org/
And let me know if I should limit the number and/or complexity of the requests I'm sending.
Quick update: We have found (already yesterday) the reason why the SERVICE queries always took "5 + epsilon" seconds. The respective QLever backend ran inside a Docker container, and it so happened that Docker containers on that particular machine had a five-second latency for any network-related request (probably due to problems with DNS lookup). As a quick fix, the backend now runs outside of Docker, and the SERVICE queries are as fast as they should be, for example: https://qlever.cs.uni-freiburg.de/wikimedia-commons/fwdZ1M
I'm trying to finish a "version 0.9" of the UI, but I'm getting a lot of nondeterministic `400` out-of-memory responses. I had a look at the query analyzer and two things popped out:
- Even if there are few items (and files), `?file (wdt:P180|wdt:P921|wdt:P6243|wdt:P195|wdt:P608) ?item` is slow. It seems it would be faster to do `?file ?p ?item` and filter afterwards (see the sketch below). Is this expected?
- Even if there are few files (and images), `?file schema:url ?image` is slow:

```
INDEX SCAN ?file <url> ?image
  Cols: ?file, ?image
  Size: 87,160,263 x 2 [~ 87,160,263]
  Time: 137ms [~ 87,160,263]
```
Full query: https://qlever.cs.uni-freiburg.de/wikimedia-commons/dGYCdH
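The rewrite suggested in the first point might look like this (a sketch; whether it is actually faster is exactly the open question):

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?file ?item WHERE {
  ?file ?p ?item .
  # filter afterwards, instead of the union
  # (wdt:P180|wdt:P921|wdt:P6243|wdt:P195|wdt:P608)
  FILTER (?p IN (wdt:P180, wdt:P921, wdt:P6243, wdt:P195, wdt:P608))
}
```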
And more in general, are there any new thoughts regarding this topic (faceted browsing of SDC+Wikidata, and my implementation of it) from your side?
I am currently traveling and will try to look at it tonight or tomorrow. Maybe @joka921 can say something about the out of memory responses?
Thank you @hannahbast!
After I wrote my previous message, the SPARQL endpoint went down and now only responds `503 Service Unavailable`: https://qlever.cs.uni-freiburg.de/api/wikimedia-commons
I am sorry, I don't know what happened, but the endpoint is now up again. More tomorrow
@tuukka Thanks for your feedback. It seems like your faceted system issues a lot of queries that follow a similar template and only have a small variable part (that is very typical for applications where some kind of frontend internally issues SPARQL queries). The easiest solution would be to identify the building blocks of your queries that
- Are part of (almost) every query your system issues and
- Are comparatively expensive to compute.
Given the example queries earlier in this thread, it seems like this could be the case for the complete `schema:url` predicate and for the union `(wdt:P180|wdt:P921|wdt:P6243|wdt:P195|wdt:P608)`.
We could then precompute these building blocks and pin them to our subtree cache; they would then not have to be computed from scratch for every query that uses them. By default we pin, for example, the predicates for English labels, as they occur in almost every query.
Additionally, we could in general try to perform some query engineering (reformulating queries in a way that is equivalent but cheaper to compute; query planning is a hard problem, and sometimes we can help the system).
If you can identify such parts that occur in many of your requests and point them out, then we can try to pin some of them to the cache and see whether this helps your system.
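For illustration, the two building blocks named above, written as standalone queries whose results could be precomputed and pinned (a sketch; how pinning is actually triggered is an operational detail on the QLever side):

```sparql
PREFIX schema: <http://schema.org/>
# building block 1: the complete schema:url predicate
SELECT ?file ?image WHERE {
  ?file schema:url ?image .
}
```

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
# building block 2: the union of the five depicts-like properties
SELECT ?file ?item WHERE {
  ?file (wdt:P180|wdt:P921|wdt:P6243|wdt:P195|wdt:P608) ?item .
}
```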