scholia
Test Scholia queries on other SPARQL endpoints
Is your feature request related to a problem? Please describe.
- Scholia uses the Wikidata Query Service to run SPARQL queries over the Wikidata corpus.
- The Wikidata Query Service uses Blazegraph as the backend for providing the SPARQL endpoint.
- Blazegraph is not designed for graphs much larger than about 100 million items, which is about the size of the current Wikidata.
- An evaluation of Blazegraph alternatives for Wikidata is ongoing, with no clear timeline towards a solution.
Describe the solution you'd like
I'd like us to explore running Scholia on other SPARQL endpoints, Blazegraph or otherwise. We have done some of this in the past, but not in a way that would scale across all Scholia queries.
Describe alternatives you've considered
A relatively straightforward approach might be to build a workflow based on running Scholia via the SPARQL endpoint (default: Blazegraph again) of a dedicated Wikibase instance that holds a copy of a recent Wikidata dump. There could even be several such Wikibases, each serving a specific subset (e.g. per Scholia aspect).
Additional context
Other options would be to start exploring non-Blazegraph endpoints, e.g. https://wikidata.demo.openlinksw.com/sparql (running on Virtuoso) or https://qlever.cs.uni-freiburg.de/wikidata/ (running on QLever).
- #1721
I just created a simplified version of one of our queries - country_authors.sparql
SELECT
?author
(COUNT(DISTINCT ?citing_work) AS ?number_of_citing_works)
(SAMPLE(?organization_) AS ?organization)
(SAMPLE(?work) AS ?example_work)
WHERE {
?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q35 .
?work wdt:P50 ?author .
OPTIONAL { ?citing_work wdt:P2860 ?work . }
OPTIONAL {
?author wdt:P1416 | wdt:P108 ?organization_ .
?organization_ wdt:P17 wd:Q35 .
}
}
GROUP BY ?author
It times out on Wikidata, fails on QLever and executes on that Virtuoso instance.
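To repeat this kind of comparison for more queries, something like the following might do as a starting point. It is only a sketch and assumes that each endpoint accepts a standard SPARQL protocol GET request and can return SPARQL JSON results. The Virtuoso URL is the one mentioned above; the WDQS and QLever API URLs are my assumptions and are not taken from this thread. The names `ENDPOINTS`, `PREFIXES`, `build_request`, `count_rows`, `run_query` and `compare` are made up for illustration. Since WDQS predeclares `wd:` and `wdt:` and other endpoints may not, the sketch prepends them explicitly.

```python
import json
import time
import urllib.parse
import urllib.request

# Only the Virtuoso URL appears in this thread; the other two are assumptions.
ENDPOINTS = {
    "wdqs": "https://query.wikidata.org/sparql",
    "qlever": "https://qlever.cs.uni-freiburg.de/api/wikidata",
    "virtuoso": "https://wikidata.demo.openlinksw.com/sparql",
}

# WDQS predeclares these prefixes; other endpoints may not, so prepend them.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
)


def build_request(endpoint_url, query):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    params = urllib.parse.urlencode({"query": PREFIXES + query})
    return urllib.request.Request(
        endpoint_url + "?" + params,
        headers={
            "Accept": "application/sparql-results+json",
            # Hypothetical user agent string; public endpoints often expect one.
            "User-Agent": "scholia-endpoint-test/0.1",
        },
    )


def count_rows(payload):
    """Count the result rows in a SPARQL JSON results document."""
    return len(json.loads(payload)["results"]["bindings"])


def run_query(endpoint_url, query, timeout=120):
    """Return (status, row_count, seconds) for one query on one endpoint."""
    start = time.monotonic()
    try:
        request = build_request(endpoint_url, query)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            rows = count_rows(response.read().decode("utf-8"))
        return "ok", rows, time.monotonic() - start
    except Exception as exc:  # timeouts, HTTP errors, unparsable responses
        return "failed: " + type(exc).__name__, None, time.monotonic() - start


def compare(query, endpoints=ENDPOINTS):
    """Run the same query on every endpoint and print one line per endpoint."""
    for name, url in endpoints.items():
        status, rows, seconds = run_query(url, query)
        print(f"{name}: {status}, rows={rows}, {seconds:.1f}s")
```

Calling `compare(...)` with the query text above would print one status line per endpoint. Looping over all the `.sparql` files in Scholia would be the obvious next step, but the templated queries would need their parameters filled in first.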



The query runs successfully on some of our endpoints:
date;sparqlquery -qn authorsCitingWork -en blazegraph -f github;date
- Blazegraph 2018 instance (13 secs for ~786 results): Fr 22. Jul 13:41:13 CEST 2022 - Fr 22. Jul 13:41:26 CEST 2022
- Jena 2020 instance (runtime not yet known, ~10117 results): started Fr 22. Jul 13:39:32 CEST 2022, still running via the command line; will report later
- Stardog 2022 instance (108 secs for ~14266 results): Fr 22. Jul 13:37:10 CEST 2022 - Fr 22. Jul 13:39:02 CEST 2022
see https://github.com/ad-freiburg/qlever/issues/859
Virtuoso-on-AWS: https://wikidata.demo.openlinksw.com/sparql
(Does not support the Wikidata blazegraph functions)
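To get a rough idea of how many Scholia queries would be affected by this, one could scan the query files for constructs that are specific to Blazegraph or to the Wikidata Query Service. The sketch below is just that, a sketch: the pattern list is illustrative and not exhaustive, the names `NON_PORTABLE_PATTERNS` and `find_non_portable` are made up, and a plain regex scan can give false positives (e.g. inside comments or string literals).

```python
import re

# Constructs known to be specific to Blazegraph or the Wikidata Query Service.
# Illustrative only; this list is an assumption and certainly incomplete.
NON_PORTABLE_PATTERNS = {
    "label service": r"SERVICE\s+wikibase:label",
    "MediaWiki API service": r"SERVICE\s+wikibase:mwapi",
    "geo search service": r"SERVICE\s+wikibase:(around|box)",
    "Blazegraph bd: namespace": r"\bbd:\w+",
    "Blazegraph query hint": r"\bhint:\w+",
    "Blazegraph GAS service": r"\bgas:\w+",
}


def find_non_portable(query):
    """Return the sorted names of endpoint-specific constructs found in a query."""
    return sorted(
        name
        for name, pattern in NON_PORTABLE_PATTERNS.items()
        if re.search(pattern, query, flags=re.IGNORECASE)
    )
```

Queries for which `find_non_portable` returns an empty list, such as the simplified country_authors query earlier in this thread, would be the first candidates to try on Virtuoso or QLever unchanged.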
https://ceur-ws.org/Vol-3262/paper9.pdf and https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData have a list of candidates. I also intend to talk to the Wikidata team at the next meeting. I would love to have a proper Blazegraph mirror running on our RWTH Aachen i5 machine http://wikidata.dbis.rwth-aachen.de/, which should be suitable for the task with 256 GB RAM and 10 TB SSD. In the past 6 years that I have been trying to get my own copy of Wikidata running, I never managed to get a proper Blazegraph mirror endpoint running with all the necessary special services.
Oh, you're in Aachen?