Test Scholia queries on other SPARQL endpoints

Open Daniel-Mietchen opened this issue 3 years ago • 2 comments

Is your feature request related to a problem? Please describe.

  • Scholia uses the Wikidata Query Service to run SPARQL queries over the Wikidata corpus.
  • The Wikidata Query Service uses Blazegraph as the backend for providing the SPARQL endpoint.
  • Blazegraph is not designed for graphs much larger than about 100 million items, which is about the current size of Wikidata.
  • An evaluation of Blazegraph alternatives for Wikidata is ongoing, with no clear timeline towards a solution.

Describe the solution you'd like

I'd like us to explore running Scholia on other SPARQL endpoints, Blazegraph or otherwise. We have done some of this in the past, but not in a way that would scale across all Scholia queries.

Describe alternatives you've considered

A relatively straightforward approach might be to build a workflow based on running Scholia via the SPARQL endpoint (default: Blazegraph again) of a dedicated Wikibase instance that holds a copy of a recent Wikidata dump. There could even be several such Wikibases, each serving a specific subset (e.g. per Scholia aspect).

Additional context

Other options would be to start exploring non-Blazegraph endpoints, e.g. https://wikidata.demo.openlinksw.com/sparql (running on Virtuoso) or https://qlever.cs.uni-freiburg.de/wikidata/ (running on QLever).

  • #1721
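
To make the idea of testing across endpoints a bit more concrete, here is a minimal Python sketch that sends one query to several endpoints and records whether it succeeded, how many rows came back and how long it took. It is only an illustration, not how Scholia currently works. Assumptions: the names `ENDPOINTS`, `fetch_bindings` and `compare_endpoints` are hypothetical, the QLever API URL is my guess at the machine-readable counterpart of the UI link above, and the `wd:`/`wdt:` prefixes are declared explicitly because an endpoint other than the Wikidata Query Service may not predefine them.

```python
import json
import time
import urllib.parse
import urllib.request

# Declared explicitly: non-WDQS endpoints may not predefine these prefixes.
PREFIXES = """\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
"""

ENDPOINTS = {
    "wdqs": "https://query.wikidata.org/sparql",
    "virtuoso": "https://wikidata.demo.openlinksw.com/sparql",
    # Assumption: API URL behind the QLever UI linked in this issue.
    "qlever": "https://qlever.cs.uni-freiburg.de/api/wikidata",
}


def fetch_bindings(endpoint_url, query, timeout=60):
    """Send the query via HTTP GET and return the JSON result bindings."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            # Illustrative user agent; replace with a real contact string.
            "User-Agent": "scholia-endpoint-test/0.1 (illustrative)",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


def compare_endpoints(query, endpoints, fetch=fetch_bindings):
    """Run one query on every endpoint; collect status, row count and time."""
    report = {}
    for name, url in endpoints.items():
        start = time.perf_counter()
        try:
            rows = fetch(url, PREFIXES + query)
            report[name] = {"ok": True, "rows": len(rows), "error": None}
        except Exception as error:  # timeouts, HTTP errors, malformed JSON
            report[name] = {"ok": False, "rows": None, "error": str(error)}
        report[name]["seconds"] = time.perf_counter() - start
    return report
```

Calling `compare_endpoints(query_text, ENDPOINTS)` for each Scholia `.sparql` file would give a per-endpoint table of successes and failures. The `fetch` parameter is injectable so the loop can be exercised without network access.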

Daniel-Mietchen avatar Jul 21 '22 22:07 Daniel-Mietchen

I just created a simplified version of one of our queries - country_authors.sparql

SELECT
?author 
(COUNT(DISTINCT ?citing_work) AS ?number_of_citing_works)
(SAMPLE(?organization_) AS ?organization)
(SAMPLE(?work) AS ?example_work)
WHERE {
  ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q35 .
  ?work wdt:P50 ?author .
  OPTIONAL { ?citing_work wdt:P2860 ?work . }
  OPTIONAL {
    ?author wdt:P1416 | wdt:P108 ?organization_ .
    ?organization_ wdt:P17 wd:Q35 .
  }
}
GROUP BY ?author 

It times out on Wikidata, fails on QLever and executes on that Virtuoso instance. (Screenshots of the three results were attached.)

Daniel-Mietchen avatar Jul 21 '22 22:07 Daniel-Mietchen

The query runs successfully on some of our endpoints:

date;sparqlquery -qn authorsCitingWork -en blazegraph -f github;date
  • blazegraph 2018 instance (13 secs for ~786 results) Fr 22. Jul 13:41:13 CEST 2022 Fr 22. Jul 13:41:26 CEST 2022
  • jena 2020 instance ( for ~10117 results) Fr 22. Jul 13:39:32 CEST 2022 - still running via command line will report later
  • stardog 2022 instance (108 secs for ~14266 results) Fr 22. Jul 13:37:10 CEST 2022 - Fr 22. Jul 13:39:02 CEST 2022

WolfgangFahl avatar Jul 22 '22 11:07 WolfgangFahl

see https://github.com/ad-freiburg/qlever/issues/859

WolfgangFahl avatar Jan 10 '23 04:01 WolfgangFahl

Virtuoso-on-AWS: https://wikidata.demo.openlinksw.com/sparql

(Does not support the Wikidata Blazegraph functions)
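
Since missing Blazegraph functions are a likely blocker, a rough way to triage Scholia queries is to flag those that use constructs specific to the Wikidata Query Service before trying them elsewhere. The sketch below is a plain regex scan, not a SPARQL parser, so it can give false positives (e.g. matches inside comments or string literals). The function name and the pattern list are my own and probably incomplete.

```python
import re

# Constructs that are specific to Blazegraph / the Wikidata Query Service.
# Assumption: this list is illustrative, not exhaustive.
WDQS_ONLY_PATTERNS = {
    "label service": r"SERVICE\s+wikibase:label",
    "MWAPI service": r"SERVICE\s+wikibase:mwapi",
    "geo search services": r"SERVICE\s+wikibase:(around|box)",
    "Blazegraph service parameters": r"\bbd:\w+",
    "Blazegraph query hints": r"\bhint:\w+",
    "Blazegraph GAS service": r"\bgas:\w+",
}


def wdqs_specific_features(query):
    """Return the sorted names of WDQS-specific constructs found in a query."""
    return sorted(
        name
        for name, pattern in WDQS_ONLY_PATTERNS.items()
        if re.search(pattern, query, flags=re.IGNORECASE)
    )
```

A query for which this returns an empty list, such as the simplified country_authors query earlier in this thread, is a reasonable first candidate for testing on Virtuoso, QLever or other engines.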

egonw avatar Mar 10 '23 05:03 egonw

https://ceur-ws.org/Vol-3262/paper9.pdf and https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData have a list of candidates. I also intend to talk to the Wikidata team at the next meeting and would love to have a proper Blazegraph mirror running on our RWTH Aachen i5 machine (http://wikidata.dbis.rwth-aachen.de/), which should be suitable for the task with 256 GB RAM and 10 TB SSD. In the past 6 years of trying to get my own copy of Wikidata running, I have never managed to set up a proper Blazegraph mirror endpoint with all the necessary special services.

WolfgangFahl avatar Mar 10 '23 07:03 WolfgangFahl

Oh, you're in Aachen?

egonw avatar Mar 10 '23 09:03 egonw