
Adaptation of sparql_dataframe to Wikidata

Open lbocken opened this issue 2 years ago • 7 comments

Hello,

I am trying to extract dataframes from queries in Wikidata.

For instance, this code from an example in Wikidata works to extract a dictionary of countries:

```python
# pip install sparqlwrapper
# https://rdflib.github.io/sparqlwrapper/
import sys

import sparql_dataframe
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"

query = """#Countries
SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}"""

def get_results(endpoint_url, query):
    # TODO adjust user agent; see https://w.wiki/CX6
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

results = get_results(endpoint_url, query)

for result in results["results"]["bindings"]:
    print(result)
```

But when I do this: `df = sparql_dataframe.get(endpoint_url, query)`

I receive this error:

```
C:\ProgramData\Anaconda3\lib\site-packages\SPARQLWrapper\Wrapper.py:1315: RuntimeWarning: Format requested was CSV, but XML (application/sparql-results+xml;charset=utf-8) has been returned by the endpoint
  warnings.warn(message % (requested.upper(), format_name, mime), RuntimeWarning)

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 df = sparql_dataframe.get(endpoint_url, query)

C:\ProgramData\Anaconda3\lib\site-packages\sparql_dataframe\sparql_dataframe.py in get_sparql_dataframe(endpoint, query, post)
     28     sparql.setReturnFormat(CSV)
     29     results = sparql.query().convert()
---> 30     _csv = StringIO(results.decode('utf-8'))
     31     return pd.read_csv(_csv, sep=",")

AttributeError: 'Document' object has no attribute 'decode'
```
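As an aside, the JSON results already returned by `get_results` above can be flattened into a DataFrame directly, without going through the CSV return format at all. A minimal sketch (`bindings_to_df` is a hypothetical helper, not part of either library):

```python
import pandas as pd

def bindings_to_df(results):
    """Flatten a SPARQL JSON results dict (as returned by
    SPARQLWrapper with JSON format) into a pandas DataFrame,
    keeping only the plain 'value' of each binding."""
    rows = [
        {var: binding[var]["value"] for var in binding}
        for binding in results["results"]["bindings"]
    ]
    return pd.DataFrame(rows)
```

Usage would then be `df = bindings_to_df(get_results(endpoint_url, query))`.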

lbocken avatar Jul 26 '21 16:07 lbocken

Hello,

Try passing post=True. E.g.:

sparql_dataframe.get(endpoint_url, query, post=True)

You can see in the unit tests that queries against Wikidata should work fine with post=True: https://github.com/lawlesst/sparql-dataframe/blob/master/tests/test_sparql_dataframe.py#L65

lawlesst avatar Jul 26 '21 17:07 lawlesst


```
HTTPError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 df = sparql_dataframe.get(endpoint_url, query, post=True)
      2 df

C:\ProgramData\Anaconda3\lib\site-packages\sparql_dataframe\sparql_dataframe.py in get_sparql_dataframe(endpoint, query, post)
     27
     28     sparql.setReturnFormat(CSV)
---> 29     results = sparql.query().convert()
     30     _csv = StringIO(results.decode('utf-8'))
     31     return pd.read_csv(_csv, sep=",")

C:\ProgramData\Anaconda3\lib\site-packages\SPARQLWrapper\Wrapper.py in query(self)
   1105         :rtype: :class:`QueryResult` instance
   1106         """
-> 1107         return QueryResult(self._query())
   1108
   1109     def queryAndConvert(self):

C:\ProgramData\Anaconda3\lib\site-packages\SPARQLWrapper\Wrapper.py in _query(self)
   1085                 raise EndPointInternalError(e.read())
   1086             else:
-> 1087                 raise e
   1088
   1089     def query(self):

C:\ProgramData\Anaconda3\lib\site-packages\SPARQLWrapper\Wrapper.py in _query(self)
   1071                 response = urlopener(request, timeout=self.timeout)
   1072             else:
-> 1073                 response = urlopener(request)
   1074             return response, self.returnFormat
   1075         except urllib.error.HTTPError as e:

C:\ProgramData\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223
    224 def install_opener(opener):

C:\ProgramData\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532
    533         return response

C:\ProgramData\Anaconda3\lib\urllib\request.py in http_response(self, request, response)
    638         # request was successfully received, understood, and accepted.
    639         if not (200 <= code < 300):
--> 640             response = self.parent.error(
    641                 'http', request, response, code, msg, hdrs)
    642

C:\ProgramData\Anaconda3\lib\urllib\request.py in error(self, proto, *args)
    567             if http_err:
    568                 args = (dict, 'default', 'http_error_default') + orig_args
--> 569                 return self._call_chain(*args)
    570
    571     # XXX probably also want an abstract factory that knows when it makes

C:\ProgramData\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    500         for handler in handlers:
    501             func = getattr(handler, meth_name)
--> 502             result = func(*args)
    503             if result is not None:
    504                 return result

C:\ProgramData\Anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden
```

lbocken avatar Jul 26 '21 20:07 lbocken

I think that's an error returned by the actual Wikidata SPARQL endpoint. It aggressively rate limits.
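For what it's worth, the Wikimedia endpoints also reject requests carrying a generic user agent (that is what the `# TODO adjust user agent; see https://w.wiki/CX6` note in the original snippet is about), so a 403 can sometimes be avoided by sending a descriptive User-Agent. `sparql_dataframe.get(endpoint, query, post)` does not expose an agent parameter, but the idea can be sketched with just the standard library (`build_wikidata_request` is a hypothetical helper and the agent string is a placeholder you should replace with your own contact info):

```python
import urllib.parse
import urllib.request

def build_wikidata_request(query, agent="example-bot/0.1 (mailto:you@example.org)"):
    """Build a GET request for the Wikidata SPARQL endpoint with a
    descriptive User-Agent header, as the Wikimedia User-Agent
    policy expects. The request is only built here, not sent."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    url = "https://query.wikidata.org/sparql?" + params
    return urllib.request.Request(url, headers={"User-Agent": agent})

req = build_wikidata_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
```

The response from `urllib.request.urlopen(req)` could then be parsed with `json.load` and flattened into a DataFrame.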

lawlesst avatar Jul 26 '21 20:07 lawlesst

> Hello,
>
> Try passing post=True. E.g.:
>
> sparql_dataframe.get(endpoint_url, query, post=True)
>
> You can see in the unit tests that queries against Wikidata should work fine with post=True: https://github.com/lawlesst/sparql-dataframe/blob/master/tests/test_sparql_dataframe.py#L65

How would you read a query saved in a separate file? Thanks for your help!

lbockenrs avatar Nov 22 '21 16:11 lbockenrs

If your queries are saved in a text file, you would just read them in like any other text file in Python and save the contents to a `query` variable that you would pass to `sparql_dataframe.get`.

Here's a tutorial on reading and writing files in Python: https://realpython.com/read-write-files-python/#reading-and-writing-opened-files

lawlesst avatar Nov 22 '21 17:11 lawlesst

This works:

```python
import sparql_dataframe

endpoint_url = "https://query.wikidata.org/sparql"

with open('query.rq', 'r') as file:
    query = file.read()

df = sparql_dataframe.get(endpoint_url, query, post=True)
df
```

lbockenrs avatar Nov 22 '21 18:11 lbockenrs

Just had the same issue querying Wikidata. At first I thought it might be caused by a version change (SPARQLWrapper was installed in version 2.0.0). SPARQLWrapper now ships its own `get_sparql_dataframe`, so the code below was successful.

Nevertheless, thanks for creating this lib, which made it directly into the wrapper!

```python
from SPARQLWrapper import get_sparql_dataframe

endpoint = "https://query.wikidata.org/sparql"

query = """#Countries
SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
df = get_sparql_dataframe(endpoint, query)
```

hbruch avatar Feb 26 '23 21:02 hbruch