SPARQLWrapper does not work for `CONSTRUCT` and `DESCRIBE` queries on the UniProt SPARQL endpoint, which runs on Virtuoso
When running any `CONSTRUCT` or `DESCRIBE` query against the UniProt SPARQL endpoint https://sparql.uniprot.org/sparql/, SPARQLWrapper fails to return the results, whatever return format is requested (XML, Turtle).
Code to reproduce:
When asking for XML at least an error is thrown:
from SPARQLWrapper import TURTLE, XML, SPARQLWrapper
query = """PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
CONSTRUCT
{
?protein a up:HumanProtein .
}
WHERE
{
?protein a up:Protein .
?protein up:organism taxon:9606
} LIMIT 10"""
sparql_endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql/")
sparql_endpoint.setReturnFormat(XML)
sparql_endpoint.setQuery(query)
results = sparql_endpoint.query().convert()
print(results)
Error message:
ExpatError Traceback (most recent call last)
Cell In[8], line 20
17 # sparql_endpoint.setReturnFormat(TURTLE)
18 sparql_endpoint.setQuery(query)
---> 20 results = sparql_endpoint.query().convert()
21 print(results)
File ~/dev/.venv/lib/python3.10/site-packages/SPARQLWrapper/Wrapper.py:1190, in QueryResult.convert(self)
1188 if _content_type_in_list(ct, _SPARQL_XML):
1189 _validate_format("XML", [XML], ct, self.requestedFormat)
-> 1190 return self._convertXML()
1191 elif _content_type_in_list(ct, _XML):
1192 _validate_format("XML", [XML], ct, self.requestedFormat)
File ~/dev/.venv/lib/python3.10/site-packages/SPARQLWrapper/Wrapper.py:1073, in QueryResult._convertXML(self)
1065 def _convertXML(self) -> Document:
1066 """
1067 Convert an XML result into a Python dom tree. This method can be overwritten in a
1068 subclass for a different conversion method.
(...)
1071 :rtype: :class:`xml.dom.minidom.Document`
1072 """
-> 1073 doc = parse(self.response)
1074 rdoc = cast(Document, doc)
...
--> 211 parser.Parse(b"", True)
212 except ParseEscape:
213 pass
ExpatError: no element found: line 1, column 0
When asking for turtle, SPARQLWrapper does not even throw an error:
from SPARQLWrapper import TURTLE, XML, SPARQLWrapper
query = """PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
CONSTRUCT
{
?protein a up:HumanProtein .
}
WHERE
{
?protein a up:Protein .
?protein up:organism taxon:9606
} LIMIT 10"""
sparql_endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql/")
# sparql_endpoint.setReturnFormat(XML)
sparql_endpoint.setReturnFormat(TURTLE)
sparql_endpoint.setQuery(query)
results = sparql_endpoint.query().convert()
print(results)
Printing results gives HTML: b'<!DOCTYPE html SYSTEM "about:legacy-compat">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>UniProt</title>......
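Since the Turtle request fails silently, a small guard can catch this case before the bytes are handed to an RDF parser. The helper below is only a sketch: the name `looks_like_html` is hypothetical (it is not part of SPARQLWrapper), the Turtle sample is made up for illustration, and the check only inspects the first bytes of the payload.

```python
def looks_like_html(payload: bytes) -> bool:
    """Return True if the payload starts like an HTML page rather than RDF."""
    # Only look at the beginning of the body, ignoring leading whitespace.
    head = payload.lstrip()[:200].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Start of the body UniProt returned in the Turtle example above.
html_response = b'<!DOCTYPE html SYSTEM "about:legacy-compat">\n<html xmlns="http://www.w3.org/1999/xhtml">'
# Illustrative Turtle payload (an assumption, not actual endpoint output).
turtle_response = b"@prefix up: <http://purl.uniprot.org/core/> .\n"

print(looks_like_html(html_response))
print(looks_like_html(turtle_response))
```

Calling it on the result of `.convert()` before parsing would turn the silent failure into an explicit one.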
UniProt uses OpenLink Virtuoso and supports the SPARQL 1.1 standard.
Sending the same query with requests and a plain `Accept` header just works, so the problem appears to come from what SPARQLWrapper does internally:
import requests
from rdflib import Graph
query = """PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
CONSTRUCT
{
?protein a up:HumanProtein .
}
WHERE
{
?protein a up:Protein .
?protein up:organism taxon:9606
} LIMIT 10"""
response = requests.post(
    "https://sparql.uniprot.org/sparql/",
    headers={
        "Accept": "text/turtle"
    },
    data={
        "query": query
    },
    timeout=60,
)
response.raise_for_status()
g = Graph()
g.parse(data=response.text, format="turtle")
print(response.text)
print(len(g))
As a bonus, basic features like the timeout work! (The `.setTimeout()` option of SPARQLWrapper does not work at all, at least for the UniProt endpoint, but that belongs in a separate issue.)
UniProt is not pure Virtuoso: it has some middleware that expects the `Accept` header to ask for an RDF format when using `DESCRIBE` and/or `CONSTRUCT`.
@JervenBolleman SPARQLWrapper also fails to run `SELECT` queries against SwissLipids (https://beta.sparql.swisslipids.org/).
The server answers with "Error 500 Internal Server Error: The server was not able to handle your request.":
from SPARQLWrapper import XML, SPARQLWrapper, JSON
query = """PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?comment ?query
WHERE
{
?sq a sh:SPARQLExecutable ;
rdfs:label|rdfs:comment ?comment ;
sh:select|sh:ask|sh:construct|sh:describe ?query .
}"""
sparql_endpoint = SPARQLWrapper("https://beta.sparql.swisslipids.org/")
sparql_endpoint.setReturnFormat(XML)
sparql_endpoint.setTimeout(60)
sparql_endpoint.setQuery(query)
results = sparql_endpoint.query().convert()
print(results)
With requests it works:
import requests
query = """PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?comment ?query
WHERE
{
?sq a sh:SPARQLExecutable ;
rdfs:label|rdfs:comment ?comment ;
sh:select|sh:ask|sh:construct|sh:describe ?query .
}"""
response = requests.post(
    "https://beta.sparql.swisslipids.org/",
    headers={
        "Accept": "application/json",
        "User-agent": "sparqlwrapper 2.0.1a0 (rdflib.github.io/sparqlwrapper)"
    },
    data={
        "query": query
    },
    timeout=60,
)
try:
    response.raise_for_status()
    print(response.json())
except requests.exceptions.HTTPError as e:
    # Show both the HTTP error and the body the server sent back.
    print(e)
    print(response.text)
@JervenBolleman I have found out why: it is due to the query parameters SPARQLWrapper adds by default, `&format=xml&output=xml&results=xml`, and `&format=xml` in particular.
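To make the difference concrete, here is a minimal sketch of the two request shapes, built with the standard library only and never sent. It is not SPARQLWrapper's actual code: `build_request` is a hypothetical helper, the query is a placeholder, and the `application/rdf+xml` Accept value is an assumption chosen for illustration. Only the three extra parameters come from the observation above.

```python
from typing import Dict, Optional
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://sparql.uniprot.org/sparql/"
QUERY = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 1"  # placeholder query


def build_request(
    endpoint: str,
    query: str,
    accept: str,
    extra_params: Optional[Dict[str, str]] = None,
) -> Request:
    """Build (but do not send) a GET request for a SPARQL query."""
    params = {"query": query}
    params.update(extra_params or {})
    return Request(f"{endpoint}?{urlencode(params)}", headers={"Accept": accept})


# Content negotiation plus the extra query parameters mentioned above.
with_params = build_request(
    ENDPOINT, QUERY, "application/rdf+xml",
    {"format": "xml", "output": "xml", "results": "xml"},
)
# Content negotiation only: the Accept header alone states the wanted format.
conneg_only = build_request(ENDPOINT, QUERY, "application/rdf+xml")

print(with_params.full_url)
print(conneg_only.full_url)
```

The second request is the shape that the plain requests examples in this thread use.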
I have checked the SPARQLWrapper tests: for every known endpoint, using only content negotiation always works, whereas content negotiation plus query parameters fails on many endpoints:
- Stardog
- Wikidata (Blazegraph), AGROVOC (AllegroGraph) and Fuseki2, when the `CONSTRUCT` return format is Turtle, JSON-LD or N3 (yes, it depends on what you ask for, which makes it even trickier to use)
So basically, the default value of the `onlyConneg` setting (conneg = content negotiation) makes SPARQLWrapper fail inconsistently on about half of the public endpoints it is tested on. And the cherry on the cake: it is not even documented in the README (and with such a cryptic name, I did not know what it was about before I looked into the code).
Meanwhile, switching `onlyConneg` to true by default would just make it work out of the box.
I know that changing defaults might be a breaking change for some people (we can bump to v3), but at some point it is needed to have a usable piece of software. It really does not make sense to keep adding query parameters when content negotiation is much more reliable and elegant.
I have pushed some fixes and improvements to the tests, so they now run. But the repository does not seem to be maintained anymore.
I'll create and publish a fork if this does not evolve. It would be nice to have a decent library to run SPARQL queries from Python. Right now the defaults are broken, the documentation lacks the most important settings (there are not many settings, but they are still missing from the docs), and the parameter names are not even understandable (`onlyConneg`...).