sparqlwrapper
Convert query results into a pandas DataFrame
I need to convert the results of a SPARQL query into a pandas DataFrame. Here is the function I have written for this:
import pandas as pds
import itertools

def make_sparql_df(results):
    df = pds.DataFrame()  # create empty dataframe
    for var in itertools.chain(results.variables):
        # create a list of values
        # getValues returns a list of 'Value' objects
        # e.g., [Value(literal:'x'), Value(x:'y'), ...]
        temp = [val.value for val in results.getValues(var)]
        # convert temp into a Series; this is needed in case
        # temp is an empty list, as using a Series will fill
        # the column with NaN
        df[var] = pds.Series(temp)
    return df
Is there a more efficient way to do this?
There was a bug in my code above. Below is an updated version for those who are interested:
from SPARQLWrapper import SPARQLWrapper, SPARQLWrapper2, JSON, JSONLD, CSV, TSV, N3, RDF, RDFXML, TURTLE
import pandas as pds
import itertools

def get_sparql_variables(results, sparql_wrapper="SPARQLWrapper2"):
    return results.variables if ("sparqlwrapper2" == sparql_wrapper.lower()) else results['head']['vars']

def get_sparql_bindings(results, sparql_wrapper="SPARQLWrapper2"):
    return results.bindings \
        if ("sparqlwrapper2" == sparql_wrapper.lower()) else results['results']['bindings']

def get_sparql_binding_variable_value(binding, variable, sparql_wrapper="SPARQLWrapper2"):
    return binding[variable].value \
        if ("sparqlwrapper2" == sparql_wrapper.lower()) else binding[variable]['value']

def make_sparql_dict_list(bindings, variables, sparql_wrapper="SPARQLWrapper2"):
    def binding_value(binding, var):  # helper function for returning values
        return \
            get_sparql_binding_variable_value(binding, var, sparql_wrapper) if (var in binding) else None

    dict_list = []  # list to contain dictionaries
    for binding in itertools.chain(bindings):
        values = [binding_value(binding, var) for var in itertools.chain(variables)]
        dict_list.append(dict(zip(variables, values)))
    return dict_list

def make_sparql_df(results, sparql_wrapper="SPARQLWrapper2"):
    variables = get_sparql_variables(results, sparql_wrapper)
    bindings = get_sparql_bindings(results, sparql_wrapper)
    # create a list of dictionaries to use as data for dataframe
    data_list = make_sparql_dict_list(bindings, variables, sparql_wrapper)
    df = pds.DataFrame(data_list)  # create dataframe from data list
    return df[variables]  # return dataframe with columns reordered
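For anyone who wants to try this end to end, here is an untested usage sketch; the endpoint and query are placeholders, not something prescribed by the code above:

from SPARQLWrapper import SPARQLWrapper2

sparql = SPARQLWrapper2("https://dbpedia.org/sparql")  # placeholder endpoint
sparql.setQuery("""
    SELECT ?label
    WHERE { <http://dbpedia.org/resource/Asturias> rdfs:label ?label }
    LIMIT 5
""")
results = sparql.query()  # SPARQLWrapper2 result exposing .variables and .bindings

df = make_sparql_df(results)  # sparql_wrapper defaults to "SPARQLWrapper2"
print(df.head())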
Hi @wdduncan
Let me share some pointers to developments around pandas and SPARQLWrapper that might be interesting for you:
https://github.com/lawlesst/sparql-dataframe
https://lawlesst.github.io/notebook/sparql-dataframe.html
https://github.com/SuLab/sparql_to_pandas/blob/master/SPARQL_pandas.ipynb
https://github.com/RDFLib/sparqlwrapper/pull/112
@dayures Thanks for these pointers! I will make use of these!
@dayures I was looking at this notebook: https://github.com/SuLab/sparql_to_pandas/blob/master/SPARQL_pandas.ipynb
A straightforward approach for putting the SPARQL results into a data frame is:
df = pds.DataFrame(result['results']['bindings'])
df = df.applymap(lambda x: x['value'])
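Here result is assumed to be the parsed JSON returned by plain SPARQLWrapper; a minimal untested sketch with a placeholder endpoint would be:

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pds

sparql = SPARQLWrapper("https://dbpedia.org/sparql")  # placeholder endpoint
sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
sparql.setReturnFormat(JSON)
result = sparql.query().convert()  # plain dict parsed from the JSON response

df = pds.DataFrame(result['results']['bindings'])
# note: this raises on NaN cells if a variable is unbound in some rows,
# and newer pandas deprecates applymap in favour of DataFrame.map
df = df.applymap(lambda x: x['value'])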
@wdduncan what type is result in your answer? I am looking to convert an rdflib.plugins.sparql.processor.SPARQLResult object to a pandas DataFrame in rdflib version 5.0.0.
Here are some things I've tried that work for different use cases:
import pandas as pd
# renders properly in notebooks, but DataFrame values are rdflib objects rather than python primitives
pd.DataFrame(results.bindings)
# converts everything to strings including missing values
pd.DataFrame(results.bindings).applymap(str).rename(columns=str)
# serialize with json and then parse
import json
results_json = results.serialize(format="json")
bindings = json.loads(results_json)["results"]["bindings"]
bindings = [{k: v["value"] for k, v in result.items()} for result in bindings]
pd.DataFrame(bindings)
It seems like consuming results from SPARQL queries into a Python workflow with native Python types is a common operation for rdflib. Am I missing something in terms of an easier way to do this?
Also see this implementation in gastrodon.
@dhimmel Good question! I didn't consider that :)
I've been handling data type coercion on the client side. Do the results from SPARQLWrapper include data type information? I forget ...
You may be able to let Pandas infer the data types. I haven't tried this; see the sketch below.
As for RDFLib's native data types, does that only work when you use RDFLib's triple store engine?
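For example, something along these lines might work (an untested sketch; the sample frame is a stand-in for the string values extracted from the bindings):

import pandas as pd

# stand-in for a string-valued frame built from SPARQL bindings
df = pd.DataFrame({"count": ["1", "2"], "label": ["a", "b"]})

# try numeric coercion column by column; non-numeric columns pass through
# unchanged (note errors='ignore' is deprecated in recent pandas releases)
df = df.apply(pd.to_numeric, errors="ignore")
print(df.dtypes)  # count becomes int64, label stays object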
My simple solution so far:
from pandas import DataFrame
from rdflib import Graph, URIRef
graph = Graph()
…
result = graph.query("select * {?s ?p ?o} limit 10")
DataFrame(result, columns=result.vars)
I think this is not technically an issue for sparqlwrapper; it could be some additional PandasWrapper within the RDFLib collection of projects, if someone is willing to maintain it.
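Filled out so it can actually be run, the sketch might look like this; the single Turtle triple is a made-up stand-in for the elided loading step:

from pandas import DataFrame
from rdflib import Graph

graph = Graph()
# made-up data standing in for however the graph is actually populated
graph.parse(
    data="<http://example.org/a> <http://example.org/p> <http://example.org/b> .",
    format="turtle",
)

result = graph.query("select * {?s ?p ?o} limit 10")
df = DataFrame(result, columns=result.vars)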
Thanks @white-gecko
That is a simpler solution!
Regarding the simple Graph-to-DataFrame solution above: noting my reply at https://github.com/RDFLib/rdflib/issues/1179#issuecomment-704261776.
I realized the original issue here was about converting a SPARQLWrapper.Wrapper.QueryResult object to a DataFrame. Whereas my question is about converting a rdflib.plugins.sparql.processor.SPARQLResult to a DataFrame. Sorry about any confusion and let me know if these are actually the same task.
Note that the @white-gecko solution does not deal with the data types in literals in any way; it's not wrong, just something to note. In order to at least have the column names be strings, I had to do something like this:
columns = [str(v) for v in result.vars]
df = pd.DataFrame(result, columns=columns)
But this will not take care of the values in the cells (for which you may want to call .toPython()).
(Don't use str(v); better to use v.toPython().)
Thanks @matentzn. .toPython() was the missing piece. Here's a function that converts an rdflib.plugins.sparql.processor.SPARQLResult to a DataFrame, converting types along the way:
from pandas import DataFrame
from rdflib.plugins.sparql.processor import SPARQLResult
def sparql_results_to_df(results: SPARQLResult) -> DataFrame:
    """
    Export results from an rdflib SPARQL query into a `pandas.DataFrame`,
    using Python types. See https://github.com/RDFLib/rdflib/issues/1179.
    """
    return DataFrame(
        data=([None if x is None else x.toPython() for x in row] for row in results),
        columns=[str(x) for x in results.vars],
    )

sparql_results_to_df(results)
Another option would be to call .toPython() on the column names, but that prefixes them with ?, which I'm guessing is not what most users want.
Note that the @white-gecko solution does not deal with the data types in literals in any way; it's not wrong, just something to note.
Yes, I imagine almost all users will want to convert types when creating a DataFrame. Also noting that method is equivalent (besides column order preservation) to the first method in my comment above, which also explains why it's deficient:
# renders properly in notebooks, but DataFrame values are rdflib objects rather than python primitives
pd.DataFrame(results.bindings)
See https://github.com/WolfgangFahl/pyLoDStorage/issues/25 - just use pyLodStorage and you'd be all set.
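If I read the pyLodStorage API correctly (the module path and method name below are taken from its documentation and should be treated as assumptions), usage is roughly:

from lodstorage.sparql import SPARQL
import pandas as pd

# placeholder endpoint; queryAsListOfDicts returns rows as plain Python dicts
sparql = SPARQL("https://query.wikidata.org/sparql")
lod = sparql.queryAsListOfDicts("SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5")
df = pd.DataFrame(lod)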
How can I remove the namespace from the SPARQL result? The encoding method is not working.
@Gautamshahi - please give more details ... Did you use pyLodStorage, or are you still referring to the open issue in sparqlwrapper? What's your input, what is your query, and what did you try? Since you asked a question, Stack Overflow might be a better place for it: see https://stackoverflow.com/questions/tagged/sparqlwrapper