
convert query results into pandas dataframe

Open wdduncan opened this issue 7 years ago • 16 comments

I need to convert the results of a SPARQL query into a pandas DataFrame. Here is the function I have written for this:

import pandas as pds

def make_sparql_df(results):
    df = pds.DataFrame() # create empty dataframe

    for var in results.variables:
        # create a list of values
        # getValues returns a list of 'Value' objects
        # e.g., [Value(literal:'x'), Value(x:'y'), ...]
        temp = [val.value
                for val in results.getValues(var)]

        # convert temp into a Series; this is needed in case
        # temp is an empty list, since a Series fills the
        # column with NaN
        df[var] = pds.Series(temp)

    return df

Is there a more efficient way to do this?

wdduncan avatar Oct 14 '18 02:10 wdduncan

There was a bug in my code above. Below is an updated version for those who are interested:

from SPARQLWrapper import SPARQLWrapper, SPARQLWrapper2, JSON
import pandas as pds


def get_sparql_variables(results, sparql_wrapper="SPARQLWrapper2"):
    return results.variables if ("sparqlwrapper2" == sparql_wrapper.lower()) else results['head']['vars']


def get_sparql_bindings(results, sparql_wrapper="SPARQLWrapper2"):
    return results.bindings \
        if ("sparqlwrapper2" == sparql_wrapper.lower()) else results['results']['bindings']


def get_sparql_binding_variable_value(binding, variable, sparql_wrapper="SPARQLWrapper2"):
    return binding[variable].value \
        if ("sparqlwrapper2" == sparql_wrapper.lower()) else binding[variable]['value']


def make_sparql_dict_list(bindings, variables, sparql_wrapper="SPARQLWrapper2"):
    def binding_value(binding, var):  # helper function for returning values
        return get_sparql_binding_variable_value(binding, var, sparql_wrapper) if (var in binding) else None

    dict_list = []  # list to contain dictionaries
    for binding in bindings:
        values = [binding_value(binding, var) for var in variables]
        dict_list.append(dict(zip(variables, values)))

    return dict_list


def make_sparql_df(results, sparql_wrapper="SPARQLWrapper2"):
    variables = get_sparql_variables(results, sparql_wrapper)
    bindings = get_sparql_bindings(results, sparql_wrapper)

    # create a list of dictionaries to use as data for dataframe
    data_list = make_sparql_dict_list(bindings, variables, sparql_wrapper)

    df = pds.DataFrame(data_list) # create dataframe from data list
    return df[variables] # return dataframe with columns reordered
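
For reference, the JSON-dict path of the helpers above can be exercised without a live endpoint. The snippet below is a condensed, self-contained sketch of the same conversion, run on a hand-built result dict in the shape SPARQLWrapper returns for JSON format (the data is fabricated sample data):

```python
import pandas as pds

# Fabricated result in the JSON shape SPARQLWrapper produces;
# a missing binding (here 'age' for Bob) becomes None, then NaN in the frame.
mock_results = {
    "head": {"vars": ["name", "age"]},
    "results": {"bindings": [
        {"name": {"type": "literal", "value": "Alice"},
         "age": {"type": "literal", "value": "30"}},
        {"name": {"type": "literal", "value": "Bob"}},
    ]},
}

variables = mock_results["head"]["vars"]
bindings = mock_results["results"]["bindings"]
rows = [{var: binding[var]["value"] if var in binding else None
         for var in variables}
        for binding in bindings]

df = pds.DataFrame(rows)[variables]  # reorder columns to query order
print(df)
```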

wdduncan avatar Nov 06 '18 03:11 wdduncan

Hi @wdduncan

Let me share some pointers to developments around pandas and SPARQLWrapper, that maybe could be interesting for you:

https://github.com/lawlesst/sparql-dataframe
https://lawlesst.github.io/notebook/sparql-dataframe.html
https://github.com/SuLab/sparql_to_pandas/blob/master/SPARQL_pandas.ipynb

https://github.com/RDFLib/sparqlwrapper/pull/112

dayures avatar Apr 18 '19 10:04 dayures

@dayures Thanks for these pointers! I will make use of these!

wdduncan avatar Apr 19 '19 15:04 wdduncan

@dayures I was looking at this notebook: https://github.com/SuLab/sparql_to_pandas/blob/master/SPARQL_pandas.ipynb

A straightforward approach for putting the SPARQL results into a data frame is:

df = pds.DataFrame(result['results']['bindings'])
df = df.applymap(lambda x: x['value'])
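
A minimal sketch of that two-liner on fabricated bindings. Note that `applymap` assumes every row binds every variable (a missing binding yields NaN, and the lambda would then fail), and that newer pandas versions rename `applymap` to `DataFrame.map`:

```python
import pandas as pds

# Fabricated bindings in SPARQL JSON shape; every variable bound in every row.
result = {"results": {"bindings": [
    {"s": {"type": "uri", "value": "http://example.org/a"},
     "o": {"type": "literal", "value": "1"}},
    {"s": {"type": "uri", "value": "http://example.org/b"},
     "o": {"type": "literal", "value": "2"}},
]}}

df = pds.DataFrame(result["results"]["bindings"])
df = df.applymap(lambda x: x["value"])  # DataFrame.map in pandas >= 2.1
print(df)
```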

wdduncan avatar Apr 19 '19 17:04 wdduncan

@wdduncan what type is result in your answer? I am looking to convert an rdflib.plugins.sparql.processor.SPARQLResult object to a pandas DataFrame in rdflib version 5.0.0.

Here are some things I've tried that work for different use cases:

import pandas as pd

# renders properly in notebooks, but DataFrame values are rdflib objects rather than Python primitives
pd.DataFrame(results.bindings)

# converts everything to strings including missing values
pd.DataFrame(results.bindings).applymap(str).rename(columns=str)

# serialize with json and then parse
import json
results_json = results.serialize(format="json")
bindings = json.loads(results_json)["results"]["bindings"]
bindings = [{k: v["value"] for k, v in result.items()} for result in bindings]
pd.DataFrame(bindings)
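
The serialize-then-parse route can be sketched without rdflib by starting from a fabricated JSON string in the W3C SPARQL results format:

```python
import json
import pandas as pd

# Fabricated JSON in the "SPARQL 1.1 Query Results JSON Format"
results_json = """
{"head": {"vars": ["s", "o"]},
 "results": {"bindings": [
   {"s": {"type": "uri", "value": "http://example.org/a"},
    "o": {"type": "literal", "value": "1"}}
 ]}}
"""

bindings = json.loads(results_json)["results"]["bindings"]
bindings = [{k: v["value"] for k, v in row.items()} for row in bindings]
df = pd.DataFrame(bindings)
print(df)
```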

It seems like consuming results from SPARQL queries into a Python workflow with native Python types is a common operation for rdflib. Am I missing something in terms of an easier way to do this?

Also see this implementation in gastrodon.

dhimmel avatar Sep 28 '20 20:09 dhimmel

@dhimmel Good question! I didn't consider that :)
I've been handling data type coercion on the client side. Do the results from SPARQLWrapper include data type information? I forget ... You may be able to let pandas infer the data types; I haven't tried this.

As for RDFLib's native data types, does that only work when you use RDFLib's triple store engine?

wdduncan avatar Sep 28 '20 21:09 wdduncan

My simple solution so far:

from pandas import DataFrame
from rdflib import Graph, URIRef

graph = Graph()
…
result = graph.query("select * {?s ?p ?o} limit 10")

DataFrame(result, columns=result.vars)
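
Since an rdflib `SPARQLResult` iterates as rows of terms, the pattern above can be sketched without a live graph by substituting plain tuples for result rows and strings for `result.vars` (hypothetical sample data):

```python
from pandas import DataFrame

# Stand-ins: rdflib would yield ResultRow tuples and Variable column names.
rows = [
    ("http://example.org/a", "http://example.org/p", "x"),
    ("http://example.org/b", "http://example.org/p", "y"),
]
vars_ = ["s", "p", "o"]

df = DataFrame(rows, columns=vars_)
print(df)
```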

I think this is not technically an issue of sparqlwrapper itself, but it could be an additional PandasWrapper within the RDFLib collection of projects, if someone is willing to maintain it.

white-gecko avatar Oct 06 '20 08:10 white-gecko

Thanks @white-gecko
That is a simpler solution!

wdduncan avatar Oct 06 '20 11:10 wdduncan

My simple solution so far

Noting my reply at https://github.com/RDFLib/rdflib/issues/1179#issuecomment-704261776.

I realized the original issue here was about converting a SPARQLWrapper.Wrapper.QueryResult object to a DataFrame. Whereas my question is about converting a rdflib.plugins.sparql.processor.SPARQLResult to a DataFrame. Sorry about any confusion and let me know if these are actually the same task.

dhimmel avatar Oct 06 '20 13:10 dhimmel

Note that the @white-gecko solution does not deal with the data types in literals in any way; it's not wrong, just something to note. In order to at least have the columns be strings, I had to do something like this:

columns = [str(v) for v in result.vars]
df = pd.DataFrame(result, columns=columns)

But this will not take care of the values in the cells (for which you may want to call `.toPython()`).

matentzn avatar Oct 06 '20 13:10 matentzn

(don't use `str(v)`; better to use `v.toPython()`)

matentzn avatar Oct 06 '20 13:10 matentzn

Thanks @matentzn. .toPython() was the missing piece. Here's a function to convert rdflib.plugins.sparql.processor.SPARQLResult to a DataFrame that converts types:

from pandas import DataFrame
from rdflib.plugins.sparql.processor import SPARQLResult

def sparql_results_to_df(results: SPARQLResult) -> DataFrame:
    """
    Export results from an rdflib SPARQL query into a `pandas.DataFrame`,
    using Python types. See https://github.com/RDFLib/rdflib/issues/1179.
    """
    return DataFrame(
        data=([None if x is None else x.toPython() for x in row] for row in results),
        columns=[str(x) for x in results.vars],
    )

sparql_results_to_df(results)

Another option would be to call .toPython() on the column names, but that prefixes them with ?, which I'm guessing is not what most users want.
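
Because the function only relies on row iteration, `.toPython()`, and `results.vars`, it can be exercised without rdflib using hypothetical stand-ins for terms and results (the function body is repeated here so the snippet is self-contained):

```python
from pandas import DataFrame

class FakeTerm:
    """Hypothetical stand-in for an rdflib term with a .toPython() method."""
    def __init__(self, value):
        self.value = value
    def toPython(self):
        return self.value

class FakeResult:
    """Hypothetical stand-in for rdflib's SPARQLResult."""
    def __init__(self, vars_, rows):
        self.vars = vars_
        self._rows = rows
    def __iter__(self):
        return iter(self._rows)

def sparql_results_to_df(results) -> DataFrame:
    # same body as the function above
    return DataFrame(
        data=([None if x is None else x.toPython() for x in row] for row in results),
        columns=[str(x) for x in results.vars],
    )

results = FakeResult(["name", "age"],
                     [(FakeTerm("Alice"), FakeTerm(30)),
                      (FakeTerm("Bob"), None)])
df = sparql_results_to_df(results)
print(df)
```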

Note that the @white-gecko solution does not deal with the data types in literals in any way; its not wrong, just something to note.

Yes, I imagine almost all users will want to convert types when creating a DataFrame. Also noting that method is equivalent (besides column order preservation) to the first method in my comment above, which also explains why it's deficient:

# renders properly in notebooks, but DataFrame values are rdflib objects rather than Python primitives
pd.DataFrame(results.bindings)

dhimmel avatar Oct 06 '20 14:10 dhimmel

see https://github.com/WolfgangFahl/pyLoDStorage/issues/25 - just use pyLoDStorage and you'll be all set

WolfgangFahl avatar Aug 16 '21 18:08 WolfgangFahl

How do I remove the namespace from the SPARQL result? The encoding method is not working.

Gautamshahi avatar Apr 09 '23 22:04 Gautamshahi

@Gautamshahi - please give more details ... Did you use pyLoDStorage, or are you still referring to the open issue in sparqlwrapper? What is your input, what is your query, and what did you try? Since you asked a question, Stack Overflow might be a better place for it: see https://stackoverflow.com/questions/tagged/sparqlwrapper

WolfgangFahl avatar Apr 10 '23 17:04 WolfgangFahl