
Problems with column types

Open iipr opened this issue 6 years ago • 0 comments

After experimenting a bit with gastrodon, I ran into two problems with how column types are handled. To reproduce:

Preliminaries

from gastrodon import RemoteEndpoint, inline
import pandas as pd

prefixes = inline("""
    @prefix : <http://dbpedia.org/resource/> .
    @prefix dbp: <http://dbpedia.org/ontology/> .
    @prefix pr: <http://dbpedia.org/property/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
""").graph
endpoint = RemoteEndpoint(
    "http://dbpedia.org/sparql/"
    ,default_graph="http://dbpedia.org"
    ,prefixes=prefixes
    ,base_uri="http://dbpedia.org/resource/"
)

Error with dates

endpoint.select("""
SELECT DISTINCT ?personName ?bDay
WHERE {
    ?person a dbp:Person .
    ?person foaf:name ?personName .
    ?person dbp:birthDate ?bDay .
} LIMIT 10
""")

Output:

Traceback (most recent call last):
  File "<stdin>", line 9, in <module>
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 502, in select
    frame=self._dataframe(result)
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 397, in _dataframe
    column[key] = self._normalize_column_type(column[key])
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 376, in _normalize_column_type
    return [None if x==None else int(x) for x in column]
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 376, in <listcomp>
    return [None if x==None else int(x) for x in column]
TypeError: int() argument must be a string, a bytes-like object or a number, not 'datetime.date'

Issue (casting floats)

endpoint.select("""
SELECT DISTINCT ?starName ?mass
WHERE {
    ?star a dbp:Star .
    ?star foaf:name ?starName .
    ?star pr:mass ?mass
} LIMIT 1
""")

Output:

    starName  mass
0  61 Cygni     0

Expected output:

    starName  mass
0  61 Cygni   0.63

(see this)

Possible cause

I believe both problems come from _normalize_column_type:

  1. datetime.date values are not handled, so calling int(x) on a datetime.date raises the TypeError shown above.
  2. If all elements in the column are floats, they are silently cast to int, as shown in the issue above.
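
Both failure modes can be reproduced without an endpoint. The sketch below is only an illustration: `naive_normalize` is a hypothetical stand-in that mirrors the list comprehension visible in the traceback, not the actual gastrodon function, and the input values are made up.

```python
import datetime


def naive_normalize(column):
    # Hypothetical stand-in mirroring the comprehension from the traceback:
    # every non-null value is passed through int().
    return [None if x is None else int(x) for x in column]


# Floats are truncated toward zero, with no warning.
print(naive_normalize([0.63, None, 1.12]))  # [0, None, 1]

# datetime.date cannot be converted by int(), so a TypeError is raised.
try:
    naive_normalize([datetime.date(1990, 1, 1)])
except TypeError as exc:
    print("TypeError:", exc)
```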

My question now is: is it really necessary to normalize the columns? pandas is usually smart enough to infer column types and cast when needed. If I skip the call to _normalize_column_type() in the code, the mass in the stars example above is no longer cast to int, and when a column also contains strings, pandas falls back to the object dtype:

endpoint.select("""
SELECT DISTINCT ?starName ?mass
WHERE {
    ?star a dbp:Star .
    ?star foaf:name ?starName .
    ?star pr:mass ?mass
} LIMIT 100
""").head()

_.mass.dtype

Output:

      starName          mass
0     61 Cygni          0.63
1     61 Cygni           0.7
2  70 Virginis          1.12
3  70 Virginis  >7.49 ± 0.61
4      Albireo           3.2

dtype('O')
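
The same inference can be checked with pandas alone. This is a minimal sketch, assuming the values reach pandas as plain Python floats, strings, and dates; the numbers are hard-coded from the output above rather than fetched from DBpedia, and the date is made up.

```python
import datetime

import pandas as pd

# All floats: pandas infers float64, nothing is truncated.
all_floats = pd.Series([0.63, 0.7, 1.12])

# Floats mixed with a string: pandas falls back to the object dtype.
mixed = pd.Series([0.63, 0.7, ">7.49 ± 0.61"])

# Missing values: None becomes NaN and the column stays float64.
with_missing = pd.Series([0.63, None, 1.12])

# datetime.date values are kept as Python objects (object dtype).
dates = pd.Series([datetime.date(1990, 1, 1)])

print(all_floats.dtype)    # float64
print(mixed.dtype)         # object
print(with_missing.dtype)  # float64
print(dates.dtype)         # object
```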

Python 3.6.6
gastrodon 0.9.3
pandas 0.23.4

iipr, Apr 26 '19 11:04