
Phenotype data does not use pandas dtype inference

Open • hardingnj opened this issue 4 years ago • 1 comment

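Because the phenotype data is built without going through pandas' read_csv, we lose its detection of NaN values and its dtype inference, so numeric columns that contain missing values are stored with dtype object.

For example: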

import GEOparse

geo = GEOparse.get_GEO("GSE112676")

geo.phenotype_data["characteristics_ch1.3.age_onset"]

gives

GSM3076582    72.69
GSM3076584    66.97
GSM3076586    73.73
GSM3076588       NA
GSM3076590       NA
              ...  
GSM3078502    74.88
GSM3078503    73.57
GSM3078505    71.29
GSM3078507    61.84
GSM3078510    74.49
Name: characteristics_ch1.3.age_onset, Length: 741, dtype: object

So the "NA" strings are not recognized as missing values, and the column is left as object rather than being converted to float.
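The same behaviour can be reproduced without downloading the series. This is only a minimal sketch: the small Series below is a hypothetical stand-in for the real column, and it assumes the values are stored as strings (including the "NA" placeholder), which the dtype above suggests but which I have not verified in the GEOparse source.

```python
import pandas as pd

# Hypothetical stand-in for the phenotype column. Assumption: every value,
# including the "NA" placeholder, is stored as a Python string.
age_onset = pd.Series(
    ["72.69", "66.97", "NA", "74.88"],
    index=["GSM3076582", "GSM3076584", "GSM3076588", "GSM3078502"],
    name="characteristics_ch1.3.age_onset",
    dtype=object,
)
print(age_onset.dtype)

# errors="coerce" turns strings that cannot be parsed (here "NA") into NaN,
# so the result is a float column with one missing value.
converted = pd.to_numeric(age_onset, errors="coerce")
print(converted.dtype)
```

Note that errors="coerce" also silently turns any other non-numeric string into NaN, so it is only safe on columns already known to be numeric.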

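My fix is something like this:

from io import StringIO

import pandas as pd

pheno = geo.phenotype_data
# Round-trip through CSV so that read_csv re-infers dtypes and parses "NA" as NaN
out = StringIO()
pheno.to_csv(out)
pheno = pd.read_csv(StringIO(out.getvalue()), index_col=0)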

I can put in a quick PR, but this feels a little crude, and I haven't been able to find a more elegant way.
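One possible in-memory alternative, sketched here under assumptions rather than proposed as the final fix: convert each column with pd.to_numeric and keep the original column when conversion fails. The helper name infer_numeric_columns and the na_tokens parameter are made up for illustration and are not part of GEOparse or pandas. It will not match read_csv exactly, since read_csv recognizes a longer default list of missing-value strings and also infers other dtypes such as booleans.

```python
import pandas as pd


def infer_numeric_columns(df, na_tokens=("NA", "")):
    """Return a copy of df in which fully numeric columns get a numeric dtype.

    Hypothetical helper: the name and the na_tokens default are assumptions.
    """
    result = df.copy()
    for col in result.columns:
        # Replace the placeholder strings with missing values
        cleaned = result[col].mask(result[col].isin(na_tokens))
        try:
            # Raises if any remaining value is not a number
            result[col] = pd.to_numeric(cleaned)
        except (ValueError, TypeError):
            # Column contains real text, so leave it unchanged
            pass
    return result


# Usage on a small made-up frame shaped like the phenotype table
pheno = pd.DataFrame(
    {
        "age_onset": ["72.69", "NA", "74.88"],
        "diagnosis": ["ALS", "CON", "ALS"],
    },
    index=["GSM3076582", "GSM3076588", "GSM3078502"],
    dtype=object,
)
fixed = infer_numeric_columns(pheno)
print(fixed.dtypes)
```

The try/except keeps text columns exactly as they were, at the cost of leaving a column untouched if even one value is a non-numeric string other than the listed tokens.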

hardingnj, Sep 22 '21 10:09

Thanks for reporting. Let me think about how to do this. A PR would be a good idea, so that we can test it.

guma44, Oct 19 '21 12:10