GEOparse
Phenotype data does not use pandas dtype inference
Because the phenotype table is built without going through pandas' read_csv, we lose the detection of NaN values and the dtype inference that comes with it, so numeric columns end up with object dtype.
For example:
import GEOparse
geo = GEOparse.get_GEO("GSE112676")
geo.phenotype_data["characteristics_ch1.3.age_onset"]
gives
GSM3076582 72.69
GSM3076584 66.97
GSM3076586 73.73
GSM3076588 NA
GSM3076590 NA
...
GSM3078502 74.88
GSM3078503 73.57
GSM3078505 71.29
GSM3078507 61.84
GSM3078510 74.49
Name: characteristics_ch1.3.age_onset, Length: 741, dtype: object
So the "NA" strings are not recognized as missing values, and the whole column stays as object dtype instead of being parsed as float.
My fix is something like this:
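To show the behaviour without downloading the series, here is a minimal sketch with made-up sample IDs and values (the frame below is a hypothetical stand-in, not real GEO data). It compares a column built directly from strings with the same column after it has passed through read_csv, which treats "NA" as missing by default.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a phenotype table: every value is a string.
raw = pd.DataFrame(
    {"age_onset": ["72.69", "66.97", "NA"]},
    index=["GSM_A", "GSM_B", "GSM_C"],
    dtype=object,
)

# Built directly from strings: "NA" is just another string, so dtype is object.
built_dtype = raw["age_onset"].dtype

# Passed through read_csv: "NA" is one of the default missing-value markers,
# so the column can be inferred as float64 with a NaN in place of "NA".
parsed = pd.read_csv(StringIO(raw.to_csv()), index_col=0)
parsed_dtype = parsed["age_onset"].dtype

print(built_dtype, parsed_dtype)
```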
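from io import StringIO
import pandas as pd
pheno = geo.phenotype_data
# Round-trip through CSV so read_csv can infer dtypes and parse "NA" as NaN
out = StringIO()
pheno.to_csv(out)
pheno = pd.read_csv(StringIO(out.getvalue()), index_col=0)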
I can put in a quick PR, though this round-trip feels a little crude and I haven't been able to find a more elegant way.
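One possible direction that avoids the CSV round-trip is to convert column by column with pd.to_numeric. This is only a rough sketch. The helper name infer_numeric_columns and the list of missing-value markers are assumptions made for illustration, not part of GEOparse, and the marker list is much shorter than the defaults read_csv uses.

```python
import pandas as pd

# Assumed set of strings to treat as missing; adjust as needed.
NA_STRINGS = ["NA", "N/A", "NaN", ""]


def infer_numeric_columns(df, na_strings=NA_STRINGS):
    """Return a copy of df where columns that are fully numeric
    (apart from missing-value markers) are converted to a numeric dtype."""
    out = df.copy()
    for col in out.columns:
        # Replace missing-value markers with NaN, keep everything else.
        cleaned = out[col].where(~out[col].isin(na_strings))
        # Anything that cannot be parsed as a number becomes NaN.
        numeric = pd.to_numeric(cleaned, errors="coerce")
        # Convert only if coercion did not introduce new missing values,
        # i.e. every non-missing entry was a valid number.
        if numeric.isna().sum() == cleaned.isna().sum():
            out[col] = numeric
    return out
```

Usage would be something like pheno = infer_numeric_columns(geo.phenotype_data). Text columns such as sex or tissue are left untouched because coercing them would create extra NaN values.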
Thanks for reporting. Let me think about how best to do this. A PR would be good, so that we can test it.