Optimize normalized snps dataframe dtypes

Open apriha opened this issue 4 years ago • 4 comments

Update the dtype of rsid, chrom, and genotype columns to be pandas.StringDtype as recommended here.

Also require pandas>=1.0.0, the release that introduced StringDtype.
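As a rough illustration of the proposed change, the sketch below casts the three text columns to pandas.StringDtype. The dataframe is a hypothetical miniature stand-in for a normalized snps dataframe (rsid is kept as a plain column here for simplicity), and the values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of a normalized snps dataframe; values are illustrative only.
df = pd.DataFrame(
    {
        "rsid": ["rs3094315", "rs12562034", "rs3934834"],
        "chrom": ["1", "1", "1"],
        "pos": [752566, 768448, 1005806],
        "genotype": ["AA", "AG", None],
    }
)

# Cast the three text columns to the dedicated string dtype.
string_cols = ["rsid", "chrom", "genotype"]
df = df.astype({col: pd.StringDtype() for col in string_cols})

print(df.dtypes)
# Missing genotypes are represented as pd.NA under StringDtype.
print(df["genotype"].isna().sum())
```

Passing a dict to astype leaves the pos column untouched and makes the intended dtype of each column explicit.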

apriha avatar Oct 31 '20 04:10 apriha

Have you thought about using CategoricalDtype for chrom and genotype? See here

afaulconbridge avatar Nov 06 '20 17:11 afaulconbridge

That's a great idea and will really help reduce memory usage for those columns.

And compared to object, it looks like StringDtype for the rsid column will also use less memory.
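One way to check the memory claim for chrom and genotype is to compare object and categorical versions of the same data. The sketch below uses randomly generated columns (the chromosome names, genotype values, and row count are assumptions, not taken from a real file), so the exact numbers will differ from those of an actual snps dataframe.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000  # assumed size, for illustration only

# Assumed value sets; real files may contain other chromosomes or genotypes.
chrom_values = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]
genotype_values = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]

columns = {
    "chrom": pd.Series(rng.choice(chrom_values, size=n_rows), dtype=object),
    "genotype": pd.Series(rng.choice(genotype_values, size=n_rows), dtype=object),
}

memory = {}
for name, as_object in columns.items():
    as_category = as_object.astype("category")
    # deep=True counts the Python string objects, not just the pointers to them.
    memory[name] = {
        "object": as_object.memory_usage(deep=True),
        "category": as_category.memory_usage(deep=True),
    }
    print(name, memory[name])
```

A categorical column stores each distinct value once plus a small integer code per row, which is why it helps most for columns with few distinct values.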

apriha avatar Nov 07 '20 06:11 apriha

Note that in a quick test with one of the example files, s._snps.index = s._snps.index.astype(pd.StringDtype()) reduces memory usage by ~2.5 times (very desirable). However, just using .loc with an rsid label coerces the index back to object dtype (e.g., s.snps.loc["rs3094315"]).

It seems that to maintain the rsid column as pd.StringDtype(), either SNPs would have to be filtered another way (e.g., s.snps.loc[s.snps.index == "rs3094315"]), which is less convenient, or astype would have to be called after each .loc to convert the dtype back to pd.StringDtype(), which temporarily uses more memory while the dtype is object.
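The two options can be sketched side by side on a small frame. The dataframe below is a hypothetical stand-in for s.snps, and the index dtypes printed at the end may vary with the installed pandas version, so no particular output is assumed here.

```python
import pandas as pd

# Hypothetical stand-in for s.snps, indexed by rsid; values are illustrative only.
snps = pd.DataFrame(
    {
        "chrom": ["1", "1", "1"],
        "pos": [752566, 768448, 1005806],
        "genotype": ["AA", "AG", "GG"],
    },
    index=pd.Index(["rs3094315", "rs12562034", "rs3934834"], name="rsid"),
)
snps.index = snps.index.astype(pd.StringDtype())

# Option 1: filter with a boolean mask instead of a label lookup.
by_mask = snps.loc[snps.index == "rs3094315"]

# Option 2: label lookup, then cast the result's index back to StringDtype.
by_label = snps.loc[["rs3094315"]]
by_label.index = by_label.index.astype(pd.StringDtype())

# Inspect the index dtypes rather than assuming them.
print(snps.index.dtype, by_mask.index.dtype, by_label.index.dtype)
```

Both options select the same row; they differ only in how the lookup is performed and in whether a follow-up cast is needed.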

So, the following dtypes seem like a good trade-off between memory and convenience:

| Column | pandas dtype |
| --- | --- |
| rsid | object |
| chrom | pd.CategoricalDtype() (ordered after sorting chroms) |
| pos | pd.UInt32Dtype() |
| genotype | pd.CategoricalDtype() |
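A minimal sketch of applying these dtypes follows. The sample dataframe and the chromosome ordering (1 to 22, then X, Y, MT) are assumptions made for illustration; the ordering actually produced by sorting chroms in snps may differ.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample; rsid stays an object index.
df = pd.DataFrame(
    {
        "chrom": ["MT", "1", "X", "1"],
        "pos": [263, 752566, 2700157, 768448],
        "genotype": ["A", "AA", "CC", None],
    },
    index=pd.Index(["i3001933", "rs3094315", "rs5939319", "rs12562034"], name="rsid"),
)

# Assumed natural ordering of chromosomes, restricted to those present in the data.
chrom_order = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]
present = [c for c in chrom_order if c in set(df["chrom"])]

df["chrom"] = df["chrom"].astype(pd.CategoricalDtype(categories=present, ordered=True))
df["pos"] = df["pos"].astype(pd.UInt32Dtype())
df["genotype"] = df["genotype"].astype(pd.CategoricalDtype())

# Sorting an ordered categorical follows the category order, not string order.
df = df.sort_values(["chrom", "pos"])
print(df.dtypes)
print(df)
```

With an ordered categorical, sorting places "X" and "MT" after the numbered chromosomes, which plain string sorting would not do.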

apriha avatar Nov 11 '20 07:11 apriha

Upon further investigation, it looks like object and pd.StringDtype() use the same amount of memory. Resetting the index dtype as above actually just freed the memory used by a hash table, which was generated when label-based lookups were performed on the rsid index internal to snps (e.g., to determine the build). See this issue for an explanation of the hash table behavior: https://github.com/pandas-dev/pandas/issues/31197 .

So, I think to be explicit, rsid should be pd.StringDtype() after all.

The pandas issue provides ideas on how to prevent the hash table from being generated (e.g., only performing boolean indexing or not using rsid as the index internal to snps).
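A sketch of the second idea, keeping rsid as a regular column and selecting rows with a boolean mask only, might look like the following. The dataframe and the select_rsids helper are hypothetical and not part of snps, and the memory effect of avoiding label lookups is not measured here.

```python
import pandas as pd

# Hypothetical frame with rsid as a regular StringDtype column and a default RangeIndex.
snps = pd.DataFrame(
    {
        "rsid": pd.array(["rs3094315", "rs12562034", "rs3934834"], dtype="string"),
        "chrom": ["1", "1", "1"],
        "pos": [752566, 768448, 1005806],
        "genotype": ["AA", "AG", "GG"],
    }
)


def select_rsids(df, rsids):
    """Return the rows whose rsid is in `rsids`, using a boolean mask only."""
    return df[df["rsid"].isin(list(rsids))]


one = select_rsids(snps, ["rs3094315"])
several = select_rsids(snps, ["rs3094315", "rs3934834", "rs0000000"])
print(one)
print(several)
```

Unlike a label lookup, an rsid that is absent from the data simply yields no row instead of raising a KeyError, which may or may not be desirable.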

apriha avatar Nov 22 '20 06:11 apriha