snps Optimize normalized snps dataframe dtypes

Update the dtype of rsid, chrom, and genotype columns to be pandas.StringDtype as recommended here.

Also require pandas>1.0.0.

Oct 31 '20 04:10 apriha

Have you thought about using CategoricalDtype for chrom and genotype? See here

Nov 06 '20 17:11 afaulconbridge

That's a great idea and will really help reduce memory usage for those columns.

And compared to object, it looks like StringDtype for the rsid column will also use less memory.
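As a rough illustration of the memory point for a low-cardinality column like chrom, here is a minimal sketch on toy data. The series below is made up for the example and is not taken from a real snps file; exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Toy stand-in for a chrom column: many rows, very few distinct values.
n = 10_000
chrom_obj = pd.Series(["1", "2", "X", "MT"] * (n // 4), dtype=object)

# Categorical storage keeps one small integer code per row plus the
# short list of distinct categories.
chrom_cat = chrom_obj.astype("category")

obj_bytes = chrom_obj.memory_usage(deep=True)
cat_bytes = chrom_cat.memory_usage(deep=True)
print(f"object: {obj_bytes} bytes, category: {cat_bytes} bytes")
```

The same comparison can be run on the genotype column, which also has only a handful of distinct values.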

Nov 07 '20 06:11 apriha

Note that in a quick test with one of the example files, s._snps.index = s._snps.index.astype(pd.StringDtype()) reduces memory usage by ~2.5 times (very desirable). However, just using .loc with an rsid label coerces the index back to object dtype (e.g., s.snps.loc["rs3094315"]).

To keep the rsid index as pd.StringDtype(), it seems there are two options: filter SNPs another way, e.g., s.snps.loc[s.snps.index == "rs3094315"] (less convenient), or call astype after each .loc to convert the dtype back to pd.StringDtype() (which temporarily uses more memory while the dtype is object).
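The two options can be sketched on a toy frame as follows. The frame, its positions, and its genotypes are made up for illustration and do not use the snps API. The dtypes are printed rather than asserted because the behavior of .loc on the index dtype may differ between pandas versions.

```python
import pandas as pd

# Toy frame with an rsid index; positions and genotypes are illustrative.
df = pd.DataFrame(
    {"chrom": ["1", "1"], "pos": [100, 200], "genotype": ["AA", "AG"]},
    index=pd.Index(["rs3094315", "rs3131972"], name="rsid"),
)
df.index = df.index.astype(pd.StringDtype())

# Label-based lookup (convenient).
by_label = df.loc[["rs3094315"]]

# Boolean-mask lookup (less convenient, but avoids a label lookup).
by_mask = df.loc[df.index == "rs3094315"]

# Inspect what each approach does to the index dtype on this pandas version.
print(df.index.dtype, by_label.index.dtype, by_mask.index.dtype)

# Second option: convert the index back after a label-based lookup.
restored = by_label.copy()
restored.index = restored.index.astype(pd.StringDtype())
```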

So, the following dtypes seem like a good trade-off between memory and convenience:

Column	pandas dtype
rsid	object
chrom	pd.CategoricalDtype() (ordered after sorting chroms)
pos	pd.UInt32Dtype()
genotype	pd.CategoricalDtype()
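A minimal sketch of applying the dtypes in this table to a toy frame. The chromosome order in chrom_order is an assumption made for the example (autosomes, then X, Y, MT); it is not necessarily the order snps uses when sorting chroms.

```python
import pandas as pd

# Toy data; rsids, positions, and genotypes are made up.
df = pd.DataFrame(
    {
        "chrom": ["X", "2", "MT", "1"],
        "pos": [300, 200, 400, 100],
        "genotype": ["AA", "AG", None, "GG"],
    },
    index=pd.Index(["rs3", "rs2", "rs4", "rs1"], dtype=object, name="rsid"),
)

# Assumed chromosome ordering, used only for this illustration.
chrom_order = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]
chrom_dtype = pd.CategoricalDtype(categories=chrom_order, ordered=True)

df = df.astype(
    {
        "chrom": chrom_dtype,                  # ordered categorical
        "pos": pd.UInt32Dtype(),               # nullable unsigned 32-bit int
        "genotype": pd.CategoricalDtype(),     # categories inferred from data
    }
)

# With an ordered categorical, sorting follows chromosome order rather
# than plain string order.
print(df.sort_values(["chrom", "pos"]))
print(df.dtypes)
```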

Nov 11 '20 07:11 apriha

Upon further investigation, it looks like object and pd.StringDtype() use the same amount of memory. Resetting the index dtype as above actually just freed the memory used by a hash table that was generated when label-based lookups were performed on the rsid index internal to snps (e.g., to determine the build). See this issue for an explanation of the hash table behavior: https://github.com/pandas-dev/pandas/issues/31197 .

So, I think to be explicit, rsid should be pd.StringDtype() after all.

The pandas issue provides ideas on how to prevent the hash table from being generated (e.g., only performing boolean indexing or not using rsid as the index internal to snps).
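One of those ideas, keeping rsid as a regular column and filtering with a boolean mask, could look roughly like the sketch below. The lookup helper and the toy frame are hypothetical and not part of snps. The expectation is that a mask on a column does not go through a label lookup on the index, though this has not been measured here.

```python
import pandas as pd

# Toy frame with rsid as a regular string column and the default index.
# Positions are made up for illustration.
df = pd.DataFrame(
    {
        "rsid": pd.array(["rs3094315", "rs3131972", "rs12124819"], dtype="string"),
        "chrom": ["1", "1", "1"],
        "pos": [100, 200, 300],
    }
)


def lookup(frame: pd.DataFrame, rsid: str) -> pd.DataFrame:
    """Hypothetical helper: select rows by rsid using a boolean mask."""
    return frame[frame["rsid"] == rsid]


hit = lookup(df, "rs3094315")
print(hit)
print(hit["rsid"].dtype)
```

The trade-off is a linear scan per lookup instead of a hashed label lookup, which may matter if many individual rsids are queried.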

Nov 22 '20 06:11 apriha
