snps
Optimize normalized snps dataframe dtypes
Update the dtype of the `rsid`, `chrom`, and `genotype` columns to `pandas.StringDtype`, as recommended here. Also require `pandas>1.0.0`.
Have you thought about using `CategoricalDtype` for `chrom` and `genotype`? See here.
That's a great idea and will really help reduce memory usage for those columns. And compared to `object`, it looks like `StringDtype` for the `rsid` column will also use less memory.
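For anyone wanting to check the categorical savings locally, here is a rough sketch. The data is made up (four repeated values per column), so the exact numbers will differ from a real file; it only compares `object` columns against categoricals.

```python
import pandas as pd

# Hypothetical sample data; real files hold far more rows.
n = 10_000
df = pd.DataFrame(
    {
        "chrom": ["1", "2", "X", "MT"] * (n // 4),
        "genotype": ["AA", "AG", "GG", "CT"] * (n // 4),
    },
    dtype=object,  # force object dtype regardless of pandas version defaults
)

# Per-column memory for object dtype, counting the Python string objects
object_bytes = df.memory_usage(deep=True, index=False)

# Same columns as categoricals: small integer codes plus a table of categories
categorical = df.astype({"chrom": "category", "genotype": "category"})
categorical_bytes = categorical.memory_usage(deep=True, index=False)

print(object_bytes)
print(categorical_bytes)
```

The savings come from the low number of distinct values in these two columns; a column of mostly unique strings would not benefit in the same way.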
Note that in a quick test with one of the example files, `s._snps.index = s._snps.index.astype(pd.StringDtype())` reduces memory usage by ~2.5 times (very desirable). However, just using `.loc` with an rsid label (e.g., `s.snps.loc["rs3094315"]`) coerces the index back to `object` dtype.
It seems that to keep the `rsid` column as `pd.StringDtype()`, either another method would have to be used to filter SNPs (e.g., `s.snps.loc[s.snps.index == "rs3094315"]`), which is less convenient, or `astype` would have to be called after each `.loc` to convert the dtype back to `pd.StringDtype()`, which temporarily uses more memory while the dtype is `object`.
So, the following dtypes seem like a good trade-off between memory and convenience:
Column | pandas dtype |
---|---|
rsid | object |
chrom | pd.CategoricalDtype() (ordered after sorting chroms) |
pos | pd.UInt32Dtype() |
genotype | pd.CategoricalDtype() |
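As a rough illustration, the dtypes in the table could be applied along these lines. This is only a sketch: the rows, positions, and the chromosome order are made up, and `rsid` is left as the index with whatever dtype pandas assigns it.

```python
import pandas as pd

# Hypothetical rows in the normalized layout (rsid index; chrom, pos, genotype)
df = pd.DataFrame(
    {
        "rsid": ["rs3094315", "rs_example_1", "rs_example_2", "rs_example_3"],
        "chrom": ["1", "1", "X", "MT"],
        "pos": [752566, 800000, 200, 300],  # illustrative positions only
        "genotype": ["AA", "AG", None, "AA"],
    }
).set_index("rsid")

# Ordered categories, listed in the order the chromosomes have after sorting
# (assumed order for this toy data)
chrom_dtype = pd.CategoricalDtype(categories=["1", "X", "MT"], ordered=True)

df = df.astype(
    {
        "chrom": chrom_dtype,
        "pos": pd.UInt32Dtype(),  # nullable unsigned 32-bit integer
        "genotype": pd.CategoricalDtype(),  # categories inferred from the data
    }
)

print(df.dtypes)
```

Making `chrom` ordered means comparisons and sorting follow the category order rather than plain string order.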
Upon further investigation, it looks like `object` and `pd.StringDtype()` use the same amount of memory. Resetting the index dtype as above actually just freed the memory used by a hash table that was generated when label-based lookups were performed on the `rsid` index internal to `snps`, e.g., to determine the build. See this issue for an explanation of the hash table behavior: https://github.com/pandas-dev/pandas/issues/31197 .
So, I think to be explicit, `rsid` should be `pd.StringDtype()` after all.
The `pandas` issue provides ideas on how to prevent the hash table from being generated (e.g., only performing boolean indexing or not using `rsid` as the index internal to `snps`).
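A minimal sketch of both ideas together, assuming `rsid` is kept as an ordinary column and lookups use a boolean mask. The DataFrame here is hypothetical and is not how `snps` stores its data today.

```python
import pandas as pd

# Hypothetical data; rsid is an ordinary StringDtype column, not the index
snps_df = pd.DataFrame(
    {
        "rsid": pd.array(
            ["rs3094315", "rs_example_1", "rs_example_2"], dtype="string"
        ),
        "chrom": ["1", "1", "2"],
        "pos": [752566, 800000, 900000],  # illustrative positions only
        "genotype": ["AA", "AG", "GG"],
    }
)

# Boolean-mask lookup: every value is compared to the label, so no
# label-based lookup on an rsid index is performed
match = snps_df[snps_df["rsid"] == "rs3094315"]

print(match)
print(match["rsid"].dtype)
```

The trade-off is that each lookup scans the whole column, which is likely slower than a hashed index lookup when many individual rsids are queried.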