Avoiding the dense matrix in `create_cistopic_object_from_fragments` reduces memory 50X [PERFORMANCE]
I'm running pycisTopic, but `create_cistopic_object_from_fragments` uses a very large amount of memory.
As has already been discussed (e.g. see https://github.com/aertslab/pycisTopic/issues/14), the main bottleneck comes from the following lines:
```python
fragment_matrix = (
    counts_df.groupby(["Name", "regionID"], sort=False, observed=True)
    .size()
    .unstack(level="Name", fill_value=0)
    .astype(np.int32)
)
```
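For context, `counts_df` is (as far as I can tell) a long/tidy table with roughly one row per fragment/region overlap, with categorical `Name` (cell) and `regionID` columns. A tiny toy version with just the two columns the pivot uses makes it clear what the code above produces (the toy names are mine, not from pycisTopic):

```python
import numpy as np
import pandas as pd

# toy stand-in for counts_df: one row per fragment/region overlap,
# with categorical cell names ("Name") and region IDs ("regionID")
toy_counts = pd.DataFrame(
    {
        "Name": pd.Categorical(["cellA", "cellA", "cellB"]),
        "regionID": pd.Categorical(["chr1:1-100", "chr1:1-100", "chr1:200-300"]),
    }
)
toy_matrix = (
    toy_counts.groupby(["Name", "regionID"], sort=False, observed=True)
    .size()
    .unstack(level="Name", fill_value=0)
    .astype(np.int32)
)
print(toy_matrix)
# a dense regions-by-cells table: 2 fragments for cellA in chr1:1-100,
# 1 fragment for cellB in chr1:200-300, zeros everywhere else
```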
With 100k cells and 500k regions this makes a 200 GB matrix (of `int32`, even though it is soon converted to `bool`).
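The back-of-the-envelope arithmetic behind that number, using the round figures above:

```python
# dense int32 pivot table: every (region, cell) pair is materialised,
# regardless of how sparse the actual counts are
n_cells = 100_000
n_regions = 500_000
dense_bytes = n_cells * n_regions * 4  # 4 bytes per int32
print(dense_bytes / 1e9)  # 200.0 GB
```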
When this errors (caught with `except (ValueError, MemoryError)`), the current solution is to divide the data into 5 partitions, sequentially run the above `.groupby` on each, and merge the resulting `cistopic_obj_list`.
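For reference, the idea behind that fallback is roughly the following (a hypothetical sketch, not the actual pycisTopic code; the chunking over cell names is only for illustration):

```python
import numpy as np

# illustrative only: pivot the fragments in chunks of cells so that each
# intermediate dense matrix is ~1/5 of the full one
n_partitions = 5
cell_chunks = np.array_split(counts_df["Name"].cat.categories, n_partitions)
partial_matrices = []
for chunk in cell_chunks:
    chunk_df = counts_df[counts_df["Name"].isin(chunk)]
    partial_matrices.append(
        chunk_df.groupby(["Name", "regionID"], sort=False, observed=True)
        .size()
        .unstack(level="Name", fill_value=0)
        .astype(np.int32)
    )
# in pycisTopic a CistopicObject is built per partition and the resulting
# cistopic_obj_list is merged afterwards, so the peak is lower but each
# partition still materialises a dense block
```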
However, I think it would make a lot more sense to avoid creating this wide, dense matrix in the first place. `counts_df` is already effectively a tidy dataframe, so it can easily be converted to a `coo_matrix` and then a `csr_matrix` sparse matrix directly, without ever making the dense `pd.DataFrame`.
As an example, see below:
```python
from scipy import sparse
import numpy as np

# original approach: build the dense pivot table (kept here only for comparison)
fragment_matrix = (
    counts_df.groupby(["Name", "regionID"], sort=False, observed=True)
    .size()
    .unstack(level="Name", fill_value=0)
    .astype(np.int32)
)
fragment_matrix.columns.names = [None]

# counts_df.Name has many unused cell-name categories because the categorical is made using all fragments.
# This doesn't matter for making a sparse matrix, but these get dropped by `.groupby(observed=True)`.
# Below I remove them for comparison, but this step is unnecessary and one
# could instead just keep the columns and cell names where s_csr.getnnz(axis=0) > 0.
new_cat = counts_df.Name.cat.remove_unused_categories()
name_order = new_cat.cat.categories
cat_order = counts_df.regionID.cat.categories

# just for comparison with the sparse version below
fragment_ordered = fragment_matrix.loc[cat_order][name_order]
# converted to a csr matrix in `create_cistopic_object`, but do so manually here to compare
fragment_sparse = sparse.csr_matrix(fragment_ordered.to_numpy(), dtype=np.int32)
assert fragment_sparse.data.sum() == len(counts_df)
assert (counts_df.regionID.cat.categories == fragment_ordered.index).all()

# new approach: avoid creating the dense dataframe by using the categorical codes as sparse indices
data = np.ones(len(counts_df))
# coo_matrix sums duplicate (row, col) entries, so this counts the number of fragments
# just like `.groupby().size()`; one could also accumulate the scores instead
s_coo = sparse.coo_matrix(
    (data, (counts_df.regionID.cat.codes.values, new_cat.cat.codes.values))
)
s_csr = s_coo.tocsr()

# compare and see that we have created the same matrix
assert np.array_equal(s_csr.data, fragment_sparse.data)
assert (s_csr.indices == fragment_sparse.indices).all()
assert (s_csr.indptr == fragment_sparse.indptr).all()
```
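Two of the comments above are worth making concrete: `coo_matrix` sums duplicate (row, col) entries on conversion, which is exactly what turns one-entry-per-fragment into fragment counts, and all-zero columns (unused cell-name categories) can simply be dropped afterwards. A small self-contained toy example (variable names are mine):

```python
from scipy import sparse
import numpy as np

# duplicate (row, col) pairs are summed on conversion to CSR,
# so repeated fragments in the same cell/region become counts
rows = np.array([0, 0, 2, 2, 2])
cols = np.array([1, 1, 0, 0, 3])
toy = sparse.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(3, 5)).tocsr()
print(toy.toarray())
# [[0. 2. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]
#  [2. 0. 0. 1. 0.]]

# keep only the columns that are actually used (e.g. observed cell names)
used = toy.getnnz(axis=0) > 0
print(toy[:, used].shape)  # (3, 3)
```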
Making the `csr_matrix` sparse matrix directly avoids the dense matrix and, for my data, uses 50 times less memory (roughly 4 GB vs 200 GB). The resulting sparse matrix can then easily be passed to `create_cistopic_object`:
```python
cistopic_obj = create_cistopic_object(
    fragment_matrix=s_csr,
    cell_names=list(name_order),
    region_names=list(cat_order),
    path_to_blacklist=path_to_blacklist,
    min_frag=min_frag,
    min_cell=min_cell,
    is_acc=is_acc,
    path_to_fragments={project: path_to_fragments},
    project=project,
    split_pattern=split_pattern,
)
```
I haven't tested it that extensively, but I believe the above code should generate the same `cistopic_obj` without making any wide dense arrays. The order of the `cell_names` and `region_names` might be slightly different, but in the new sparse version they should be in the same order as the categories in `counts_df`, which probably makes more sense (and it's easy to call `cistopic_obj.subset()` to reorder them).
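For example, something along these lines should reorder them (assuming `subset` accepts `cells`/`regions` lists and a `copy` flag; please check the signature in the installed version, and `desired_cells`/`desired_regions` are just placeholder lists):

```python
# reorder cells and regions to an explicit order after building the object;
# desired_cells / desired_regions are hypothetical user-defined lists
cistopic_obj_reordered = cistopic_obj.subset(
    cells=desired_cells,
    regions=desired_regions,
    copy=True,
)
```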
Do you think it would make sense to add such code to `cistopic_class`? Or are there already plans to rewrite these functions to reduce their memory usage? Converting some of the pandas code to polars, as has already been done elsewhere, might increase speed (and lazy frames combined with filter and select could reduce memory usage), but removing the code that builds these giant dense arrays would really cut down on memory. I haven't looked that carefully at the QC code, but my impression is that it also uses a lot of memory for things that could be done sequentially or without holding as many large objects in memory at once.
I'm currently using pycisTopic version `2.0a0`.