Slow update() when indices in modalities haven't changed

Open gtca opened this issue 3 years ago • 1 comments

This issue continues #16.

While much of the functionality in MuData, just as in AnnData, cannot be guaranteed when the indices contain duplicates (and many functions will raise an error), it should still be possible to create an object with such indices (obs_names/var_names).

It might be reasonable to skip the expensive joins on multi-level indices when the indices haven't changed since the last .update(). For that, we would have to remember the indices of individual modalities, e.g. in ._mod_index, as surfaced in #17. It's unclear whether this complexity is worth introducing: in most workflows, indices are expected to be made unique at the very beginning of MuData object creation, which brings the expected number of uses of a faster .update() in such cases down to 0.
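To make the idea concrete, here is a minimal sketch of remembering per-modality indices and checking them before doing any join. It is not the MuData implementation: the `ModIndexCache` class and its `needs_update`/`remember` methods are hypothetical names, and modalities are reduced to plain sequences of obs_names to keep the example self-contained.

```python
class ModIndexCache:
    """Hypothetical helper: remembers each modality's obs_names as of the last update."""

    def __init__(self):
        self._mod_index = {}  # modality name -> tuple snapshot of its names

    def needs_update(self, mods):
        """Return True if the set of modalities or any of their indices changed."""
        if mods.keys() != self._mod_index.keys():
            return True
        # Tuple comparison is order-sensitive and works with duplicate names.
        return any(self._mod_index[name] != tuple(idx) for name, idx in mods.items())

    def remember(self, mods):
        """Snapshot the current indices (copies, so later mutation is detected)."""
        self._mod_index = {name: tuple(idx) for name, idx in mods.items()}


cache = ModIndexCache()
mods = {
    "rna": ["cell_1", "cell_1", "cell_2"],  # duplicates are allowed
    "atac": ["cell_1", "cell_2"],
}

first = cache.needs_update(mods)   # nothing remembered yet, so an update is needed
cache.remember(mods)
second = cache.needs_update(mods)  # indices unchanged, joins could be skipped
mods["atac"] = ["cell_1", "cell_3"]
third = cache.needs_update(mods)   # one index changed, so an update is needed
```

Keeping full copies of the indices costs memory proportional to the number of observations, which is part of the complexity trade-off raised above.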

Feb 07 '22 14:02 gtca

There's a step in this direction with #24: if the hashes of both the row and column indices in the modalities haven't changed, .update() doesn't need to do anything at all.

The cost of storing a few hashes in the object is negligible, and computing a hash seems to be faster than performing joins on large tables:

from hashlib import sha1
import numpy as np

# 10 million string names, stored as a fixed-width unicode array
obs_names = np.array([f"obs_{i}" for i in range(10_000_000)])

%%timeit
sha1(obs_names).hexdigest()
# => 511 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
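A rough sketch of how such a hash check could gate the update, under stated assumptions: `index_hash` and `mod_hashes` are hypothetical helpers rather than MuData functions, and each modality is represented as a plain `(obs_names, var_names)` pair. Names are hashed one by one here for portability, which is slower than hashing the array buffer directly as in the timing above.

```python
from hashlib import sha1


def index_hash(names):
    """Hash a sequence of names; order matters, and duplicates are fine."""
    h = sha1()
    for name in names:
        h.update(str(name).encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab", "c"] and ["a", "bc"] hash differently
    return h.hexdigest()


def mod_hashes(mods):
    """Map modality name -> (obs_names hash, var_names hash)."""
    return {name: (index_hash(obs), index_hash(var)) for name, (obs, var) in mods.items()}


mods = {"rna": (["cell_1", "cell_2"], ["gene_a", "gene_b"])}
stored = mod_hashes(mods)  # what the object could keep after an update

# Nothing changed: an update could return early without any joins.
unchanged = mod_hashes(mods) == stored

# An obs_name changed: the hashes differ, so the full update has to run.
mods["rna"] = (["cell_1", "cell_3"], ["gene_a", "gene_b"])
changed = mod_hashes(mods) != stored
```

Because the row and column hashes are kept separately, the same comparison can also tell which axis changed.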

We can probably optimise it even further for the scenario where only the columns have to be updated but the obs_names/var_names haven't changed.

May 25 '22 08:05 gtca