Implement concat with category merging

Open ivirshup opened this issue 2 years ago • 1 comments

Fixes #756

Performance is surprisingly low. It seems that the type-unstable pandas operation can be quite inefficient. This will provide speedups for mid-sized datasets with realistic cell type names, but not for toy examples. There are also significant memory improvements.
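To illustrate what "category merging" means at the pandas level, here is a minimal sketch (not the code in this PR). It assumes the slow path is roughly equivalent to `pd.concat` on categorical columns whose categories differ, which falls back to a non-categorical dtype, and it uses `pandas.api.types.union_categoricals` as a stand-in for the merging step. The category names and sizes are made up for the example.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

rng = np.random.default_rng(0)
n = 100_000  # illustrative size, much smaller than a real dataset

# Two categorical columns whose category sets only partly overlap.
left = pd.Series(
    pd.Categorical.from_codes(
        rng.integers(3, size=n), categories=["B cell", "T cell", "NK cell"]
    )
)
right = pd.Series(
    pd.Categorical.from_codes(
        rng.integers(3, size=n), categories=["T cell", "NK cell", "Monocyte"]
    )
)

# Plain concatenation: because the categories differ, pandas gives up on the
# categorical dtype and materialises every value as a string.
as_strings = pd.concat([left, right], ignore_index=True)

# Category merging: take the union of the categories and remap the integer
# codes, so the result stays categorical.
merged = pd.Series(union_categoricals([left, right]))

print(as_strings.dtype, as_strings.memory_usage(deep=True))
print(merged.dtype, merged.memory_usage(deep=True))
```

The merged column stores one small integer code per row plus four category strings, which is where the memory savings would come from under these assumptions.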

TODO:

  • [ ] Demonstrate current speed and memory improvements
  • [ ] More complicated tests

ivirshup, Apr 25 '22 17:04

Codecov Report

No coverage uploaded for pull request base (master@7f70372). The diff coverage is 100.00%.

@@            Coverage Diff            @@
##             master     #763   +/-   ##
=========================================
  Coverage          ?   83.21%           
=========================================
  Files             ?       34           
  Lines             ?     4450           
  Branches          ?        0           
=========================================
  Hits              ?     3703           
  Misses            ?      747           
  Partials          ?        0           
Impacted Files            Coverage Δ
anndata/_core/merge.py    94.00% <100.00%> (ø)

codecov[bot], Apr 25 '22 17:04

Benchmarks

Setup

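from string import ascii_letters

import pandas as pd, numpy as np, anndata as ad

from scipy import sparse

letters = list(ascii_letters)

# 50 random 30-character names to stand in for realistic cell type labels.
# Note: randint's upper bound is exclusive, so the last letter is never drawn (harmless here).
names = ["".join(np.array(letters)[np.random.randint(len(letters) - 1, size=30)]) for _ in range(50)]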

N = 1_000_000

a = ad.AnnData(
    X=sparse.csr_matrix((N, 0), dtype=np.float32),
    obs=pd.DataFrame(
        {"cat": pd.Categorical.from_codes(np.random.randint(25, size=N), categories=names[::2])},
        index=[f"cell{i:06}" for i in range(N)]
    )
)
b = ad.AnnData(
    X=sparse.csr_matrix((N, 0), dtype=np.float32),
    obs=pd.DataFrame(
        {"cat": pd.Categorical.from_codes(np.random.randint(25, size=N), categories=names[25:])},
        index=[f"cell{i:06}" for i in range(N, N * 2)]
    )
)

On master

%%timeit
c = ad.concat([a, b])
c.strings_to_categoricals()
960 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This branch

%%timeit
c = ad.concat([a, b])
768 ms ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It turns out that most of that ~770 ms is spent checking that the obs names (1 million per object, 2 million after concatenation) are unique.

So it's a speedup of about 20% (960 ms down to 768 ms), but not a huge one.
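For a rough sense of how much a uniqueness check on that many string labels costs by itself, the following sketch times pandas' `Index.is_unique` on an index built with the same naming scheme as the setup above. It assumes the obs-name check boils down to something like `Index.is_unique`, which is not confirmed here, and the timing will vary by machine.

```python
import time

import pandas as pd

# Assumed: total number of obs after concatenating the two 1M-row objects.
n_obs = 2_000_000

# Same naming scheme as the benchmark setup.
obs_names = pd.Index([f"cell{i:06}" for i in range(n_obs)])

start = time.perf_counter()
# The first access does the actual work; pandas caches the result afterwards.
unique = obs_names.is_unique
elapsed = time.perf_counter() - start

print(f"is_unique={unique}, took {elapsed * 1e3:.0f} ms")
```

Only the first access is timed, since later accesses on the same index return the cached value.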

ivirshup, Aug 30 '22 14:08