Implement concat with category merging
Fixes #756
Performance is surprisingly low: the type-unstable pandas operations can be quite inefficient. This provides speedups for mid-size datasets with realistic cell type names, but not for toy examples. There are also significant memory improvements.
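The actual merge logic lands in anndata/_core/merge.py; as a rough illustration only (a sketch, not the PR's code), the pandas primitive that merges category sets without round-tripping through object-dtype strings is `union_categoricals`:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categoricals with overlapping but distinct category sets,
# like obs["cat"] from two AnnData objects being concatenated.
a = pd.Categorical(["x", "y"], categories=["x", "y"])
b = pd.Categorical(["y", "z"], categories=["y", "z"])

# Values are concatenated and the category sets are unioned,
# staying categorical throughout (no cast to object dtype).
merged = union_categoricals([a, b])
# list(merged) == ["x", "y", "y", "z"]; categories ["x", "y", "z"]
```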
TODO:
- [ ] Demonstrate current speed and memory improvements
- [ ] More complicated tests
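For the first TODO item, the memory side could be demonstrated with the stdlib `tracemalloc` (a sketch; NumPy allocations are reported to tracemalloc, and the real comparison would wrap the `ad.concat` call on each branch):

```python
import tracemalloc
import numpy as np

tracemalloc.start()
# Stand-in for `c = ad.concat([a, b])` on each branch.
arr = np.zeros(1_000_000, dtype=np.float64)  # ~8 MB allocation
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak: {peak / 1e6:.1f} MB")
```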
Codecov Report
:exclamation: No coverage uploaded for pull request base (master@7f70372). Click here to learn what that means. The diff coverage is 100.00%.
@@ Coverage Diff @@
## master #763 +/- ##
=========================================
Coverage ? 83.21%
=========================================
Files ? 34
Lines ? 4450
Branches ? 0
=========================================
Hits ? 3703
Misses ? 747
Partials ? 0
Impacted Files | Coverage Δ | |
---|---|---|
anndata/_core/merge.py | 94.00% <100.00%> (ø) |
Benchmarks
Setup
from natsort import natsorted
from string import ascii_letters
import pandas as pd, numpy as np, anndata as ad
from scipy import sparse
letters = list(ascii_letters)
names = ["".join(np.array(letters)[np.random.randint(len(letters) - 1, size=30)]) for _ in range(50)]
N = 1_000_000
a = ad.AnnData(
    X=sparse.csr_matrix((N, 0), dtype=np.float32),
    obs=pd.DataFrame(
        {"cat": pd.Categorical.from_codes(np.random.randint(25, size=N), categories=names[::2])},
        index=[f"cell{i:06}" for i in range(N)],
    ),
)
b = ad.AnnData(
    X=sparse.csr_matrix((N, 0), dtype=np.float32),
    obs=pd.DataFrame(
        {"cat": pd.Categorical.from_codes(np.random.randint(25, size=N), categories=names[25:])},
        index=[f"cell{i:06}" for i in range(N, N * 2)],
    ),
)
On master
%%timeit
c = ad.concat([a, b])
c.strings_to_categoricals()
960 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This branch
%%timeit
c = ad.concat([a, b])
768 ms ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And it turns out most of that ~700 ms is spent checking that each of the two million obs names is unique. So it's a speedup, but not a huge one.
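That claim is easy to spot-check in isolation (a sketch with a smaller index than the benchmark's two million names, assuming the duplicate check boils down to hashing every label, as `pd.Index.is_unique` does):

```python
import time
import pandas as pd

# A smaller stand-in for the 2,000,000 obs names in the concatenated result.
idx = pd.Index([f"cell{i:06}" for i in range(200_000)])

t0 = time.perf_counter()
unique = idx.is_unique  # builds a hash table over every label
elapsed = time.perf_counter() - t0
print(f"is_unique={unique} in {elapsed * 1e3:.1f} ms")
```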