Fixes #756

Performance is surprisingly low: the type-unstable pandas operation seems to be quite inefficient. This change should speed up mid-size datasets with realistic cell type names, but not toy examples. It also reduces memory use significantly.
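For context, here is a minimal sketch of the kind of type instability meant here, using plain pandas rather than the code in this PR. The cell type labels are made up for illustration. When two categorical columns have different category sets, `pd.concat` returns a non-categorical result (object dtype for string categories in the pandas versions current at the time of writing), while `pandas.api.types.union_categoricals` merges the categories and keeps the result categorical.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categorical columns whose category sets only partly overlap
# (labels are hypothetical, chosen for illustration).
left = pd.Series(pd.Categorical(["B cell", "T cell"], categories=["B cell", "T cell"]))
right = pd.Series(pd.Categorical(["T cell", "NK cell"], categories=["T cell", "NK cell"]))

# Plain concatenation: the differing categories mean the result is
# no longer categorical.
naive = pd.concat([left, right], ignore_index=True)

# Merging the categories first keeps the result categorical.
merged = pd.Series(union_categoricals([left, right]))

print("naive: ", naive.dtype)
print("merged:", merged.dtype)
print("merged categories:", list(merged.cat.categories))
```

Whether this PR uses `union_categoricals` internally is not stated in this thread; the sketch only illustrates the dtype behaviour.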

TODO:

[ ] Demonstrate current speed and memory improvements
[ ] More complicated tests
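
One possible way to approach the memory half of the first TODO item is to compare `Series.memory_usage(deep=True)` for the same column stored as a categorical and as plain Python strings. This is only a rough sketch: the labels and the column length are assumptions made for illustration, and it measures a single column rather than a full `ad.concat` call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 25 hypothetical 30-character labels, roughly the size of realistic names.
labels = [f"cell_type_{i:02d}".ljust(30, "x") for i in range(25)]
codes = rng.integers(len(labels), size=100_000)

# The same values stored as a categorical and as an object column.
as_category = pd.Series(pd.Categorical.from_codes(codes, categories=labels))
as_object = as_category.astype(object)

# deep=True also counts the memory held by the Python string objects.
cat_bytes = as_category.memory_usage(deep=True)
obj_bytes = as_object.memory_usage(deep=True)

print(f"category: {cat_bytes / 1e6:.2f} MB, object: {obj_bytes / 1e6:.2f} MB")
```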

Apr 25 '22 17:04 ivirshup

Codecov Report

No coverage uploaded for pull request base (master@7f70372). The diff coverage is 100.00%.

@@            Coverage Diff            @@
##             master     #763   +/-   ##
=========================================
  Coverage          ?   83.21%           
=========================================
  Files             ?       34           
  Lines             ?     4450           
  Branches          ?        0           
=========================================
  Hits              ?     3703           
  Misses            ?      747           
  Partials          ?        0

Impacted Files	Coverage Δ
anndata/_core/merge.py	`94.00% <100.00%> (ø)`

Apr 25 '22 17:04 codecov[bot]

Benchmarks

Setup

from string import ascii_letters

import anndata as ad
import numpy as np
import pandas as pd
from scipy import sparse

letters = list(ascii_letters)

# 50 random 30-character strings standing in for real cell type names
names = ["".join(np.array(letters)[np.random.randint(len(letters) - 1, size=30)]) for _ in range(50)]

N = 1_000_000

a = ad.AnnData(
    X=sparse.csr_matrix((N, 0), dtype=np.float32),
    obs=pd.DataFrame(
        {"cat": pd.Categorical.from_codes(np.random.randint(25, size=N), categories=names[::2])},
        index=[f"cell{i:06}" for i in range(N)]
    )
)
b = ad.AnnData(
    X=sparse.csr_matrix((N, 0), dtype=np.float32),
    obs=pd.DataFrame(
        {"cat": pd.Categorical.from_codes(np.random.randint(25, size=N), categories=names[25:])},
        index=[f"cell{i:06}" for i in range(N, N * 2)]
    )
)

on `master`

%%timeit
c = ad.concat([a, b])
c.strings_to_categoricals()

960 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This branch

%%timeit
c = ad.concat([a, b])

768 ms ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It turns out that most of that ~700 ms is spent checking that the obs names (one million per object) are unique.

So it's a speedup, but not a huge one.
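
A rough way to see how expensive a uniqueness check on this many names is, independently of anndata, is to time `pd.Index.is_unique` on an index of the same size. This is a sketch under the assumption that the check is comparable to `is_unique`; the check anndata actually performs may differ, and timings vary by machine.

```python
import time

import pandas as pd

# Two blocks of one million obs names each, as in the benchmark setup.
n_obs = 2_000_000
obs_names = pd.Index([f"cell{i:06}" for i in range(n_obs)])

start = time.perf_counter()
unique = obs_names.is_unique  # first access hashes every name
elapsed = time.perf_counter() - start

print(f"is_unique={unique}, took {elapsed * 1e3:.0f} ms")
```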

Aug 30 '22 14:08 ivirshup

Implement concat with category merging