Refactored scanpy.tools._sparse_nanmean to eliminate unnecessary data…
Fixes #1894. Reduced redundant data copying; the original matrix is now copied once instead of twice. One copy remains necessary to replace NaNs with zeros without modifying the original matrix.
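For illustration, a minimal sketch of the single-copy idea for the row-wise case (the function name and structure are mine, not the PR's exact code). Implicit zeros count as real values, so only NaNs are excluded from the denominator:

```python
import numpy as np
import scipy.sparse as sp

def sparse_nanmean_rows(X: sp.csr_matrix) -> np.ndarray:
    """Row-wise nanmean of a CSR matrix; implicit zeros count as real values."""
    nan_mask = np.isnan(X.data)                  # NaNs among the stored values
    Y = X.copy()                                 # the single remaining copy
    Y.data[nan_mask] = 0                         # zero out NaNs without touching X
    row_sums = np.asarray(Y.sum(axis=1)).ravel()
    # row index of every stored value, used to count NaNs per row
    rows = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))
    n_nan_per_row = np.bincount(rows[nan_mask], minlength=X.shape[0])
    return row_sums / (X.shape[1] - n_nan_per_row)
```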
- [x] Closes #1894
- [x] Tests included
- [x] Release notes not necessary because
@flying-sheep Can you assign a reviewer, please? The failing test looks like a testing bug.
I could also rewrite the logic without any copying using numba, but it might be slower than the currently implemented methods.
I know I have a "No milestone" failure, but I don't have permission to set a milestone, so I'll probably need a maintainer's help with that.
@Reovirus I restarted the CI as there should hopefully be no flaky tests at the moment.
Don't worry about the milestone, please.
The failure is in scipy.sparse.csr_matrix.count_nonzero, which only accepts an axis argument since scipy 1.15.0. I'll rewrite it with numba.
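One scipy-version-independent alternative (my sketch, not necessarily what the PR ends up doing) is to derive the per-axis counts of stored entries directly from the CSR structure; this matches count_nonzero as long as there are no explicitly stored zeros:

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(100, 50, density=0.1, format="csr")

# Stored entries per row: just the difference of the CSR index pointer.
nnz_per_row = np.diff(X.indptr)

# Stored entries per column: tally the column index of every stored value.
nnz_per_col = np.bincount(X.indices, minlength=X.shape[1])
```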
@Zethson could you please restart the CI? I'm hitting a very strange HTTP bug; it's not coming from my code.
CI passed, could somebody review this when they have time, please?
I made the benchmarks run, let’s see how well this works!
Could you please add a release note fragment?
`hatch run towncrier:create 3570.performance.md`
Yes, I'll add the note, thanks!
Benchmark changes
| Change | Before [813608e3] | After [7e23b19d] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| - | 10.8±0.2ms | 8.80±0.4ms | 0.81 | preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts-off-axis') |
| - | 48.4±3ms | 43.0±0.9ms | 0.89 | preprocessing_counts.time_filter_genes('pbmc3k', 'counts-off-axis') |
| + | 261M | 308M | 1.18 | tools.peakmem_score_genes |
Comparison: https://github.com/scverse/scanpy/compare/813608e3af7e25214f4370dabb31b13a5e8159aa..7e23b19d9a0e89d1820c6fb1854c5408a814fed4
More details: https://github.com/scverse/scanpy/pull/3570/checks?check_run_id=41426505892
according to the benchmarks, this is actually slower than before.
the benchmarks are a bit flaky, so it could be wrong, but usually a factor of 2.5 like here is real.
Understood. I'll try to change it and speed it up.
I've replaced np.add.at with np.bincount (it was the slowest part).
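For context, the change amounts to expressing the scatter-add as a weighted bincount; a self-contained sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.integers(0, 1000, size=1_000_000)    # target bin for each value
vals = rng.random(1_000_000)

# Slow path: unbuffered scatter-add.
out = np.zeros(1000)
np.add.at(out, idx, vals)

# Fast path: the same per-index sum expressed as a weighted bincount.
out_fast = np.bincount(idx, weights=vals, minlength=1000)

assert np.allclose(out, out_fast)
```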
The results are in the pictures (old is the old function, new is the new implementation, new_parallel is the new one with njit in one case. The matrix was (5000, 5000) and I used 100 random matrices (~20% NaNs, ~20% zeros per matrix) for testing). The x-axis labels are {input_format: csc|csr}{axis: 0|1}{version: old|new|new_with_numba}, and the y-axis is the time _sparse_nanmean spent, in seconds, on my (low-resource) Ubuntu VM.
UPD: I see the benchmark runs automatically.
@flying-sheep We have a small win in score_genes time (23.4±2ms → 21.8±0.8ms, ratio 0.93, tools.time_score_genes; the p-value is too high to display because the stderr is high, and the implementation is less effective on data with few zeros) and a small loss in memory. Should I optimize memory usage? It is probably caused by mask = np.isnan(data); instead of that I could recompute the mask inside the np.nan_to_num step and save one array.
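A hypothetical sketch of that idea (the helper name and signature are mine, not the PR's code): fuse the NaN check, the zeroing, and the per-row NaN count into a single numba pass over the copied CSR data, so no separate boolean mask array is allocated:

```python
import numpy as np
import numba

@numba.njit(cache=True)
def zero_nans_and_count(data, indptr):
    """Zero out NaNs in a CSR data array in place and count NaNs per row
    in a single pass, avoiding a separate mask = np.isnan(data) array."""
    n_rows = indptr.shape[0] - 1
    n_nan = np.zeros(n_rows, dtype=np.int64)
    for i in range(n_rows):
        for j in range(indptr[i], indptr[i + 1]):
            if np.isnan(data[j]):
                data[j] = 0.0
                n_nan[i] += 1
    return n_nan
```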
I think these should be parallel numba kernels with mask arguments for the major and minor axis. I have a working implementation in rapids-singlecell that doesn't need to copy at all. Starting from _get_mean_var is, I think, the best way forward.
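A rough sketch of what such a kernel could look like for the major axis of a CSR matrix (my reconstruction of the idea, not the rapids-singlecell code):

```python
import numpy as np
import numba

@numba.njit(parallel=True, cache=True)
def nanmean_major_axis(data, indptr, n_minor):
    """Row-wise nanmean of a CSR matrix without copying anything;
    implicit zeros count as real values."""
    n_major = indptr.shape[0] - 1
    out = np.zeros(n_major, dtype=np.float64)
    for i in numba.prange(n_major):
        s = 0.0
        n_nan = 0
        for j in range(indptr[i], indptr[i + 1]):
            v = data[j]
            if np.isnan(v):
                n_nan += 1
            else:
                s += v
        out[i] = s / (n_minor - n_nan)
    return out
```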
@Intron7 I've rewritten the logic. But my local benchmarking gives a different result (I just measure peak memory by ), and I'm trying to fix it. Also, I have no idea about some metrics like preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts-off-axis'); it changes sign randomly, is that normal?
The kernels already look good. However, the main point I was trying to make is that you can compute the means without subsetting. That means you wouldn't need to use the get-subset function, but could just use masks within your kernel.
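To illustrate the suggestion (again a sketch under my own naming, not code from either repo): the reduction can take a boolean mask over the minor axis and simply skip unselected columns, so no subset matrix is ever materialized:

```python
import numpy as np
import numba

@numba.njit(parallel=True, cache=True)
def nanmean_major_axis_masked(data, indices, indptr, minor_mask):
    """Row-wise nanmean over a column subset given as a boolean mask,
    so the CSR matrix never has to be sliced or copied."""
    n_major = indptr.shape[0] - 1
    n_selected = minor_mask.sum()       # implicit zeros of selected columns still count
    out = np.zeros(n_major, dtype=np.float64)
    for i in numba.prange(n_major):
        s = 0.0
        n_nan = 0
        for j in range(indptr[i], indptr[i + 1]):
            if not minor_mask[indices[j]]:
                continue                # column not selected: skip instead of subsetting
            v = data[j]
            if np.isnan(v):
                n_nan += 1
            else:
                s += v
        out[i] = s / (n_selected - n_nan)
    return out
```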