
Refactored scanpy.tools._sparse_nanmean to eliminate unnecessary data…

Open Reovirus opened this issue 8 months ago • 17 comments

Fixes #1894. Reduced redundant data copying; the original matrix is now copied once instead of twice. One copy remains necessary to replace NaNs with zeros without modifying the original matrix.

  • [x] Closes #1894
  • [x] Tests included
  • [x] Release notes not necessary because
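The single-copy idea described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: `sparse_nanmean_sketch` is a hypothetical name, and the only copy of the data array is the one that zeroes out NaNs.

```python
import numpy as np
from scipy import sparse

def sparse_nanmean_sketch(X: sparse.csr_matrix, axis: int) -> np.ndarray:
    """Mean along `axis`, ignoring NaNs, with a single copy of X.data."""
    nan_mask = np.isnan(X.data)                 # mask over stored values only
    data = np.where(nan_mask, 0.0, X.data)      # the single necessary copy
    Z = sparse.csr_matrix((data, X.indices, X.indptr), shape=X.shape)
    sums = np.asarray(Z.sum(axis=axis)).ravel()
    # non-NaN count per row/column = axis length minus the stored NaNs there
    nan_counts = np.asarray(
        sparse.csr_matrix(
            (nan_mask.astype(np.int64), X.indices, X.indptr), shape=X.shape
        ).sum(axis=axis)
    ).ravel()
    return sums / (X.shape[axis] - nan_counts)
```

Implicit zeros count toward the denominator, since they are real (non-NaN) values of the matrix.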

Reovirus avatar Apr 08 '25 14:04 Reovirus

@flying-sheep Could you assign a reviewer, please? The failing test looks like a testing bug.

Reovirus avatar Apr 08 '25 15:04 Reovirus

I could also rewrite the logic without any copying using numba, but it might be slower than the currently implemented methods.

Reovirus avatar Apr 08 '25 17:04 Reovirus

I know I have the "No milestone" failure, but I don't have permission to set a milestone, so I'll probably need a maintainer's help to set it.

Reovirus avatar Apr 09 '25 09:04 Reovirus

@Reovirus I restarted the CI as there should hopefully be no flaky tests at the moment.

Don't worry about the milestone, please.

Zethson avatar Apr 11 '25 09:04 Zethson

The failure is in scipy.sparse.csr_matrix.count_nonzero, which only accepts an axis argument since SciPy 1.15.0. I'll rewrite it with numba.
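For context, per-axis nonzero counts can be computed on any SciPy version directly from the CSR arrays, without the 1.15+ `axis` argument of `count_nonzero`. A version-independent sketch (illustrative, not the fix used in the PR):

```python
import numpy as np
from scipy import sparse

def nnz_per_axis(X: sparse.csr_matrix, axis: int) -> np.ndarray:
    """Count nonzero values per row (axis=1) or per column (axis=0) of a CSR matrix."""
    nz = X.data != 0                # stored entries may include explicit zeros
    if axis == 1:
        # map each stored entry back to its row, then count per row
        rows = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))
        return np.bincount(rows[nz], minlength=X.shape[0])
    # column index of every stored entry is in X.indices
    return np.bincount(X.indices[nz], minlength=X.shape[1])
```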

Reovirus avatar Apr 11 '25 09:04 Reovirus

@Zethson could you please restart CI? I hit a very strange HTTP bug; it's not in my code.

Reovirus avatar Apr 11 '25 14:04 Reovirus

CI passed. Could somebody review this when they get a chance, please?

Reovirus avatar Apr 11 '25 14:04 Reovirus

I made the benchmarks run, let’s see how well this works!

Could you please add a release note fragment?

`hatch run towncrier:create 3570.performance.md`

flying-sheep avatar Apr 11 '25 15:04 flying-sheep

Yes, I'll make a note, thanks!

Reovirus avatar Apr 11 '25 15:04 Reovirus

Benchmark changes

| Change | Before [813608e3] | After [7e23b19d] | Ratio | Benchmark (Parameter) |
|--------|-------------------|------------------|-------|-----------------------|
| - | 10.8±0.2ms | 8.80±0.4ms | 0.81 | preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts-off-axis') |
| - | 48.4±3ms | 43.0±0.9ms | 0.89 | preprocessing_counts.time_filter_genes('pbmc3k', 'counts-off-axis') |
| + | 261M | 308M | 1.18 | tools.peakmem_score_genes |

Comparison: https://github.com/scverse/scanpy/compare/813608e3af7e25214f4370dabb31b13a5e8159aa..7e23b19d9a0e89d1820c6fb1854c5408a814fed4

More details: https://github.com/scverse/scanpy/pull/3570/checks?check_run_id=41426505892

scverse-benchmark[bot] avatar Apr 11 '25 16:04 scverse-benchmark[bot]

according to the benchmarks, this is actually slower than before.

the benchmarks are a bit flaky, so it could be wrong, but usually a factor of 2.5 like here is real.

flying-sheep avatar Apr 11 '25 16:04 flying-sheep

Understood. I'll try to change it and speed it up.

Reovirus avatar Apr 11 '25 17:04 Reovirus

I've replaced np.add.at with np.bincount (it was the slowest spot).

The results are in the attached pictures (old is the old function, new is the new implementation, new_parallel is the new one with njit in one case). The matrix was (5000, 5000) and I used 100 random matrices (~20% NaNs, ~20% zeros per matrix) for testing. The x axis is labeled {input_format: csc|csr}_{axis: 0|1}_{version: old|new|new_with_numba}; the y axis is the time spent in _sparse_nanmean, in seconds, on my (low-resource) Ubuntu VM.

[Attached figures: test_2000_2000_100_boxplot, test_2000_2000_100_violin]

Update: I've seen that the benchmark runs automatically.
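The swap mentioned above rests on a well-known NumPy fact: `np.add.at` performs an unbuffered scatter-add and is slow, while `np.bincount` with `weights` computes the same grouped sums much faster. A minimal micro-example (illustrative values, not the PR's data):

```python
import numpy as np

idx = np.array([0, 2, 2, 1, 0])               # target bin for each value
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# slow path: unbuffered scatter-add, handles repeated indices correctly
out_slow = np.zeros(3)
np.add.at(out_slow, idx, vals)

# fast path: grouped sum via bincount with weights
out_fast = np.bincount(idx, weights=vals, minlength=3)

assert np.array_equal(out_slow, out_fast)     # both give [6., 4., 5.]
```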

Reovirus avatar Apr 11 '25 21:04 Reovirus

@flying-sheep We have a small win in score_genes time (23.4±2ms | 21.8±0.8ms | 0.93 | tools.time_score_genes; the p-value is too high to display because the stderr is high, and the implementation is less effective on data with few zeros) and a small loss in memory. Should I optimize memory usage? It may be caused by mask = np.isnan(data); instead of this, I could recompute the mask in the background inside np.nan_to_num and save one array's worth of memory.

Reovirus avatar Apr 11 '25 23:04 Reovirus

I think these should be parallel numba kernels with mask arguments for the major and minor axis. I have a working implementation in rapids-singlecell that doesn't need to copy at all. Starting from _get_mean_var is, I think, the best way forward.
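The copy-free kernel shape being suggested is a single pass over the CSR arrays, accumulating sums and NaN counts per row. Below is a pure-Python sketch of that loop (the function name is illustrative); under numba the function would be decorated with `@njit(parallel=True)` and the outer loop written with `prange`. It is shown without numba here so it stays self-contained:

```python
import numpy as np
from scipy import sparse

def nanmean_major_axis(X: sparse.csr_matrix) -> np.ndarray:
    """Per-row mean ignoring NaNs, reading X's arrays in place (no copies)."""
    n_rows, n_cols = X.shape
    means = np.zeros(n_rows)
    for i in range(n_rows):                      # prange(n_rows) under numba
        s = 0.0
        nan_count = 0
        for j in range(X.indptr[i], X.indptr[i + 1]):
            v = X.data[j]
            if np.isnan(v):
                nan_count += 1
            else:
                s += v
        denom = n_cols - nan_count               # implicit zeros are real values
        means[i] = s / denom if denom else np.nan
    return means
```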

Intron7 avatar Apr 14 '25 06:04 Intron7

@Intron7 I rewrote the logic. But my local benchmarking gives a different result (I just measured peak memory by …), and I'm trying to fix it. And I have no idea about some metrics like preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts-off-axis'): its sign changes randomly, is that normal?

Reovirus avatar Apr 30 '25 18:04 Reovirus

The kernels already look good. However, the main point I was trying to make is that you can compute the means without subsetting. That means you wouldn't need the get-subset function; you could just use masks within your kernel.
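The mask-within-kernel idea can be sketched like this: compute per-row means over a column subset by checking a boolean mask inside the inner loop, instead of materializing `X[:, mask]` first. Names here are illustrative, not rapids-singlecell's actual API:

```python
import numpy as np
from scipy import sparse

def masked_row_means(X: sparse.csr_matrix, col_mask: np.ndarray) -> np.ndarray:
    """Per-row mean over the columns selected by col_mask, without subsetting X."""
    n_rows = X.shape[0]
    n_selected = int(col_mask.sum())
    out = np.zeros(n_rows)
    for i in range(n_rows):
        s = 0.0
        for j in range(X.indptr[i], X.indptr[i + 1]):
            if col_mask[X.indices[j]]:   # skip columns outside the subset
                s += X.data[j]
            # implicit zeros in selected columns add nothing to the sum
        out[i] = s / n_selected if n_selected else np.nan
    return out
```

This avoids allocating the subset matrix entirely; only the mask lookup is added to the inner loop, which a compiled (e.g. numba) kernel handles cheaply.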

Intron7 avatar May 13 '25 09:05 Intron7