
Refactored scanpy.tools._sparse_nanmean to eliminate unnecessary data…

Open Reovirus opened this issue 8 months ago • 17 comments

Fixes #1894. Reduced redundant data copying; the original matrix is now copied once instead of twice. One copy remains necessary to replace NaNs with zeros without modifying the original matrix.

  • [x] Closes #1894
  • [x] Tests included
  • [x] Release notes not necessary because
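The single-copy idea described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: `sparse_nanmean_sketch` is a hypothetical name, and the only copy of the data array is the one that zeroes out NaNs.

```python
import numpy as np
from scipy import sparse

def sparse_nanmean_sketch(X: sparse.csr_matrix, axis: int) -> np.ndarray:
    """Mean along `axis`, ignoring NaNs, with a single copy of X.data."""
    nan_mask = np.isnan(X.data)                 # mask over stored values only
    data = np.where(nan_mask, 0.0, X.data)      # the single necessary copy
    Z = sparse.csr_matrix((data, X.indices, X.indptr), shape=X.shape)
    sums = np.asarray(Z.sum(axis=axis)).ravel()
    # non-NaN count per row/column = axis length minus the stored NaNs there
    nan_counts = np.asarray(
        sparse.csr_matrix(
            (nan_mask.astype(np.int64), X.indices, X.indptr), shape=X.shape
        ).sum(axis=axis)
    ).ravel()
    return sums / (X.shape[axis] - nan_counts)
```

Implicit zeros count toward the denominator, since they are real (non-NaN) values of the matrix.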

Reovirus avatar Apr 08 '25 14:04 Reovirus

@flying-sheep Could you assign a reviewer, please? The failing test looks like a testing bug.

Reovirus avatar Apr 08 '25 15:04 Reovirus

I could also rewrite the logic without any copying using numba, but it might be slower than the currently implemented methods.

Reovirus avatar Apr 08 '25 17:04 Reovirus

I know I have the "No milestone" failure, but I don't have permission to set a milestone, so I'll probably need a maintainer's help to set it.

Reovirus avatar Apr 09 '25 09:04 Reovirus

@Reovirus I restarted the CI as there should hopefully be no flaky tests at the moment.

Don't worry about the milestone, please.

Zethson avatar Apr 11 '25 09:04 Zethson

The failure is in scipy.sparse.csr_matrix.count_nonzero, which only accepts an axis argument since SciPy 1.15.0. I'll rewrite it with numba.
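For context, per-axis nonzero counts can be computed on any SciPy version directly from the CSR arrays, without the 1.15+ `axis` argument of `count_nonzero`. A version-independent sketch (illustrative, not the fix used in the PR):

```python
import numpy as np
from scipy import sparse

def nnz_per_axis(X: sparse.csr_matrix, axis: int) -> np.ndarray:
    """Count nonzero values per row (axis=1) or per column (axis=0) of a CSR matrix."""
    nz = X.data != 0                # stored entries may include explicit zeros
    if axis == 1:
        # map each stored entry back to its row, then count per row
        rows = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))
        return np.bincount(rows[nz], minlength=X.shape[0])
    # column index of every stored entry is in X.indices
    return np.bincount(X.indices[nz], minlength=X.shape[1])
```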

Reovirus avatar Apr 11 '25 09:04 Reovirus

@Zethson could you please restart CI? I hit a very strange HTTP bug; it's not in my code.

Reovirus avatar Apr 11 '25 14:04 Reovirus

CI passed. Could somebody review this when they get a chance, please?

Reovirus avatar Apr 11 '25 14:04 Reovirus

I made the benchmarks run, let’s see how well this works!

Could you please add a release note fragment?

`hatch run towncrier:create 3570.performance.md`

flying-sheep avatar Apr 11 '25 15:04 flying-sheep

Yes, I'll make a note, thanks!

Reovirus avatar Apr 11 '25 15:04 Reovirus

Benchmark changes

| Change | Before [813608e3] | After [7e23b19d] | Ratio | Benchmark (Parameter) |
|--------|-------------------|------------------|-------|-----------------------|
| - | 10.8±0.2ms | 8.80±0.4ms | 0.81 | preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts-off-axis') |
| - | 48.4±3ms | 43.0±0.9ms | 0.89 | preprocessing_counts.time_filter_genes('pbmc3k', 'counts-off-axis') |
| + | 261M | 308M | 1.18 | tools.peakmem_score_genes |

Comparison: https://github.com/scverse/scanpy/compare/813608e3af7e25214f4370dabb31b13a5e8159aa..7e23b19d9a0e89d1820c6fb1854c5408a814fed4

More details: https://github.com/scverse/scanpy/pull/3570/checks?check_run_id=41426505892

scverse-benchmark[bot] avatar Apr 11 '25 16:04 scverse-benchmark[bot]

according to the benchmarks, this is actually slower than before.

the benchmarks are a bit flaky, so it could be wrong, but usually a factor of 2.5 like here is real.

flying-sheep avatar Apr 11 '25 16:04 flying-sheep

Understood. I'll try to change it and speed it up.

Reovirus avatar Apr 11 '25 17:04 Reovirus

I've replaced np.add.at with np.bincount (it was the slowest spot).

The results are in the attached pictures (old is the old function, new is the new implementation, new_parallel is the new one with njit in one case). The matrix was (5000, 5000) and I used 100 random matrices (~20% NaNs, ~20% zeros per matrix) for testing. The x axis is labeled {input_format: csc|csr}_{axis: 0|1}_{version: old|new|new_with_numba}; the y axis is the time spent in _sparse_nanmean, in seconds, on my (low-resource) Ubuntu VM.

[Attached figures: test_2000_2000_100_boxplot, test_2000_2000_100_violin]

Update: I've seen that the benchmark runs automatically.
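The swap mentioned above rests on a well-known NumPy fact: `np.add.at` performs an unbuffered scatter-add and is slow, while `np.bincount` with `weights` computes the same grouped sums much faster. A minimal micro-example (illustrative values, not the PR's data):

```python
import numpy as np

idx = np.array([0, 2, 2, 1, 0])               # target bin for each value
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# slow path: unbuffered scatter-add, handles repeated indices correctly
out_slow = np.zeros(3)
np.add.at(out_slow, idx, vals)

# fast path: grouped sum via bincount with weights
out_fast = np.bincount(idx, weights=vals, minlength=3)

assert np.array_equal(out_slow, out_fast)     # both give [6., 4., 5.]
```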

Reovirus avatar Apr 11 '25 21:04 Reovirus

@flying-sheep We have a small win in score_genes time (23.4±2ms | 21.8±0.8ms | 0.93 | tools.time_score_genes; the p-value is too high to display because the stderr is high, and the implementation is less effective on data with few zeros) and a small loss in memory. Should I optimize memory usage? It may be caused by mask = np.isnan(data); instead of this, I could recompute the mask in the background inside np.nan_to_num and save one array's worth of memory.

Reovirus avatar Apr 11 '25 23:04 Reovirus

I think these should be parallel numba kernels with mask arguments for the major and minor axis. I have a working implementation in rapids-singlecell that doesn't need to copy at all. Starting from _get_mean_var is, I think, the best way forward.
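The copy-free kernel shape being suggested is a single pass over the CSR arrays, accumulating sums and NaN counts per row. Below is a pure-Python sketch of that loop (the function name is illustrative); under numba the function would be decorated with `@njit(parallel=True)` and the outer loop written with `prange`. It is shown without numba here so it stays self-contained:

```python
import numpy as np
from scipy import sparse

def nanmean_major_axis(X: sparse.csr_matrix) -> np.ndarray:
    """Per-row mean ignoring NaNs, reading X's arrays in place (no copies)."""
    n_rows, n_cols = X.shape
    means = np.zeros(n_rows)
    for i in range(n_rows):                      # prange(n_rows) under numba
        s = 0.0
        nan_count = 0
        for j in range(X.indptr[i], X.indptr[i + 1]):
            v = X.data[j]
            if np.isnan(v):
                nan_count += 1
            else:
                s += v
        denom = n_cols - nan_count               # implicit zeros are real values
        means[i] = s / denom if denom else np.nan
    return means
```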

Intron7 avatar Apr 14 '25 06:04 Intron7

@Intron7 I rewrote the logic. But my local benchmarking gives a different result (I just measured peak memory by …), and I'm trying to fix it. And I have no idea about some metrics like preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts-off-axis'): its sign changes randomly, is that normal?

Reovirus avatar Apr 30 '25 18:04 Reovirus

The kernels already look good. However, the main point I was trying to make is that you can compute the means without subsetting. That means you wouldn't need the get-subset function; you could just use masks within your kernel.
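The mask-within-kernel idea can be sketched like this: compute per-row means over a column subset by checking a boolean mask inside the inner loop, instead of materializing `X[:, mask]` first. Names here are illustrative, not rapids-singlecell's actual API:

```python
import numpy as np
from scipy import sparse

def masked_row_means(X: sparse.csr_matrix, col_mask: np.ndarray) -> np.ndarray:
    """Per-row mean over the columns selected by col_mask, without subsetting X."""
    n_rows = X.shape[0]
    n_selected = int(col_mask.sum())
    out = np.zeros(n_rows)
    for i in range(n_rows):
        s = 0.0
        for j in range(X.indptr[i], X.indptr[i + 1]):
            if col_mask[X.indices[j]]:   # skip columns outside the subset
                s += X.data[j]
            # implicit zeros in selected columns add nothing to the sum
        out[i] = s / n_selected if n_selected else np.nan
    return out
```

This avoids allocating the subset matrix entirely; only the mask lookup is added to the inner loop, which a compiled (e.g. numba) kernel handles cheaply.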

Intron7 avatar May 13 '25 09:05 Intron7