This is an attempt to fix #1011

I have taken my cue from https://github.com/lmcinnes/umap/pull/259/files as suggested in the issue's thread. I have also added some tests (which focus on ensuring that results are successfully created rather than ensuring correctness).

I hope this helps.

Sep 12 '20 17:09 dwnorton

The CI tests failed due to the image matching problems (in test_violin and test_pbmc3k) fixed by PR #1422 which is now merged into master. Should I merge master into my fix branch?

Sep 22 '20 20:09 dwnorton

yes, please merge with the master branch

Sep 23 '20 19:09 fidelram

Thanks so much for your contribution @dwnorton ! Looks good. Two questions: 1) is there any way to do without densifying the matrix? it's not a big deal though since the matrix is small. 2) Is the UMAP API for the umap.distances.pairwise_special_metric function version-dependent somehow in other words do we need to constrain our umap dependency to a specific version, or is the API stable across all commonly used UMAP versions?

Sep 23 '20 19:09 gokceneraslan

Thanks for the feedback. I have now merged master into the fix branch and the CI tests are happy. To address the shrewd questions asked by @gokceneraslan:

I pondered over this myself. As I mentioned (above), I based my approach on a pull request in umap's own codebase which also resorts to densification of the matrix. As far as I can see, pairwise_special_metric doesn't directly support a sparse matrix and I don't see an alternative approach that wouldn't involve duplicating some of its implementation and (probably) sacrificing some of the benefits (i.e. parallel processing). A deeper analysis by another brain might draw different conclusions.
Another good question. Based on my "git blame" detective work, pairwise_special_metric was introduced in a change on 20 Nov 2018 which should have seen it incorporated into version 0.3.7. Since then its signature has remained compatible (with the only change being the addition of the kwds argument). scanpy's requirements.txt already has umap-learn set to a minimum version of 0.3.10 so I believe we're good on that front.

Sep 25 '20 12:09 dwnorton

Hi @dwnorton,

Sorry for the delay. I can see now that there are a few changes about this on the UMAP side:

Few questions:

Can we improve the sparse array handling (e.g. mimicking UMAP e.g. using SPARSE_SPECIAL_METRICS)
Can we also use the SKLEARN_PAIRWISE_VALID_METRICS instead of catching the ValueError? Would that be safer?
If we start using these keywords, do we need to change any of the requirements of scanpy?

Apr 19 '21 12:04 gokceneraslan

Would it be easier to just use NNDescent's handling of distances here?

Also, this was brought up on the discourse recently.

Apr 19 '21 13:04 ivirshup

@dwnorton seems fine, but could you please commit anything here to trigger ci?

@ivirshup do we want to merge this or replace the current neighbors computation method with nndescent?

May 23 '21 16:05 Koncopd

Hello. Sorry I have been silent but I haven't used scanpy for a few months and thought there was nothing more for me to do with this pull request.

@Koncopd: the commit I made in September last year (e4483e9) triggered the CI and passed the tests. Is there something further you need me to do?

Thanks.

May 23 '21 21:05 dwnorton

Long term, I'd prefer to just use pynndescent since it would be a more simple implementation. That could change some results. What do you think?

May 24 '21 10:05 ivirshup

/AzurePipelines run

May 24 '21 10:05 ivirshup

Azure Pipelines successfully started running 1 pipeline(s).

May 24 '21 10:05 azure-pipelines[bot]

Codecov Report

:exclamation: No coverage uploaded for pull request base (master@62bb643). Click here to learn what that means. The diff coverage is 74.48%.

@@            Coverage Diff            @@
##             master    #1413   +/-   ##
=========================================
  Coverage          ?   71.34%           
=========================================
  Files             ?       92           
  Lines             ?    11186           
  Branches          ?        0           
=========================================
  Hits              ?     7981           
  Misses            ?     3205           
  Partials          ?        0

Impacted Files	Coverage Δ
scanpy/__main__.py	`0.00% <0.00%> (ø)`
scanpy/plotting/_tools/__init__.py	`76.74% <ø> (ø)`
scanpy/plotting/_tools/paga.py	`70.62% <ø> (ø)`
scanpy/plotting/_tools/scatterplots.py	`86.95% <ø> (ø)`
scanpy/plotting/_utils.py	`54.61% <ø> (ø)`
scanpy/plotting/palettes.py	`52.77% <ø> (ø)`
scanpy/preprocessing/__init__.py	`100.00% <ø> (ø)`
scanpy/preprocessing/_combat.py	`90.69% <ø> (ø)`
scanpy/preprocessing/_deprecated/__init__.py	`88.88% <ø> (ø)`
...preprocessing/_deprecated/highly_variable_genes.py	`94.18% <ø> (ø)`
... and 52 more

May 24 '21 11:05 codecov[bot]

@ivirshup i would also prefer pynndescent. I can write it, but i need to look and check that there are no potential problems with this replacement.

May 27 '21 13:05 Koncopd

@Koncopd when we talked about this last there were concerns about backwards reproducibility. I'm wondering if this logic would fix that:

If the dataset is "small" and the metric is defined by scikit-learn, compute complete distances
For all other cases use pynndescent

Would this be sufficient to keep results the same, or do we compute dense distances for the more esoteric metrics now?

Jun 15 '21 10:06 ivirshup

Yes, we can do like this. "do we compute dense distances for the more esoteric metrics now?" - actually i am not sure, it needs to be checked.

Jun 15 '21 21:06 Koncopd

scanpy
scanpy copied to clipboard

Use umap's pairwise_special_metric as a fallback for small AnnDatas with metrics unsupported by sklearn

Codecov Report

scanpy scanpy copied to clipboard

Use umap's pairwise_special_metric as a fallback for small AnnDatas with metrics unsupported by sklearn

Codecov Report

scanpy
scanpy copied to clipboard