Unexpected number of non-zero distances when running `sc.pp.neighbors` with `transformer="pynndescent"`
Please make sure these conditions are met
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of scanpy.
- [ ] (optional) I have confirmed this bug exists on the main branch of scanpy.
What happened?
Per the documentation, `adata.obsp["distances"]` is expected to have k - 1 non-zero entries per row after running `sc.pp.neighbors` with `n_neighbors=k`. However, this is not the case with `transformer="pynndescent"`, where each row has k non-zero entries instead.
Both `transformer="sklearn"` and `rapids_singlecell.pp.neighbors` with `algorithm="brute"` yield distance matrices as documented.
Minimal code sample
import scanpy as sc
import scipy as sp
adata = sc.AnnData(X=sp.sparse.random(5000, 10000, density=0.1, format="csr", rng=1234))
sc.pp.pca(adata, n_comps=100, svd_solver="arpack", random_state=1234)
k = 30
sc.pp.neighbors(
    adata,
    n_neighbors=k,
    n_pcs=100,
    transformer="sklearn",
    key_added="sklearn",
    random_state=1234,
)
sc.pp.neighbors(
    adata,
    n_neighbors=k,
    n_pcs=100,
    transformer="pynndescent",
    key_added="pynndescent",
    random_state=1234,
)
(adata.obsp["sklearn_distances"].count_nonzero(axis=1) == k - 1).all() # returns np.True_
(adata.obsp["pynndescent_distances"].count_nonzero(axis=1) == k - 1).all() # returns np.False_
(adata.obsp["pynndescent_distances"].count_nonzero(axis=1) == k).all() # returns np.True_
Error output
Versions
| Package | Version |
| ------- | ------- |
| numpy | 2.3.3 |
| scanpy | 1.11.4 |
| scipy | 1.16.2 |
| anndata | 0.12.2 |
| Dependency | Version |
| ----------------- | ----------- |
| h5py | 3.14.0 |
| PyYAML | 6.0.2 |
| zarr | 3.1.3 |
| wcwidth | 0.2.13 |
| ipython | 9.5.0 |
| jupyter_core | 5.8.1 |
| llvmlite | 0.45.0 |
| pillow | 11.3.0 |
| pynndescent | 0.5.13 |
| natsort | 8.4.0 |
| pyparsing | 3.2.4 |
| psutil | 7.1.0 |
| cycler | 0.12.1 |
| setuptools | 80.9.0 |
| asttokens | 3.0.0 |
| traitlets | 5.14.3 |
| pyzmq | 27.1.0 |
| tqdm | 4.67.1 |
| parso | 0.8.5 |
| jupyter_client | 8.6.3 |
| tornado | 6.5.2 |
| prompt_toolkit | 3.0.52 |
| pure_eval | 0.2.3 |
| decorator | 5.2.1 |
| ipykernel | 6.30.1 |
| packaging | 25.0 |
| Pygments | 2.19.2 |
| matplotlib | 3.10.6 |
| numba | 0.62.0 |
| python-dateutil | 2.9.0.post0 |
| donfig | 0.8.1.post1 |
| umap-learn | 0.5.9.post2 |
| session-info2 | 0.2.2 |
| platformdirs | 4.4.0 |
| kiwisolver | 1.4.9 |
| pytz | 2025.2 |
| numcodecs | 0.16.3 |
| pandas | 2.3.2 |
| typing_extensions | 4.15.0 |
| joblib | 1.5.2 |
| executing | 2.2.1 |
| threadpoolctl | 3.6.0 |
| six | 1.17.0 |
| debugpy | 1.8.17 |
| jedi | 0.19.2 |
| comm | 0.2.3 |
| legacy-api-wrap | 1.4.1 |
| stack-data | 0.6.3 |
| scikit-learn | 1.7.2 |
| crc32c | 2.7.1 |
| Component | Info |
| --------- | ------------------------------------------------------------------------------ |
| Python | 3.12.11, packaged by conda-forge (main, Jun 4 2025, 14:45:31) [GCC 13.3.0] |
| OS | Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.28 |
| CPU | 64 logical CPU cores, x86_64 |
| GPU | ID: 0, NVIDIA H100 PCIe, Driver: 535.104.12, Memory: 81559 MiB |
| Updated | 2025-09-21 01:59 |
I can reproduce this bug and the problem seems to be pynndescent itself:
https://github.com/lmcinnes/pynndescent/blob/master/pynndescent/pynndescent_.py#L2172
This is in line with the KNeighborsMixin from sklearn:
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/neighbors/_base.py#L826
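To isolate this from scanpy, the two transformers can be compared directly. This is only a minimal sketch on made-up toy data (sizes, `k`, and seeds are arbitrary, and exact counts may differ slightly since pynndescent is approximate):
import numpy as np
from pynndescent import PyNNDescentTransformer
from sklearn.neighbors import KNeighborsTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
k = 15

# kNN graphs from both transformers via fit_transform (the call scanpy makes
# per the discussion above)
pynn_graph = PyNNDescentTransformer(n_neighbors=k, random_state=0).fit_transform(X)
skl_graph = KNeighborsTransformer(mode="distance", n_neighbors=k).fit_transform(X)

for name, graph in [("pynndescent", pynn_graph), ("sklearn", skl_graph)]:
    graph = graph.tocsr()
    stored = np.diff(graph.indptr)                # entries stored per row
    positive = (graph.toarray() > 0).sum(axis=1)  # strictly positive distances per row
    print(name, set(stored), set(positive))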
However, for the sklearn flavor, scanpy uses `KNeighborsTransformer` and calls its `fit_transform` method. That first fits on X and then explicitly calls `transform(X)`, which in turn calls `kneighbors_graph(X)`, which calls `kneighbors(X)`, leading to:
query_is_train = X is None  # Now evaluates as False
if query_is_train:
    X = self._fit_X
    # Include an extra neighbor to account for the sample itself being
    # returned, which is removed later
    n_neighbors += 1
So the query runs with just `n_neighbors`, not `n_neighbors + 1`, and the later removal of the self-match is skipped.
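A quick way to confirm this is to call the two steps explicitly and check that each sample ends up stored as its own zero-distance neighbour. A sketch with made-up toy data (parameter values are arbitrary):
import numpy as np
from sklearn.neighbors import KNeighborsTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
k = 5

t = KNeighborsTransformer(mode="distance", n_neighbors=k)
graph = t.fit(X).transform(X)  # same code path as fit_transform(X)

# Because transform() passes X explicitly, query_is_train is False inside
# kneighbors(), and each sample's zero-distance match with itself stays in
# the returned graph as an explicitly stored zero.
self_stored = all(
    i in graph.indices[graph.indptr[i] : graph.indptr[i + 1]]
    for i in range(X.shape[0])
)
print(self_stored)
print(set(np.diff(graph.indptr)))              # stored entries per row
print(set((graph.toarray() > 0).sum(axis=1)))  # strictly positive distances per row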
Since pynndescent is the default for scanpy, the easiest fix would be to pass `n_neighbors + 1` to the sklearn transformer and update the documentation, though other implementations that rely on their own transformer would be affected as well. In the end, the question is whether this is a bug upstream (is `KNeighborsTransformer` behaving as intended, and should pynndescent match this transformer or the other sklearn implementations?) or whether it should be solved within scanpy, either in the documentation or by adjusting the neighbor count to each method's behavior.
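For completeness, a purely user-level way to recover the documented `k - 1` shape from the pynndescent result would be to keep only the `k - 1` smallest distances per row. This is just an illustrative sketch of the off-by-one, not a proposed patch, and it assumes the extra entry is always the farthest neighbour:
import numpy as np
import scipy.sparse as sp

def trim_knn_distances(dist: sp.csr_matrix, m: int) -> sp.csr_matrix:
    """Keep only the m smallest stored distances in each row of a kNN graph."""
    dist = dist.tocsr().copy()
    for i in range(dist.shape[0]):
        row = dist.data[dist.indptr[i] : dist.indptr[i + 1]]
        if row.size > m:
            # Zero out everything but the m smallest distances of this row;
            # eliminate_zeros() below then drops the emptied entries.
            row[np.argsort(row)[m:]] = 0
    dist.eliminate_zeros()
    return dist

# Hypothetical usage on the reproducer above:
# trimmed = trim_knn_distances(adata.obsp["pynndescent_distances"], k - 1)
# (trimmed.count_nonzero(axis=1) == k - 1).all()  # would then be expected to hold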
After some reflection, I would most likely classify this as a bug in pynndescent, as they explicitly try to emulate sklearn behavior.
`transform` in sklearn is always self-inclusive and requires `X` to be specified, so within `.fit_transform(X)` it can only ever be `.fit(X).transform(X)`. pynndescent, however, calls `.fit(X).transform(None)` from its `fit_transform(X)`:
https://github.com/lmcinnes/pynndescent/blob/master/pynndescent/pynndescent_.py#L2257
Still, what is a bit weird is that `fit(X)` in sklearn always adjusts `n_neighbors` to `n_neighbors + 1`, but there is then no way of using this self-exclusive fit from its `fit_transform` method.
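If that reading is right, the discrepancy should already be visible inside pynndescent itself by comparing the two call paths. Again a toy sketch with arbitrary data and parameters; counts may vary slightly because the search is approximate:
import numpy as np
from pynndescent import PyNNDescentTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
k = 15

# Path taken by fit_transform (queries the fitted index with X=None,
# i.e. the self-exclusive path discussed above):
g_a = PyNNDescentTransformer(n_neighbors=k, random_state=0).fit_transform(X)

# sklearn-style path: fit on X, then explicitly transform the same X:
g_b = PyNNDescentTransformer(n_neighbors=k, random_state=0).fit(X).transform(X)

print(set((g_a.toarray() > 0).sum(axis=1)))  # non-zero distances per row via fit_transform
print(set((g_b.toarray() > 0).sum(axis=1)))  # non-zero distances per row via fit().transform()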
Another point of comparison: Phenograph (as wrapped in sc.external.tl.phenograph) uses a self-exclusive kNN graph which it gets by passing n_neighbors + 1 to sklearn.neighbors.NearestNeighbors and then removing the self-column (cf.), so it effectively matches the inadvertent pynndescent behaviour in sc.pp.neighbors.
I don't have strong feelings on which outcome is preferable – only that it be consistent and documented.
I am also open to both - let's wait for feedback from maintainers. I would be happy to create the PR, once a way forward is decided.