Ingest won't integrate datasets of different lengths
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of scanpy.
- [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.
I am trying to ingest a CITE-seq dataset into another, already clustered dataset. These datasets have different numbers of cells, but I ran sc.pp.neighbors(n_neighbors=30) on both prior to running UMAP. I have confirmed that both datasets have the same variable names and the same number of variables (38); both objects look identical when I call adata.var (see also the sanity check after the traceback below).
I receive the error: "all input arrays must have the same shape".
sc.pp.neighbors(CODEX_sub, n_neighbors=30)
sc.tl.umap(CODEX_sub)
sc.pp.neighbors(adata_sub, n_neighbors=30)
sc.tl.umap(adata_sub)
sc.tl.ingest(CODEX_sub, adata_sub, obs='leiden', embedding_method='umap')
ValueError Traceback (most recent call last)
<ipython-input-214-01a03312d3df> in <module>
----> 1 sc.tl.ingest(CODEX_sub, adata_sub, obs='leiden', embedding_method='umap')
~\anaconda3\envs\scenv\lib\site-packages\scanpy\tools\_ingest.py in ingest(adata, adata_ref, obs, embedding_method, labeling_method, neighbors_key, inplace, **kwargs)
124 labeling_method = labeling_method * len(obs)
125
--> 126 ing = Ingest(adata_ref, neighbors_key)
127 ing.fit(adata)
128
~\anaconda3\envs\scenv\lib\site-packages\scanpy\tools\_ingest.py in __init__(self, adata, neighbors_key)
383
384 if neighbors_key in adata.uns:
--> 385 self._init_neighbors(adata, neighbors_key)
386 else:
387 raise ValueError(
~\anaconda3\envs\scenv\lib\site-packages\scanpy\tools\_ingest.py in _init_neighbors(self, adata, neighbors_key)
349 else:
350 self._neigh_random_state = neighbors['params'].get('random_state', 0)
--> 351 self._init_pynndescent(neighbors['distances'])
352
353 def _init_pca(self, adata):
~\anaconda3\envs\scenv\lib\site-packages\scanpy\tools\_ingest.py in _init_pynndescent(self, distances)
284
285 first_col = np.arange(distances.shape[0])[:, None]
--> 286 init_indices = np.hstack((first_col, np.stack(distances.tolil().rows)))
287
288 self._nnd_idx = NNDescent(
<__array_function__ internals> in stack(*args, **kwargs)
~\anaconda3\envs\scenv\lib\site-packages\numpy\core\shape_base.py in stack(arrays, axis, out)
424 shapes = {arr.shape for arr in arrays}
425 if len(shapes) != 1:
--> 426 raise ValueError('all input arrays must have the same shape')
427
428 result_ndim = arrays[0].ndim + 1
ValueError: all input arrays must have the same shape
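As a sanity check, a minimal sketch like the following (reusing the CODEX_sub and adata_sub objects from the snippet above) can confirm that the two objects really do share the same variables in the same order, which ingest expects:
# Sketch: verify that both objects expose identical var_names before calling ingest
print(CODEX_sub.n_vars, adata_sub.n_vars)               # both should be 38
print(CODEX_sub.var_names.equals(adata_sub.var_names))  # True if names and order match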
Versions
scanpy==1.7.0 anndata==0.7.6 umap==0.5.1 numpy==1.21.1 scipy==1.7.0 pandas==1.2.5 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.8
Hi, I will check.
Any update? I have the same issue with scanpy==1.8.1.
The two tutorial datasets also have different numbers of rows and ingest works there, so that can't be the problem.
Received, thank you.
To try to understand what's going on, I'm sharing here a comparison of the inputs being provided to the ingest function:
In the Tutorial, this is adata_ref:
|  | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|
| cell1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
| cell2 | 1352 | 0.037936 | 4903.0 | B cells |
| cell3 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
| ... | ... | ... | ... | ... |

2638 rows × 4 columns
And this is adata:
|  | bulk_labels | n_genes | percent_mito | n_counts | S_score | G2M_score | phase | louvain |
|---|---|---|---|---|---|---|---|---|
| cell1 | CD14+ Monocyte | 1003 | 0.023856 | 2557.0 | -0.119160 | -0.816889 | G1 | 1 |
| cell2 | Dendritic | 1080 | 0.027458 | 2695.0 | 0.067026 | -0.889498 | S | 1 |
| cell3 | CD56+ NK | 1228 | 0.016819 | 3389.0 | -0.147977 | -0.941749 | G1 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |

700 rows × 8 columns
Both tutorial adatas after a successful ingest:
(AnnData object with n_obs × n_vars = 700 × 208
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap', 'rep'
varm: 'PCs'
obsp: 'distances', 'connectivities',
AnnData object with n_obs × n_vars = 2638 × 208
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
var: 'n_cells'
uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
varm: 'PCs'
obsp: 'distances', 'connectivities')
Now my data, adata_ref:
|  | celltype | louvain |
|---|---|---|
| cell1 | hepatic stellate cells | 1 |
| cell2 | cholangiocytes | 1 |
| ... | ... | ... |

8439 rows × 2 columns
and my adata that I wish to ingest:
|  | louvain |
|---|---|
| cell1 | 0 |
| cell2 | 0 |
| ... | ... |
Both of my AnnData objects have the same 40 variables and have PCA/UMAP embeddings computed; they look like this:
(AnnData object with n_obs × n_vars = 8989 × 40
obs: 'louvain'
uns: 'pca', 'neighbors', 'umap', 'louvain', 'louvain_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities',
AnnData object with n_obs × n_vars = 8439 × 40
obs: 'celltype', 'louvain'
uns: 'pca', 'neighbors', 'umap', 'louvain', 'louvain_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities')
I suspect the error stems from the nearest-neighbour computation. Or maybe my number of variables (40) is too small?
This is the origin of the error:
Some of the cells have no neighbours in adata_ref.obsp['distances'].tolil().rows:
array([list([223, 280, 316, 5791]), list([3877, 5899, 7766, 7807]),
list([165, 304, 423, 713]), ..., list([]),
list([94, 865, 7077, 7666]), list([])], dtype=object)
(Each list has at most 4 elements because I ran sc.pp.neighbors(adata_ref, n_neighbors=5).)
The above array cannot be stacked with np.stack because the sublists have different lengths.
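To make the failure concrete, here is a small self-contained sketch: stacking ragged lists like the ones above raises exactly this error, and the same length check works as a diagnostic on a real object (adata_ref as above):
import numpy as np

# Toy reproduction: neighbour lists of unequal length cannot be stacked
rows = [[223, 280, 316, 5791], [3877, 5899, 7766, 7807], []]
try:
    np.stack(rows)
except ValueError as err:
    print(err)  # all input arrays must have the same shape

# Diagnostic on real data: how many cells have shorter neighbour lists than the rest?
lengths = np.array([len(r) for r in adata_ref.obsp['distances'].tolil().rows])
print((lengths < lengths.max()).sum(), "cells with missing neighbours")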
A potential workaround is to filter out the cells without neighbours, though this is suboptimal. I tried it, but new rows end up empty; only after repeating the procedure a second time does it work:
DEFINED_NEIGHB_NUM = 5

sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref, n_neighbors=DEFINED_NEIGHB_NUM)
sc.tl.umap(adata_ref)

# keep only cells whose row in the kNN graph has the expected number of neighbours
b = np.array(list(map(len, adata_ref.obsp['distances'].tolil().rows))) == DEFINED_NEIGHB_NUM - 1
adata_ref = adata_ref[b]
A better solution would be to correct the nearest-neighbour assignment so that it doesn't create empty distance lists. Did I understand this correctly?
OK, I identified the root cause of the neighbor error and reported it as a bug:
https://github.com/scverse/scanpy/issues/2244
This probably happens because you have duplicated rows, or cells with almost identical gene counts.
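If duplicated cells are the suspicion, a hedged sketch along these lines counts exactly identical expression rows (assuming the matrix can be densified in memory):
import numpy as np
import scipy.sparse as sp

# Sketch: count observations whose expression row is an exact duplicate of another
X = adata_ref.X.toarray() if sp.issparse(adata_ref.X) else np.asarray(adata_ref.X)
n_unique = np.unique(X, axis=0).shape[0]
print(adata_ref.n_obs - n_unique, "exactly duplicated rows")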
Hi, I have the same issue with Python 3.8.8 and scanpy==1.8.1.
I fixed it by using the adata_ref object directly after sc.pp.neighbors, with no filtering steps after sc.pp.neighbors. Hope this helps!
I found that if I extract a subset of cells from adata_ref, adata_ref.obsp['distances'] changes.
When I used an object that had been filtered after sc.pp.neighbors, the lengths of the items in adata_ref.obsp['distances'].tolil().rows ranged from 34 to 41.
adata_ref = sc.read_h5ad('file_filteration_after_neighbors')  ## number of cells: 11262
pd.Series(map(len, adata_ref.obsp['distances'].tolil().rows)).value_counts()
41    104921
40      5188
39       801
38       229
37        88
36        23
35        11
34         1
dtype: int64
When I used an object taken directly after sc.pp.neighbors, the lengths of the items in adata_ref.obsp['distances'].tolil().rows were all 41.
adata_ref = sc.read_h5ad('file_no_filteration_after_neighbors')  ## number of cells: 11646
pd.Series(map(len, adata_ref.obsp['distances'].tolil().rows)).value_counts()
41    111646
dtype: int64
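To illustrate that point with a minimal sketch (keep_mask here is a hypothetical filter, not something from this thread): subsetting an AnnData slices obsp['distances'] on both axes, so the surviving cells lose the entries that pointed at removed cells:
import numpy as np

lengths_before = np.array([len(r) for r in adata_ref.obsp['distances'].tolil().rows])

keep_mask = np.ones(adata_ref.n_obs, dtype=bool)  # hypothetical filter
keep_mask[:100] = False                           # e.g. drop the first 100 cells
subset = adata_ref[keep_mask]

lengths_after = np.array([len(r) for r in subset.obsp['distances'].tolil().rows])
print("before:", lengths_before.min(), "-", lengths_before.max())
print("after: ", lengths_after.min(), "-", lengths_after.max())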
This error message was hard to debug! Indeed it is due to the behavior of sc.pp.neighbors: cells are sometimes given different numbers of neighbors. Sometimes the errant cells have a number of neighbors greater than zero (unlike in #2244, where some cells had 0 neighbors).
My fix builds on the one above and was:
b = np.array(list(map(len, adata_ref.obsp['distances'].tolil().rows)))  # number of neighbors of each cell
adata_ref_subset = adata_ref[np.where(b == DEFINED_NEIGHB_NUM - 1)[0]].copy()  # keep only cells with the expected number
sc.pp.neighbors(adata_ref_subset, n_neighbors=DEFINED_NEIGHB_NUM)  # rebuild the neighbor graph on the subset
@ViriatoII Your comment helped me fix things, but your fix itself didn't work for me.
First, when I use adata_ref = adata_ref[b], with b defined as above, it interprets b not as a boolean mask but as an index, returning a single cell duplicated over and over. I'm not sure what the intended behavior is here. My solution is to use adata_ref[np.where(b == n_neigh - 1)[0]].
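One possible explanation, sketched below with the names used earlier in this thread (DEFINED_NEIGHB_NUM from the snippet above): if b holds the integer neighbour counts rather than the result of a boolean comparison, AnnData treats it as a list of positions, which would return the same few cells over and over; a boolean mask (or explicit positions) selects the intended cells:
import numpy as np

b = np.array([len(r) for r in adata_ref.obsp['distances'].tolil().rows])  # integer counts per cell

# adata_ref[b] would treat these integers as positions (hence one cell repeated);
# a boolean comparison gives a proper mask instead:
mask = b == DEFINED_NEIGHB_NUM - 1
adata_ref_subset = adata_ref[mask]  # same cells as adata_ref[np.where(mask)[0]]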
However, this subselection changes the number of neighbors that other cells have. For example, for my data,
b = np.array(list(map(len, adata_ref.obsp['distances'].tolil().rows)))
print("Before subselecting", np.unique(b, return_counts=True))
adata_ref_subset = adata_ref[np.where(b == DEFINED_NEIGHB_NUM - 1)[0]]
c = np.array(list(map(len, adata_ref_subset.obsp['distances'].tolil().rows)))
print("After subselecting", np.unique(c, return_counts=True))
yields
Before subselecting, (array([13, 14]), array([     28, 1161359]))
After subselecting, (array([10, 11, 12, 13, 14]), array([     15,       1,     633,      46, 1160664]))
To solve both problems I needed to rebuild the neighbor graph.
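Putting the thread's pieces together, here is a hedged end-to-end sketch of the workaround (names follow the earlier snippets; adata is the query object and 'louvain' the label column from the objects above, adjust to your data):
import numpy as np
import scanpy as sc

# Recompute the graph on the (already filtered) reference so every row is complete
adata_ref = adata_ref.copy()  # make sure we are not working on a view
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref, n_neighbors=DEFINED_NEIGHB_NUM)
sc.tl.umap(adata_ref)

# Verify that no cell has a short or empty row in the kNN graph before ingesting
lengths = np.array([len(r) for r in adata_ref.obsp['distances'].tolil().rows])
assert (lengths == DEFINED_NEIGHB_NUM - 1).all(), "the neighbor graph still has incomplete rows"

sc.tl.ingest(adata, adata_ref, obs='louvain', embedding_method='umap')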
Hi all, did anyone find a version with this bug fixed? Thanks.