scanpy
scanpy copied to clipboard
Score genes - control genes index of wrong format
- [x] I have checked that this issue has not already been reported.
- [ ] I have confirmed this bug exists on the latest version of scanpy.
- [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.
We ran across a case where score genes does not work for one specific gene, but works for others; the gene is in anndata. The same code works on other datasets preprocessed in the same way.
genes=['INS']
score_name='ins'
'INS' in adata_rawnorm.var_names
True
IndexError Traceback (most recent call last)
Input In [109], in <module>
----> 1 sc.tl.score_genes(adata_rawnorm, gene_list=genes, score_name=score_name+'_score', use_raw=False)
File ~/miniconda3/envs/block/lib/python3.8/site-packages/scanpy/tools/_score_genes.py:167, in score_genes(adata, gene_list, ctrl_size, gene_pool, n_bins, score_name, random_state, copy, use_raw)
164 else:
165 X_list = np.nanmean(X_list, axis=1, dtype='float64')
--> 167 X_control = _adata[:, control_genes].X
168 if issparse(X_control):
169 X_control = np.array(_sparse_nanmean(X_control, axis=1)).flatten()
File ~/miniconda3/envs/block/lib/python3.8/site-packages/anndata/_core/anndata.py:625, in AnnData.X(self)
622 X = _subset(X, (self._oidx, self._vidx))
623 elif self.is_view:
624 X = as_view(
--> 625 _subset(self._adata_ref.X, (self._oidx, self._vidx)),
626 ElementRef(self, "X"),
627 )
628 else:
629 X = self._X
File ~/miniconda3/envs/block/lib/python3.8/functools.py:875, in singledispatch.<locals>.wrapper(*args, **kw)
871 if not args:
872 raise TypeError(f'{funcname} requires at least '
873 '1 positional argument')
--> 875 return dispatch(args[0].__class__)(*args, **kw)
File ~/miniconda3/envs/block/lib/python3.8/site-packages/anndata/_core/index.py:127, in _subset(a, subset_idx)
125 if all(isinstance(x, cabc.Iterable) for x in subset_idx):
126 subset_idx = np.ix_(*subset_idx)
--> 127 return a[subset_idx]
IndexError: arrays used as indices must be of integer (or boolean) type
I encountered the same problem, when I created a small artificial AnnData with a single gene in gene_list for some unit test. Here is my analysis of the problem:
In this line
https://github.com/scverse/scanpy/blob/63141908601632638db8a79e8a1dfa8509cd27af/scanpy/tools/_score_genes.py#L182
control_genes was actually empty, hence the index error.
The reason for the empty control_genes genes is
https://github.com/scverse/scanpy/blob/63141908601632638db8a79e8a1dfa8509cd27af/scanpy/tools/_score_genes.py#L173
control_genes contained some genes before (in my artificial case only one), but they are removed here, since the genes in control_genes also appeared in gene_list.
I think this is where the bug resides:
I assume control_genes should not contain genes from gene_list in the first place. Hence, this line
https://github.com/scverse/scanpy/blob/63141908601632638db8a79e8a1dfa8509cd27af/scanpy/tools/_score_genes.py#L167
would need to be changed/complemented:
An additional filter for not being a gene in gene_list should fix this issue, if I understand this code correctly.
That being said, I suppose that this issue appears rather rarely in the realistically sized datasets. I assume, that the probability of accidentally picking genes from gene_list as control_genes decreases with increasing number of genes.
At least I have not encountered this exception in my experimental datasets.
Furthermore, this issue does not make the result wrong, as far as I understand the algorithm, because the control genes are selected randomly anyway.
This error happened to me too when working on a small dataset and scoring a single gene with ctrl_size=1. This happens at random in the following line: https://github.com/scverse/scanpy/blob/383a61b2db0c45ba622f231f01d0e7546d99566b/scanpy/tools/_score_genes.py#L168
I was working on a small test dataset with limited features and calling the score_genes_cell_cycle function, where only 1 cell cycle gene is left and ctrl_size is set as follows:
https://github.com/scverse/scanpy/blob/383a61b2db0c45ba622f231f01d0e7546d99566b/scanpy/tools/_score_genes.py#L258
I’d consider the issue to be a bug, can we label it as such?
I agree that this could be handled but does it actually make sense to score only 1 gene using this function? I feel like it assumes there is a group of genes.