scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

Score genes - control genes index of wrong format

Open Hrovatin opened this issue 3 years ago • 4 comments

  • [x] I have checked that this issue has not already been reported.
  • [ ] I have confirmed this bug exists on the latest version of scanpy.
  • [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.

We ran across a case where score genes does not work for one specific gene, but works for others; the gene is in anndata. The same code works on other datasets preprocessed in the same way.

genes=['INS']
score_name='ins'
'INS' in adata_rawnorm.var_names
True
IndexError                                Traceback (most recent call last)
Input In [109], in <module>
----> 1 sc.tl.score_genes(adata_rawnorm, gene_list=genes, score_name=score_name+'_score',  use_raw=False)

File ~/miniconda3/envs/block/lib/python3.8/site-packages/scanpy/tools/_score_genes.py:167, in score_genes(adata, gene_list, ctrl_size, gene_pool, n_bins, score_name, random_state, copy, use_raw)
    164 else:
    165     X_list = np.nanmean(X_list, axis=1, dtype='float64')
--> 167 X_control = _adata[:, control_genes].X
    168 if issparse(X_control):
    169     X_control = np.array(_sparse_nanmean(X_control, axis=1)).flatten()

File ~/miniconda3/envs/block/lib/python3.8/site-packages/anndata/_core/anndata.py:625, in AnnData.X(self)
    622         X = _subset(X, (self._oidx, self._vidx))
    623 elif self.is_view:
    624     X = as_view(
--> 625         _subset(self._adata_ref.X, (self._oidx, self._vidx)),
    626         ElementRef(self, "X"),
    627     )
    628 else:
    629     X = self._X

File ~/miniconda3/envs/block/lib/python3.8/functools.py:875, in singledispatch.<locals>.wrapper(*args, **kw)
    871 if not args:
    872     raise TypeError(f'{funcname} requires at least '
    873                     '1 positional argument')
--> 875 return dispatch(args[0].__class__)(*args, **kw)

File ~/miniconda3/envs/block/lib/python3.8/site-packages/anndata/_core/index.py:127, in _subset(a, subset_idx)
    125 if all(isinstance(x, cabc.Iterable) for x in subset_idx):
    126     subset_idx = np.ix_(*subset_idx)
--> 127 return a[subset_idx]

IndexError: arrays used as indices must be of integer (or boolean) type

Hrovatin avatar Feb 23 '22 20:02 Hrovatin

I encountered the same problem, when I created a small artificial AnnData with a single gene in gene_list for some unit test. Here is my analysis of the problem:

In this line https://github.com/scverse/scanpy/blob/63141908601632638db8a79e8a1dfa8509cd27af/scanpy/tools/_score_genes.py#L182 control_genes was actually empty, hence the index error. The reason for the empty control_genes genes is https://github.com/scverse/scanpy/blob/63141908601632638db8a79e8a1dfa8509cd27af/scanpy/tools/_score_genes.py#L173 control_genes contained some genes before (in my artificial case only one), but they are removed here, since the genes in control_genes also appeared in gene_list.

I think this is where the bug resides: I assume control_genes should not contain genes from gene_list in the first place. Hence, this line https://github.com/scverse/scanpy/blob/63141908601632638db8a79e8a1dfa8509cd27af/scanpy/tools/_score_genes.py#L167 would need to be changed/complemented: An additional filter for not being a gene in gene_list should fix this issue, if I understand this code correctly.

That being said, I suppose that this issue appears rather rarely in the realistically sized datasets. I assume, that the probability of accidentally picking genes from gene_list as control_genes decreases with increasing number of genes. At least I have not encountered this exception in my experimental datasets. Furthermore, this issue does not make the result wrong, as far as I understand the algorithm, because the control genes are selected randomly anyway.

padix-key avatar Jan 25 '24 15:01 padix-key

This error happened to me too when working on a small dataset and scoring a single gene with ctrl_size=1. This happens at random in the following line: https://github.com/scverse/scanpy/blob/383a61b2db0c45ba622f231f01d0e7546d99566b/scanpy/tools/_score_genes.py#L168

I was working on a small test dataset with limited features and calling the score_genes_cell_cycle function, where only 1 cell cycle gene is left and ctrl_size is set as follows: https://github.com/scverse/scanpy/blob/383a61b2db0c45ba622f231f01d0e7546d99566b/scanpy/tools/_score_genes.py#L258

mumichae avatar Feb 21 '24 14:02 mumichae

I’d consider the issue to be a bug, can we label it as such?

mumichae avatar Feb 21 '24 14:02 mumichae

I agree that this could be handled but does it actually make sense to score only 1 gene using this function? I feel like it assumes there is a group of genes.

lazappi avatar Mar 14 '24 08:03 lazappi