
scanpy.pl.highest_expr_genes boxplots contain extra gene rows

Open adkinsrs opened this issue 7 months ago • 0 comments

Please make sure these conditions are met

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of scanpy.
  • [X] (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

This issue was first observed, and is documented in more detail, at https://github.com/IGS/gEAR/issues/753

We have recreated the Seurat pipeline (2017 legacy version) from the corresponding Scanpy tutorial, and one of our steps lets users filter their AnnData object by genes per cell or cells per gene. The filtered AnnData object is written to disk, and the top 20 expressed genes are then plotted with scanpy.pl.highest_expr_genes.

However, what seems to happen is that the counts_top_genes DataFrame (created in https://github.com/scverse/scanpy/blob/b918a23eb77462837df90d7b3a30a573989d4d48/src/scanpy/plotting/_qc.py#L78-L93) still carries the full set of categories from the Categorical column of the AnnData object that was passed in, even though the shape of the DataFrame itself is correct. When Seaborn draws the plot in https://github.com/scverse/scanpy/blob/b918a23eb77462837df90d7b3a30a573989d4d48/src/scanpy/plotting/_qc.py#L100, it seems to also plot every category that is not present in the counts_top_genes DataFrame.
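The suspected mechanism can be illustrated with pandas alone. This is a minimal sketch with made-up gene names, not the scanpy code itself: subsetting a categorical Series keeps every original category, including the ones that no longer appear in the data.

```python
import pandas as pd

# Toy stand-in for a categorical adata.var["gene_symbol"] column
# (the gene names are invented for illustration).
genes = pd.Series(
    pd.Categorical(["GeneA", "GeneB", "GeneC", "GeneD"]),
    name="gene_symbol",
)

# Keep only two genes, as a "top N" selection would.
top = genes[genes.isin(["GeneA", "GeneC"])]

# The values are subset, but the category list is unchanged.
print(list(top))                 # two values
print(list(top.cat.categories))  # still all four categories

# Dropping unused categories leaves only the genes actually present.
trimmed = top.cat.remove_unused_categories()
print(list(trimmed.cat.categories))
```

A plotting library that lays out one row per category, rather than one per observed value, would then draw rows for the filtered-out genes as well, which would match the behaviour described above.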

My temporary workaround is to cast the column I pass as the gene_symbols argument to object dtype, which drops the Categorical and makes the boxplot render correctly:

    if 'gene_symbol' in adata.var.columns and adata.var['gene_symbol'].dtype.name != 'object':
        adata.var['gene_symbol'] = adata.var['gene_symbol'].astype('object')
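An alternative that keeps the column categorical might be to drop only the unused categories after filtering. The following is a sketch under the assumption that the leftover categories are indeed the cause; it has not been verified against the plot here. The helper name drop_unused_gene_categories and the toy var table are hypothetical and only stand in for adata.var:

```python
import pandas as pd

def drop_unused_gene_categories(var, column="gene_symbol"):
    """Hypothetical helper: trim unused categories from a categorical column."""
    if column in var.columns and isinstance(var[column].dtype, pd.CategoricalDtype):
        var[column] = var[column].cat.remove_unused_categories()
    return var

# Toy stand-in for adata.var with a categorical gene_symbol column
# (identifiers and gene names are invented).
var = pd.DataFrame(
    {"gene_symbol": pd.Categorical(["GeneA", "GeneB", "GeneC", "GeneD"])},
    index=["id1", "id2", "id3", "id4"],
)

# Simulate gene filtering: two rows remain, four categories remain.
var = var.loc[["id1", "id3"]].copy()

var = drop_unused_gene_categories(var)
print(var["gene_symbol"].dtype.name)            # column is still categorical
print(list(var["gene_symbol"].cat.categories))  # only the genes that remain
```

Because subsetting leaves the old categories in place, a call like this would have to be made after each filtering step rather than once up front.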

Minimal code sample

import scanpy as sc

# adata: an AnnData object with a categorical adata.var.gene_symbol column

sc.pl.highest_expr_genes(adata, n_top=20, gene_symbols='gene_symbol', show=True, save=".png")

I also tried the following, with the same result:

import scanpy as sc

# adata: an AnnData object with a categorical adata.var.gene_symbol column

adata.var.index = adata.var.gene_symbol
sc.pl.highest_expr_genes(adata, n_top=20, show=True, save=".png")

Error output

[Image: 349265437-b0a6e963-5d56-40e6-9922-5e4a543c08cf]

Above is the boxplot from sc.pl.highest_expr_genes. It shows a row for every gene in the Categorical, in addition to the top 20 requested through the n_top argument.

[Image: Screenshot 2024-07-17 at 1 23 27 PM]

Above is the correct boxplot, produced after my workaround cast adata.var.gene_symbol to object dtype instead of Categorical.

Versions

python-3-10-4

aiohttp==3.8.3
anndata==0.10.6
biocode==0.10.0
biopython==1.79
cairosvg==2.7.1
dash-bio==1.0.2
#diffxpy==0.7.4
Flask==3.0.0
Flask-RESTful==0.3.9
gunicorn
h5py==3.10.0
itsdangerous==2.1.2 # See -> https://stackoverflow.com/a/71206978
jupyterlab==4.0.5
jupyter==1.0.0
kaleido==0.2.1
leidenalg==0.10.2
llvmlite==0.41.1
matplotlib==3.9.0
mod-wsgi==4.9.4
more_itertools==9.0.0
mysql-connector-python==8.4.0
numba==0.58.1
numexpr==2.8.4
numpy==1.26.0
opencv-python==4.5.5.64
openpyxl==3.1.5
pandas==2.2.1
Pillow==10.2.0
pika==1.3.1
plotly==5.6.0
python-dotenv==0.20.0
requests==2.31.0
rpy2==3.5.1 # 3.5.2 and up gives errors with rpy2py and py2rpy
sanic
scanpy==1.10.1
scikit-learn==1.0.2
scipy==1.11.04
seaborn==0.13.2
SQLAlchemy==1.4.32
tables==3.9.2 # Read hdf5 files into pandas
xlrd==1.2.0

adkinsrs, Jul 17 '24 17:07