scanpy
scanpy.pl.highest_expr_genes boxplots contain extra gene rows
Please make sure these conditions are met
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of scanpy.
- [X] (optional) I have confirmed this bug exists on the main branch of scanpy.
What happened?
This issue was first observed, and is documented in more detail, in https://github.com/IGS/gEAR/issues/753
We have recreated the Seurat pipeline (2017 legacy version) from the corresponding Scanpy tutorial, and we have a step that lets users filter their AnnData object based on genes per cell or cells per gene. The filtered AnnData object is written to disk, and then the top 20 expressed genes are plotted with scanpy.pl.highest_expr_genes.
However, what seems to happen is that the counts_top_genes DataFrame (created in https://github.com/scverse/scanpy/blob/b918a23eb77462837df90d7b3a30a573989d4d48/src/scanpy/plotting/_qc.py#L78-L93) still preserves the original Categorical set from the passed-in AnnData object. The shape of the counts_top_genes DataFrame is correct, but when Seaborn plots it in https://github.com/scverse/scanpy/blob/b918a23eb77462837df90d7b3a30a573989d4d48/src/scanpy/plotting/_qc.py#L100 it seems to also plot all the categories that are not present in the counts_top_genes DataFrame.
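To illustrate the mechanism I suspect, here is a minimal pandas-only sketch. It does not call scanpy or Seaborn; the `var` table and gene names are made up for illustration. It only shows that selecting rows from a categorical column keeps the full category set unless unused categories are removed explicitly.

```python
import pandas as pd

# Hypothetical stand-in for adata.var: five genes with a categorical symbol column.
var = pd.DataFrame(
    {"gene_symbol": pd.Categorical(["Actb", "Gapdh", "Malat1", "Xist", "Tubb5"])},
    index=["g1", "g2", "g3", "g4", "g5"],
)

# Selecting two rows (as a "top genes" step might) shrinks the data...
top = var.loc[["g3", "g1"], "gene_symbol"]
n_values = len(top)

# ...but the categorical dtype still lists all five original categories.
n_categories = len(top.cat.categories)

# Dropping unused categories leaves only the labels that actually occur.
trimmed = top.cat.remove_unused_categories()
n_trimmed = len(trimmed.cat.categories)

print(n_values, n_categories, n_trimmed)  # 2 5 2
```

If the plotting code draws one row per category rather than one per observed value, the three extra categories here would correspond to the extra gene rows in the boxplot.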
My temporary workaround is to cast the column I pass as the "gene_symbols" argument to object dtype, which drops the categories and renders the boxplot correctly:
if 'gene_symbol' in adata.var.columns and adata.var['gene_symbol'].dtype.name != 'object':
    adata.var['gene_symbol'] = adata.var['gene_symbol'].astype('object')
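The same workaround can be wrapped in a small helper. This is only a sketch: `decategorize_column` is a hypothetical name, not part of scanpy or anndata, and it is shown on a plain pandas DataFrame standing in for adata.var. Unlike the snippet above, it casts only when the column is actually categorical.

```python
import pandas as pd


def decategorize_column(var: pd.DataFrame, column: str) -> pd.DataFrame:
    """Cast a categorical column to object dtype in place (hypothetical helper)."""
    if column in var.columns and isinstance(var[column].dtype, pd.CategoricalDtype):
        var[column] = var[column].astype("object")
    return var


# Made-up stand-in for adata.var with a categorical gene_symbol column.
var = pd.DataFrame(
    {"gene_symbol": pd.Categorical(["Actb", "Gapdh", "Malat1"])},
    index=["g1", "g2", "g3"],
)
decategorize_column(var, "gene_symbol")

# After the cast, a subset carries no hidden category list.
subset = var.loc[["g3"], "gene_symbol"]
print(var["gene_symbol"].dtype)  # object
```

With an AnnData object this would be called as `decategorize_column(adata.var, 'gene_symbol')` before plotting.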
Minimal code sample
import scanpy as sc
# adata: an AnnData object with a categorical adata.var.gene_symbol column
sc.pl.highest_expr_genes(adata, n_top=20, gene_symbols='gene_symbol', show=True, save=".png")
I also tried this, with the same results:
import scanpy as sc
# adata: an AnnData object with a categorical adata.var.gene_symbol column
adata.var.index = adata.var.gene_symbol
sc.pl.highest_expr_genes(adata, n_top=20, show=True, save=".png")
Error output
Above is a boxplot from sc.pl.highest_expr_genes that shows all the Categorical genes in addition to the top 20 specified in the function argument.
Above is the correct boxplot, after my workaround was applied to force adata.var.gene_symbol to be object dtype instead of Categorical.
Versions
python-3-10-4
aiohttp==3.8.3
anndata==0.10.6
biocode==0.10.0
biopython==1.79
cairosvg==2.7.1
dash-bio==1.0.2
#diffxpy==0.7.4
Flask==3.0.0
Flask-RESTful==0.3.9
gunicorn
h5py==3.10.0
itsdangerous==2.1.2 # See -> https://stackoverflow.com/a/71206978
jupyterlab==4.0.5
jupyter==1.0.0
kaleido==0.2.1
leidenalg==0.10.2
llvmlite==0.41.1
matplotlib==3.9.0
mod-wsgi==4.9.4
more_itertools==9.0.0
mysql-connector-python==8.4.0
numba==0.58.1
numexpr==2.8.4
numpy==1.26.0
opencv-python==4.5.5.64
openpyxl==3.1.5
pandas==2.2.1
Pillow==10.2.0
pika==1.3.1
plotly==5.6.0
python-dotenv==0.20.0
requests==2.31.0
rpy2==3.5.1 # 3.5.2 and up gives errors with rpy2py and py2rpy
sanic
scanpy==1.10.1
scikit-learn==1.0.2
scipy==1.11.04
seaborn==0.13.2
SQLAlchemy==1.4.32
tables==3.9.2 # Read hdf5 files into pandas
xlrd==1.2.0