sc.pl.dotplot is surprisingly inefficient for subsets of data
Please make sure these conditions are met
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of scanpy.
- [ ] (optional) I have confirmed this bug exists on the main branch of scanpy.
What happened?
During data analysis, a very common operation is to check some markers in particular clusters of the data. Typically I do something like:

```python
sc.pl.dotplot(
    adata[adata.obs["cell_type"].str.startswith("Macrophage")],
    var_names=macrophage_markers,
    groupby="cell_type",
)
```

Surprisingly, this step becomes very computationally intensive when a subset of cells is selected. It runs longer than on the full adata and sometimes even crashes the kernel if I am close to the RAM limit. I assume part of the adata is copied in this step? It would be great to fix this, because subsetting before plotting is a very frequent operation in an analysis, and I don't see why it should be this expensive.
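For intuition on where a copy can come from (a sketch, not scanpy's actual code path): an AnnData subset is a lazy view, and as soon as downstream code accesses `.X`, the boolean-mask selection has to be resolved, which in NumPy always allocates a new array:

```python
import numpy as np

# Toy "expression matrix": 10,000 cells x 2,000 genes
X = np.random.default_rng(0).random((10_000, 2_000))

# Boolean mask selecting a subset of cells, as in adata[mask]
mask = np.zeros(10_000, dtype=bool)
mask[:500] = True

subset = X[mask]  # boolean indexing always copies; it cannot return a view
print(subset.base is None)  # True: a fresh allocation, not a view into X
print(subset.shape)         # (500, 2000)
```

So some allocation proportional to the subset is unavoidable; the question in this issue is why the overhead can exceed the cost of plotting the full matrix.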
Minimal code sample
```python
cd_4_t_cell_markers = {
    "Central memory": ["IL7R", "TCF7", "LEF1", "CCR7", "SELL", "KLF2", "SOCS3"],
    "NR3C1": ["NR3C1", "NCOA1", "ZEB2", "AOAH", "PARP8", "CMIP", "CHST11", "RBPJ", "RUNX2", "ADAM19"],
    "Naive": ["BACH2", "KLF12", "FOXP1", "PRKCA", "PDE3B", "ABCC1", "SMCHD1"],
    "Treg": ["FOXP3", "IL2RA", "CTLA4", "TNFRSF4", "TNFRSF18", "PELI1", "TIGIT", "IKZF2"],
    "Effector memory": ["S100A4", "IL32", "LGALS1", "ALOX5AP", "CD52", "CCL5"],
    "Resident memory": ["ITGAE", "CD69"],
}

adata = sc.datasets.pbmc3k_processed()

# Uncomment in a jupyter notebook
# %%time
sc.pl.dotplot(adata, var_names=cd_4_t_cell_markers, groupby="louvain")

# Run in a separate cell
# %%time
sc.pl.dotplot(
    adata[adata.obs["louvain"].str.contains("T cells")],
    var_names=cd_4_t_cell_markers,
    groupby="louvain",
)
```
The second cell runs longer even though it uses less data. On my dataset (200k cells), the plot for a subset of the cells took twice as long: 11.1 s vs 23 s.
Error output
Versions
-----
anndata 0.10.9
scanpy 1.10.3
-----
PIL 10.1.0
adjustText 1.3.0
anyio NA
appnope 0.1.2
asttokens NA
attr 23.1.0
attrs 23.1.0
autograd NA
autograd_gamma NA
babel 2.11.0
backcall 0.2.0
bottleneck 1.4.2
brotli 1.0.9
cachetools 5.3.2
causallearn NA
cell2cell 0.7.4
certifi 2023.11.17
cffi 1.16.0
chardet 5.2.0
charset_normalizer 2.0.4
comm 0.1.2
cryptography 41.0.7
cycler 0.12.1
cython_runtime NA
dateutil 2.8.2
db_dtypes 1.2.0
debugpy 1.6.7
decorator 5.1.1
decoupler 1.8.0
defusedxml 0.7.1
dill 0.3.8
docrep 0.3.2
dowhy 0.11.1
ehrapy 0.8.0
executing 0.8.3
fastjsonschema NA
fhiry 3.2.2
filelock 3.14.0
fontTools 4.46.0
formulaic 1.1.1
future 1.0.0
google NA
grpc 1.64.1
grpc_status NA
h5py 3.9.0
idna 3.4
igraph 0.11.5
imblearn 0.12.3
interface_meta 1.3.0
ipykernel 6.25.0
ipywidgets 8.0.4
jedi 0.18.1
jinja2 3.0.3
joblib 1.3.2
json5 NA
jsonschema 4.19.2
jsonschema_specifications NA
jupyter_events 0.8.0
jupyter_server 2.10.0
jupyterlab_server 2.25.1
kiwisolver 1.4.5
kneed 0.8.5
lamin_utils 0.13.2
legacy_api_wrap NA
leidenalg 0.10.2
liana 1.5.1
lifelines 0.29.0
llvmlite 0.43.0
markupsafe 2.1.1
matplotlib 3.8.2
matplotlib_inline 0.1.6
memory_profiler 0.61.0
missingno 0.5.2
mizani 0.14.2
mpl_toolkits NA
mpmath 1.3.0
mudata 0.3.1
natsort 7.1.1
nbformat 5.9.2
networkx 3.2.1
numba 0.60.0
numexpr 2.8.7
numpy 1.26.2
overrides NA
packaging 23.1
pandas 2.2.3
parso 0.8.3
patient_representation 0.1.29
patpy 0.0.3
patsy 0.5.4
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.10.0
plotnine 0.15.0
prometheus_client NA
prompt_toolkit 3.0.36
psutil 5.9.0
ptyprocess 0.7.0
pure_eval 0.2.2
pyarrow 16.1.0
pycparser 2.21
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pydot 2.0.0
pygments 2.15.1
pynndescent 0.5.11
pyparsing 3.1.1
pythonjsonlogger NA
pytz 2023.3.post1
rapidfuzz 3.9.3
referencing NA
requests 2.31.0
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rich NA
rpds NA
scipy 1.11.4
seaborn 0.13.2
send2trash NA
session_info 1.0.0
setuptools 68.0.0
six 1.16.0
sklearn 1.6.1
sniffio 1.2.0
socks 1.7.1
sparse 0.15.4
stack_data 0.2.0
statannotations 0.7.2
statsmodels 0.14.0
sympy 1.13.1
tableone 0.8.0
tabulate 0.9.0
tensorly 0.8.1
texttable 1.7.0
thefuzz 0.22.1
threadpoolctl 3.2.0
torch 2.5.0
torchgen NA
tornado 6.3.3
tqdm 4.66.5
traitlets 5.7.1
typing_extensions NA
umap 0.5.5
urllib3 1.26.18
vscode NA
wcwidth 0.2.5
websocket 0.58.0
wrapt 1.16.0
yaml 6.0.1
zmq 25.1.0
-----
IPython 8.15.0
jupyter_client 8.6.0
jupyter_core 5.5.0
jupyterlab 4.0.8
notebook 7.0.6
-----
Python 3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6 ]
macOS-12.6-arm64-arm-64bit
-----
Session information updated at 2025-07-17 15:35
I made a somewhat contrived example here to bring the size of the dataset up:

```python
adata = ad.concat([sc.datasets.pbmc3k()] * 50 + [sc.datasets.pbmc3k_processed()] * 10, join="outer")
adata.obs_names_make_unique()
adata.obs["louvain"] = adata.obs["louvain"].fillna("B cells")
```

which gives about 160,000 cells and 32K features.

When I run your two examples on this, both return instantly. Further, I checked the memory pressure:

```python
%memit sc.pl.dotplot(adata, var_names=cd_4_t_cell_markers, groupby="louvain")
# peak memory: 1738.94 MiB, increment: 1430.94 MiB

%memit sc.pl.dotplot(adata[adata.obs["louvain"].str.contains("T cells")], var_names=cd_4_t_cell_markers, groupby="louvain")
# peak memory: 1892.92 MiB, increment: 154.17 MiB
```
which is basically what I'd expect given the code dotplot calls (which contains a section in which your data is in fact copied once to create a groupby-able dataframe).
That being said, we could probably refactor this section of the code to use sc.get.aggregate to reduce memory pressure from the copy, although that would require refactoring the BasePlot class. Overall, a large effort.
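For context, the two per-group statistics a dot plot needs are a mean expression (dot color) and a fraction of expressing cells (dot size). A hedged sketch of that aggregation with made-up toy data (gene and group names are illustrative; this is not scanpy's internal code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
genes = ["IL7R", "CCR7", "FOXP3", "CD69"]
X = rng.random((300, len(genes)))  # toy dense expression matrix
X[X < 0.5] = 0.0                   # introduce zeros so "fraction expressing" is meaningful
groups = pd.Series(rng.choice(["CD4 T cells", "CD8 T cells", "B cells"], 300), name="group")

expr = pd.DataFrame(X, columns=genes)

# Dot color: mean expression per group
mean_expr = expr.groupby(groups, observed=True).mean()

# Dot size: fraction of cells in each group with nonzero expression
frac_expr = (expr > 0).groupby(groups, observed=True).mean()

print(mean_expr.shape, frac_expr.shape)  # (3, 4) (3, 4)
```

The outputs are tiny (groups x genes), so in principle only the aggregates, not a full per-cell DataFrame, need to stay in memory.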
Perhaps on Monday we can sit together a bit, or you can send me a more complete example with your data? Or you could post a bit more info here, e.g., about X (sparse or dense, sparsity, etc.) and whether this still happens with a newer version of scanpy/anndata/other packages (I see your env has some less-than-up-to-date deps).
Of course, I'd be glad to show where I faced this problem. I will also test it out in a new environment.
@VladimirShitov, how is the speed if you do a `.copy()` on the subset, i.e.

```python
sc.pl.dotplot(
    adata[adata.obs["louvain"].str.contains("T cells")].copy(),
    var_names=cd_4_t_cell_markers,
    groupby="louvain",
)
```
I think I noticed in the past that some operations can be slower on views, but I also had the impression that got fixed at some point.