scanpy Improve flexibility of `sc.pp.scrublet()`

Please make sure these conditions are met

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of scanpy.
[ ] (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

I'm trying to predict my doublets using scrublet. So, I'm using the pp.scrublet function to do that. The thing is, when I run it without adding the parameter batch_key, it works. But is failing when I try to use that parameter.

I tried these options before submitting the Issue:

[X] Checked for NaN in original Anndata and Anndata adata_after_qc
[X] Checked for NaN in Anndata with simulated doublets adata_sim
[X] It's not a problem on my .obs because I already used the batch effect correction in other Scanpy functions and it worked
[X] Ran the function with and without the parameter and only works without the parameter
[X] Run the simulation within each batch independently

I think it might be a problem managing the batches when you have the combination of pre-simulated doublets using sc.pp.scrublet_simulate_doublets() + original Anndata + batch_key because that the output of sc.pp.scrublet_simulate_doublets() removes the .obs in the Anndata output.

Minimal code sample

# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "scanpy[scrublet] @ git+https://github.com/scverse/scanpy.git@main",
# ]
# ///
#
# This script automatically imports the development branch of scanpy to check for issues

import anndata as ad
import numpy as np
import scanpy as sc
import scipy.sparse as sp

# TODO: create `adata_after_qc`
# Simulate an AnnData similar to the one in question
n_obs, n_vars = 4151, 8342
dtype = np.int64

# Generate a sparse random count matrix (Poisson-distributed)
X = sp.csr_matrix(np.random.poisson(1, size=(n_obs, n_vars)), dtype=dtype)

adata_dummy = ad.AnnData(X)

# Example metadata
adata_dummy.obs["batch"] = np.random.choice(["r1", "r2", "r3", "r4"], size=n_obs)
adata_dummy.var["gene_symbol"] = [f"gene_{i}" for i in range(n_vars)]

# Store raw counts layer
adata_dummy.layers["raw_counts"] = adata_dummy.X.copy()


# Select the gene that I will modify to have zero variance
zero_var_gene_idx = 13
zero_var_gene_name = adata_dummy.var_names[zero_var_gene_idx]

# Get indices of cells belonging to batch 'r1'
batch_mask = adata_dummy.obs["batch"] == "r1"
r1_idx = np.where(batch_mask)[0]

# Set that gene's values in r1 to a constant
X_dense = adata_dummy.X.toarray()
X_dense[r1_idx, zero_var_gene_idx] = 2  # Constant value of gene

# Convert back to csr sparse matrix
adata_dummy.X = sp.csr_matrix(X_dense)

adata_sim = sc.pp.scrublet_simulate_doublets(
    adata_dummy, 
    synthetic_doublet_umi_subsampling = 1.0, # 1.0 == doublet is created adding counts from two random samples 
    layer="raw_counts"
)

sc.pp.scrublet(
    adata_dummy, # Original anndata 
    adata_sim=adata_sim, # Anndata with simulated doublets and modified .X 
    knn_dist_metric="euclidean",
    n_prin_comps=20,
    batch_key= "batch"
)

Error output

ValueError: Input X contains NaN.
PCA does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Versions

marimo	0.15.2
scanpy	1.11.4
anndata	0.12.2
seaborn	0.13.2
numpy	2.2.6
scipy	1.16.1
pooch	1.8.2 (v1.8.2)
pandas	2.3.2
----	----
h5py	3.14.0
Markdown	3.9
charset-normalizer	3.4.3
pymdown-extensions	10.16.1
cffi	2.0.0
typing_extensions	4.15.0
setuptools	80.9.0
msgpack	1.1.1
jedi	0.19.2
crc32c	2.7.1
websockets	15.0.1
parso	0.8.5
statsmodels	0.14.5
threadpoolctl	3.6.0
colorama	0.4.6
zarr	3.1.2
pillow	11.3.0
numba	0.61.2
llvmlite	0.44.0
session-info2	0.2.1
scikit-learn	1.7.2
psutil	7.0.0
six	1.17.0
narwhals	2.4.0
itsdangerous	2.2.0
anyio	4.10.0
Pygments	2.19.2
joblib	1.5.2
PyYAML	6.0.2
patsy	1.0.1
pyparsing	3.2.3
h11	0.16.0
packaging	25.0
matplotlib	3.10.6
legacy-api-wrap	1.4.1
uvicorn	0.35.0
kiwisolver	1.4.9
natsort	8.4.0
click	8.2.1
numcodecs	0.16.1
donfig	0.8.1.post1
python-dateutil	2.9.0.post0
docutils	0.22
platformdirs	4.5.0
pytz	2025.2
zstandard	0.25.0
tomlkit	0.13.3
tqdm	4.67.1
pycparser	2.22
sniffio	1.3.1
starlette	0.47.3
cycler	0.12.1
----	----
Python	3.13.7 | packaged by conda-forge | (main, Sep  3 2025, 14:30:35) [GCC 14.3.0]
OS	Linux-6.8.0-85-generic-x86_64-with-glibc2.39
CPU	128 logical CPU cores, x86_64
GPU	No GPU found
Updated	2025-10-17 15:04

Oct 17 '25 15:10 zekepuli

Update: This, indeed, worked. So, the problem is definitely when using adata_sim parameter.

# Input is raw data in .X. output are doublet_score, predicted_doublet in .obs and scrublet info in .uns
sc.pp.scrublet(
    adata_first_qc, # Original anndata 
    knn_dist_metric= "euclidean",
    n_prin_comps=20,
    batch_key= "batch", 
    random_state= 42
)

Oct 23 '25 14:10 zekepuli

Hi, please add a complete, reproducible example. I added a TODO to your code where steps are missing!

Please also make sure the parameter docs apply:

The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes. Expected to be un-normalised where adata_sim is not supplied, in which case doublets will be simulated and pre-processing applied to both objects. If adata_sim is supplied, this should be the observed transcriptomes processed consistently (filtering, transform, normalisaton, hvg) with adata_sim.

Oct 30 '25 13:10 flying-sheep

I added a TODO to your code where steps are missing!

Hahaha not even my boss dared to go that far. I will try to deliver them, thank you!

Oct 30 '25 14:10 zekepuli

@flying-sheep I've found the problem! It was my fault because I wasn't using the observed transcriptomes processed with filtering, transform, normalization and hvg when using adata_sim. I honestly didn't want to create another anndata just to have the .X with the normalized data because I already did that in a layer that I actually used for the other Scanpy functions that do allow me to select the layer. So I thought I could just use the parameters normalize_variance, log_transform, and mean_center and thus avoid creating another anndata and that was a mistake. But why I was getting the error only when using the batch parameter? Because in the overall expression I didn't have zero variance in genes. But, I've found a gene with zero variance in one of the batches and that gene is converted to NaN when normalize_variance= True. When I made the proper preprocessing, I get rid of the zero variance in genes in all batches.

Nov 04 '25 19:11 zekepuli

Hahaha not even my boss dared to go that far. I will try to deliver them, thank you!

Ha! I think the roles are a bit different here 😉 I wouldn’t dare either if I was your boss!

But your boss is the one who wants your work, and you’re the one who wanted our work here 😄

So I thought I could just use the parameters normalize_variance, log_transform, and mean_center and thus avoid creating another anndata and that was a mistake.

I think we should make that possible.

But why I was getting the error only when using the batch parameter?
Because in the overall expression I didn't have zero variance in genes.

Hmm, maybe there’s a way to catch cases like this on our side and give a better error message.

When I made the proper preprocessing, I get rid of the zero variance in genes in all batches.

Wonderful! I’ll leave this open to track the doc updates and parameter updates that might have prevented you from running into this.

Nov 07 '25 17:11 flying-sheep

But your boss is the one who wants your work, and you’re the one who wanted our work here 😄

Hahaha that's very true

I think we should make that possible.

That would be wonderful!

Hmm, maybe there’s a way to catch cases like this on our side and give a better error message.

I've found it when I calculated the variance in the whole expression table and then repeated the process by batch. Maybe you could do something like that. But honestly, I don't know how computationally efficient that solution is.

Wonderful! I’ll leave this open to track the doc updates and parameter updates that might have prevented you from running into this.

Thanks for the attention. Good luck!

Nov 07 '25 17:11 zekepuli