Can't save model after multi-GPU training.
Hi,
I'm using multi-GPU training and it's preventing me from saving the model. Any advice on how to fix this?
```python
print("Training scVI model...")
vae = scvi.model.SCVI(
    adata_combined,
    n_hidden=256,
    n_latent=50,
    n_layers=2,
)
vae.train(max_epochs=400, accelerator="gpu", devices=-1, strategy="ddp_find_unused_parameters_true")
vae.save(dir_path="SCVI_Model_integrated_before_clustering", save_anndata=True, overwrite=True)
adata_combined.write('Total_Dataset_SCVI_integrated_before_clustering.h5ad')
```
`Trainer.fit` stopped: `max_epochs=400` reached.
[rank3]: Traceback (most recent call last):
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/LobularTumorigenesisProject/scvi_processing.py", line 500, in <module>
[rank3]: adata_combined.write('Total_Dataset_SCVI_integrated_before_clustering.h5ad')
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/anndata/_core/anndata.py", line 1929, in write_h5ad
[rank3]: write_h5ad(
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 77, in write_h5ad
[rank3]: with h5py.File(filepath, mode) as f:
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/h5py/_hl/files.py", line 564, in __init__
[rank3]: fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/h5py/_hl/files.py", line 244, in make_fid
[rank3]: fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
[rank3]: File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
[rank3]: File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
[rank3]: File "h5py/h5f.pyx", line 122, in h5py.h5f.create
[rank3]: BlockingIOError: [Errno 11] Unable to synchronously create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
(Ranks 1 and 2 fail with identical BlockingIOError tracebacks from the same adata_combined.write call.)
[rank: 1] Child process with PID 57426 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/var/spool/slurmd/job1133221/slurm_script: line 28: 56343 Killed python scvi_processing.py
Versions:
1.1.6.post2
Can you try it with a recent version of scvi-tools?
Hi, my institutional HPC has issues installing jax and scvi, so I'm loath to mess with the environment. Wrapping the save block in the following rank-zero check fixed the issue for me.
```python
if vae.trainer.is_global_zero:
    # Save results
    print("Saving results before clustering (rank 0)...")
    vae.save(dir_path="SCVI_Model_integrated_before_clustering", save_anndata=True, overwrite=True)
    adata_combined.write('Total_Dataset_SCVI_integrated_before_clustering.h5ad')

    # Perform clustering
    sc.pp.neighbors(adata_combined, use_rep="X_scVI")
    sc.tl.leiden(adata_combined)

    # Save final results
    print("Saving final results (rank 0)...")
    adata_combined.write('Total_Dataset_SCVI_integrated.h5ad')
    vae.save(dir_path="SCVI_Model_integrated", save_anndata=True, overwrite=True)
    print("Process completed successfully on rank 0.")
else:
    print(f"Process skipping save/clustering on rank {vae.trainer.global_rank}.")
```
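One refinement I have not verified on our cluster (a sketch only): add a barrier after the rank-0 save so the other ranks wait for rank 0 to finish writing instead of racing ahead. Lightning's DDP strategy exposes `barrier()` on the trainer's strategy.

```python
# Sketch, assuming Lightning's DDP strategy is active and exposes barrier():
# only rank 0 writes, and every rank waits here until the write has finished,
# so no process exits (or touches the files) while rank 0 is still saving.
if vae.trainer.is_global_zero:
    vae.save(dir_path="SCVI_Model_integrated_before_clustering", save_anndata=True, overwrite=True)
    adata_combined.write('Total_Dataset_SCVI_integrated_before_clustering.h5ad')
vae.trainer.strategy.barrier()
```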
In the name of scientific purity, I tried updating it and re-running the code, and it produced a new error:
[rank3]: Traceback (most recent call last):
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/LobularTumorigenesisProject/scvi_processing.py", line 1038, in <module>
[rank3]: main()
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/LobularTumorigenesisProject/scvi_processing.py", line 1008, in main
[rank3]: vae.train(max_epochs=max_epochs, accelerator="gpu", devices=-1, strategy="ddp_find_unused_parameters_true")
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/scvi/model/base/_training_mixin.py", line 143, in train
[rank3]: return runner()
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/scvi/train/_trainrunner.py", line 98, in __call__
[rank3]: self.trainer.fit(self.training_plan, self.data_splitter)
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/scvi/train/_trainer.py", line 220, in fit
[rank3]: super().fit(*args, **kwargs)
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 561, in fit
[rank3]: call._call_and_handle_interrupt(
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
[rank3]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank3]: return function(*args, **kwargs)
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 599, in _fit_impl
[rank3]: self._run(model, ckpt_path=ckpt_path)
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 988, in _run
[rank3]: self.strategy.setup(self)
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 171, in setup
[rank3]: self.configure_ddp()
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 283, in configure_ddp
[rank3]: self.model = self._setup_model(self.model)
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 195, in _setup_model
[rank3]: return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank3]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank3]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/torch/distributed/utils.py", line 294, in _verify_param_shape_across_processes
[rank3]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank3]: RuntimeError: [3]: params[0] in this process with sizes [5004] appears not to match sizes of the same param in process 0.
[rank: 3] Child process with PID 6628 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/var/spool/slurmd/job1134327/slurm_script: line 28: 5712 Killed python scvi_processing.py --test
I'm not sure what that is, but the latest release works with Python 3.10+, so perhaps you need to update that first (and I would recommend starting a fresh environment for this).
I upgraded to Python 3.10 and scvi-tools 1.3.0 and the error persists; the fix I mentioned above still works, though.
`Trainer.fit` stopped: `max_epochs=400` reached.
[rank2]: Traceback (most recent call last):
[rank2]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/LobularTumorigenesisProject/scvi_processing.py", line 1132, in <module>
[rank2]: main()
[rank2]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/LobularTumorigenesisProject/scvi_processing.py", line 1112, in main
[rank2]: vae.save(dir_path="SCVI_Model_integrated_before_clustering", save_anndata=True, overwrite=True)
[rank2]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu_3_10/lib/python3.10/site-packages/scvi/model/base/_base_model.py", line 627, in save
[rank2]: self.adata.write(
[rank2]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu_3_10/lib/python3.10/site-packages/anndata/_core/anndata.py", line 1871, in write_h5ad
[rank2]: write_h5ad(
[rank2]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu_3_10/lib/python3.10/site-packages/anndata/_io/h5ad.py", line 79, in write_h5ad
[rank2]: with h5py.File(filepath, mode) as f:
[rank2]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu_3_10/lib/python3.10/site-packages/h5py/_hl/files.py", line 564, in __init__
[rank2]: fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
[rank2]: File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu_3_10/lib/python3.10/site-packages/h5py/_hl/files.py", line 244, in make_fid
[rank2]: fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
[rank2]: File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
[rank2]: File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
[rank2]: File "h5py/h5f.pyx", line 122, in h5py.h5f.create
[rank2]: BlockingIOError: [Errno 11] Unable to synchronously create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
(rank1 fails with an identical BlockingIOError traceback from the same vae.save call.)
[rank: 1] Child process with PID 6257 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/var/spool/slurmd/job1134509/slurm_script: line 38: 5408 Killed python scvi_processing.py --test
It should be working. I'm attaching a notebook showing that the basic tutorial can be trained with multi-GPU, then saved and loaded again: https://app.reviewnb.com/scverse/scvi-tutorials/blob/Ori-custom-dataloaders-tutos/use_cases/multiGPU.ipynb/ (the same works outside Jupyter as well).
I suspect some environment/infrastructure issue:
- You mentioned you have problems updating packages? What about other packages used in the process, such as anndata and torch?
- You mentioned Jax. How is it related?
- Does saving and loading fail only after multi-GPU training, or also after single-GPU training? (A quick check is sketched below.)
- Did you start a new environment?
- What is your multi-GPU infrastructure?
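A minimal sketch of that single-GPU control run (assuming the same `adata_combined` as in the original script; `max_epochs` is kept deliberately short, the directory and file names are illustrative):

```python
# Sketch: train on a single device (no DDP) and try the exact same save calls.
# If these succeed, the problem is specific to the multi-GPU/DDP code path.
vae_single = scvi.model.SCVI(adata_combined, n_hidden=256, n_latent=50, n_layers=2)
vae_single.train(max_epochs=10, accelerator="gpu", devices=1)
vae_single.save(dir_path="SCVI_Model_single_gpu_check", save_anndata=True, overwrite=True)
adata_combined.write("single_gpu_check.h5ad")
```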
Hi,
I started a new environment. Here's my full conda list. Saving and loading only fails after multi-GPU training on 4 A100s; it may be an infrastructure issue with how my institution has set up the multi-GPU cluster. I've also included my Linux distro information in case that's the source, as that has happened with other software due to glibc/distribution problems, etc.
NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.9"
PRETTY_NAME="Red Hat Enterprise Linux"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.9"
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
absl-py 2.2.2 pypi_0 pypi
aiohappyeyeballs 2.6.1 pypi_0 pypi
aiohttp 3.11.18 pypi_0 pypi
aiosignal 1.3.2 pypi_0 pypi
anndata 0.11.4 pypi_0 pypi
array-api-compat 1.11.2 pypi_0 pypi
async-timeout 5.0.1 pypi_0 pypi
attrs 25.3.0 pypi_0 pypi
biomart 0.9.2 pypi_0 pypi
bzip2 1.0.8 h4bc722e_7 conda-forge
ca-certificates 2025.4.26 hbd8a1cb_0 conda-forge
certifi 2025.4.26 pypi_0 pypi
charset-normalizer 3.4.1 pypi_0 pypi
chex 0.1.89 pypi_0 pypi
contourpy 1.3.2 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
docrep 0.3.2 pypi_0 pypi
etils 1.12.2 pypi_0 pypi
exceptiongroup 1.2.2 pypi_0 pypi
filelock 3.18.0 pypi_0 pypi
flax 0.10.4 pypi_0 pypi
fonttools 4.57.0 pypi_0 pypi
frozenlist 1.6.0 pypi_0 pypi
fsspec 2025.3.2 pypi_0 pypi
grpcio 1.71.0 pypi_0 pypi
h5py 3.13.0 pypi_0 pypi
humanize 4.12.2 pypi_0 pypi
idna 3.10 pypi_0 pypi
igraph 0.11.8 pypi_0 pypi
importlib-resources 6.5.2 pypi_0 pypi
jax 0.4.35 pypi_0 pypi
jax-cuda12-pjrt 0.5.3 pypi_0 pypi
jax-cuda12-plugin 0.5.3 pypi_0 pypi
jaxlib 0.4.35 pypi_0 pypi
jinja2 3.1.6 pypi_0 pypi
joblib 1.4.2 pypi_0 pypi
kiwisolver 1.4.8 pypi_0 pypi
ld_impl_linux-64 2.43 h712a8e2_4 conda-forge
legacy-api-wrap 1.4.1 pypi_0 pypi
leidenalg 0.10.2 pypi_0 pypi
libexpat 2.7.0 h5888daf_0 conda-forge
libffi 3.4.6 h2dba641_1 conda-forge
libgcc 14.2.0 h767d61c_2 conda-forge
libgcc-ng 14.2.0 h69a702a_2 conda-forge
libgomp 14.2.0 h767d61c_2 conda-forge
liblzma 5.8.1 hb9d3cd8_0 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libsqlite 3.49.1 hee588c1_2 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libzlib 1.3.1 hb9d3cd8_2 conda-forge
lightning 2.5.1.post0 pypi_0 pypi
lightning-utilities 0.14.3 pypi_0 pypi
llvmlite 0.44.0 pypi_0 pypi
markdown 3.8 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 3.0.2 pypi_0 pypi
matplotlib 3.10.1 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
ml-collections 1.1.0 pypi_0 pypi
ml-dtypes 0.5.1 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
msgpack 1.1.0 pypi_0 pypi
mudata 0.3.1 pypi_0 pypi
multidict 6.4.3 pypi_0 pypi
multipledispatch 1.0.0 pypi_0 pypi
natsort 8.4.0 pypi_0 pypi
ncurses 6.5 h2d0b736_3 conda-forge
nest-asyncio 1.6.0 pypi_0 pypi
networkx 3.4.2 pypi_0 pypi
numba 0.61.2 pypi_0 pypi
numpy 2.2.5 pypi_0 pypi
numpyro 0.18.0 pypi_0 pypi
nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
nvidia-cuda-nvcc-cu12 12.8.93 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
nvidia-cusparselt-cu12 0.6.2 pypi_0 pypi
nvidia-nccl-cu12 2.21.5 pypi_0 pypi
nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
openssl 3.5.0 h7b32b05_0 conda-forge
opt-einsum 3.4.0 pypi_0 pypi
optax 0.2.4 pypi_0 pypi
orbax-checkpoint 0.11.5 pypi_0 pypi
packaging 24.2 pypi_0 pypi
pandas 2.2.3 pypi_0 pypi
patsy 1.0.1 pypi_0 pypi
pillow 11.2.1 pypi_0 pypi
pip 25.1 pyh8b19718_0 conda-forge
propcache 0.3.1 pypi_0 pypi
protobuf 6.30.2 pypi_0 pypi
pygments 2.19.1 pypi_0 pypi
pynndescent 0.5.13 pypi_0 pypi
pyparsing 3.2.3 pypi_0 pypi
pyro-api 0.1.2 pypi_0 pypi
pyro-ppl 1.9.1 pypi_0 pypi
python 3.10.17 hd6af730_0_cpython conda-forge
python-dateutil 2.9.0.post0 pypi_0 pypi
pytorch-lightning 2.5.1.post0 pypi_0 pypi
pytz 2025.2 pypi_0 pypi
pyyaml 6.0.2 pypi_0 pypi
readline 8.2 h8c095d6_2 conda-forge
requests 2.32.3 pypi_0 pypi
rich 14.0.0 pypi_0 pypi
scanpy 1.11.1 pypi_0 pypi
scikit-learn 1.5.2 pypi_0 pypi
scikit-misc 0.5.1 pypi_0 pypi
scipy 1.15.2 pypi_0 pypi
scvi-tools 1.3.0 pypi_0 pypi
seaborn 0.13.2 pypi_0 pypi
session-info2 0.1.2 pypi_0 pypi
setuptools 79.0.1 pyhff2d567_0 conda-forge
simplejson 3.20.1 pypi_0 pypi
six 1.17.0 pypi_0 pypi
sparse 0.16.0 pypi_0 pypi
statsmodels 0.14.4 pypi_0 pypi
sympy 1.13.1 pypi_0 pypi
tensorboard 2.19.0 pypi_0 pypi
tensorboard-data-server 0.7.2 pypi_0 pypi
tensorstore 0.1.74 pypi_0 pypi
texttable 1.7.0 pypi_0 pypi
threadpoolctl 3.6.0 pypi_0 pypi
tk 8.6.13 noxft_h4845f30_101 conda-forge
toolz 1.0.0 pypi_0 pypi
torch 2.6.0 pypi_0 pypi
torchmetrics 1.7.1 pypi_0 pypi
tqdm 4.67.1 pypi_0 pypi
treescope 0.1.9 pypi_0 pypi
triton 3.2.0 pypi_0 pypi
typing-extensions 4.13.2 pypi_0 pypi
tzdata 2025.2 pypi_0 pypi
umap-learn 0.5.7 pypi_0 pypi
urllib3 2.4.0 pypi_0 pypi
werkzeug 3.1.3 pypi_0 pypi
wheel 0.45.1 pyhd8ed1ab_1 conda-forge
xarray 2025.3.1 pypi_0 pypi
yarl 1.20.0 pypi_0 pypi
zipp 3.21.0 pypi_0 pypi
My only guess right now is the infrastructure. We work with Ubuntu here; we need to make sure it works on Red Hat as well (although I don't think the OS is the issue in your case).
To clarify, it runs smoothly with a single GPU. @ilan-gold, to me this looks more like an AnnData issue: it's trying to use multiple file streams to write the file(?). Any recommendation for avoiding this?
Without knowing anything about the issue, Can, I don't think h5 is thread-safe (and using it for writing from multiple processes is probably also not safe). But I would need to dig into the issue to know more.
Yes, it runs smoothly with a single GPU. I believe the distribution across 4 GPUs means they don't all exit at the same time, and at some point two different ranks reach the save code simultaneously, which is where it breaks. I don't know enough about the DDP parallelization to read any deeper than that.
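A rough way to reproduce that race outside scvi, if it helps confirm the diagnosis (a sketch only; whether it actually triggers depends on the filesystem and on HDF5 file locking being enabled, which is the default):

```python
# Sketch: several processes open the same HDF5 file for writing at once.
# The late arrivals should fail with the same "unable to lock file" error
# (it surfaced as BlockingIOError, errno 11, in the tracebacks above).
import time
import h5py
from multiprocessing import Process

PATH = "lock_demo.h5"  # throwaway file, not part of the real pipeline

def writer(rank: int) -> None:
    try:
        with h5py.File(PATH, "w") as f:  # "w" truncates and takes an exclusive lock
            f.create_dataset("rank", data=rank)
            time.sleep(2)  # hold the lock so the processes actually overlap
    except OSError as e:
        print(f"rank {rank} failed to open the file: {e}")

if __name__ == "__main__":
    procs = [Process(target=writer, args=(r,)) for r in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```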
BTW, I'm not saying it won't work with processes; it's just that I don't know. We read (not write) h5 in processes by reopening the file, which bypasses the lock (which is meant to prevent threaded usage). But I don't know what happens when you write from multiple processes. I could come up with scenarios where it might work and where it might not.
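For reference, that read pattern looks roughly like this (a sketch; the `obs/_index` key reflects how recent anndata versions lay out the obs index on disk, so treat it as an assumption):

```python
# Sketch: each worker reopens the .h5ad read-only and holds its own handle;
# this mirrors the read-in-processes pattern described above (reads only, no writes).
import h5py
from multiprocessing import Pool

PATH = "Total_Dataset_SCVI_integrated_before_clustering.h5ad"  # file from this thread

def peek_obs_index(_):
    with h5py.File(PATH, "r") as f:  # reopen inside the worker process
        return list(f["obs/_index"][:5])

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(peek_obs_index, range(4)))
```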
Zarr would be my recommendation for avoiding the issue, but writing into one store from multiple sources is tricky independent of the file backend.
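If you try that, the rank-0 guard from earlier in the thread still applies; a minimal sketch (file names are illustrative):

```python
# Sketch: write the AnnData to a Zarr store instead of .h5ad, still only from rank 0,
# since concurrent writes into a single store are tricky regardless of the backend.
if vae.trainer.is_global_zero:
    adata_combined.write_zarr("Total_Dataset_SCVI_integrated_before_clustering.zarr")
# To read it back later: anndata.read_zarr("Total_Dataset_SCVI_integrated_before_clustering.zarr")
```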
I'm closing this here, but keeping this remark in case such a thing happens again:
Hi, my institutional HPC has issues installing jax and scvi, so I'm loath to mess with the environment. Wrapping the save block in the following rank-zero check fixed the issue for me.
```python
if vae.trainer.is_global_zero:
    # Save results
    print("Saving results before clustering (rank 0)...")
    vae.save(dir_path="SCVI_Model_integrated_before_clustering", save_anndata=True, overwrite=True)
    adata_combined.write('Total_Dataset_SCVI_integrated_before_clustering.h5ad')
    # Perform clustering
    sc.pp.neighbors(adata_combined, use_rep="X_scVI")
    sc.tl.leiden(adata_combined)
    # Save final results
    print("Saving final results (rank 0)...")
    adata_combined.write('Total_Dataset_SCVI_integrated.h5ad')
    vae.save(dir_path="SCVI_Model_integrated", save_anndata=True, overwrite=True)
    print("Process completed successfully on rank 0.")
else:
    print(f"Process skipping save/clustering on rank {vae.trainer.global_rank}.")
```