
Can't save model after multi-GPU training.

Open alexanderchang1 opened this issue 7 months ago • 13 comments

Hi,

I'm using multi-GPU training and it's preventing me from saving the model. Any advice on how to fix this?

print("Training scVI model...")
vae = scvi.model.SCVI(
    adata_combined, 
    n_hidden=256, 
    n_latent=50, 
    n_layers=2
)

vae.train(max_epochs=400, accelerator="gpu", devices=-1, strategy="ddp_find_unused_parameters_true")

vae.save(dir_path="SCVI_Model_integrated_before_clustering", save_anndata=True, overwrite=True)
adata_combined.write('Total_Dataset_SCVI_integrated_before_clustering.h5ad')


`Trainer.fit` stopped: `max_epochs=400` reached.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/LobularTumorigenesisProject/scvi_processing.py", line 500, in <module>
[rank3]:     adata_combined.write('Total_Dataset_SCVI_integrated_before_clustering.h5ad')
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/anndata/_core/anndata.py", line 1929, in write_h5ad
[rank3]:     write_h5ad(
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 77, in write_h5ad
[rank3]:     with h5py.File(filepath, mode) as f:
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/h5py/_hl/files.py", line 564, in __init__
[rank3]:     fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/h5py/_hl/files.py", line 244, in make_fid
[rank3]:     fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
[rank3]:   File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
[rank3]:   File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
[rank3]:   File "h5py/h5f.pyx", line 122, in h5py.h5f.create
[rank3]: BlockingIOError: [Errno 11] Unable to synchronously create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
[rank2] and [rank1] fail with identical tracebacks, ending in the same BlockingIOError.
[rank: 1] Child process with PID 57426 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/var/spool/slurmd/job1133221/slurm_script: line 28: 56343 Killed                  python scvi_processing.py

Versions:

1.1.6.post2

alexanderchang1 avatar Apr 28 '25 14:04 alexanderchang1

Can you try it with a recent version of scvi-tools?

ori-kron-wis avatar Apr 28 '25 15:04 ori-kron-wis

Hi, my institutional HPC has issues installing jax and scvi, so I'm loath to mess with it. Wrapping the save block in the following rank check fixed the issue for me.

if vae.trainer.is_global_zero:
    # Save results
    print("Saving results before clustering (rank 0)...")

    vae.save(dir_path="SCVI_Model_integrated_before_clustering", save_anndata=True, overwrite=True)
    adata_combined.write('Total_Dataset_SCVI_integrated_before_clustering.h5ad')

    # Perform clustering
    sc.pp.neighbors(adata_combined, use_rep="X_scVI")
    sc.tl.leiden(adata_combined)

    # Save final results
    print("Saving final results (rank 0)...")
    adata_combined.write('Total_Dataset_SCVI_integrated.h5ad')
    vae.save(dir_path="SCVI_Model_integrated", save_anndata=True, overwrite=True)

    print("Process completed successfully on rank 0.")
else:
    print(f"Process skipping save/clustering on rank {vae.trainer.global_rank}.")

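For scripts that can't reach the trainer object at save time, the same guard can be expressed with launcher environment variables. A minimal sketch, assuming a torchrun/SLURM-style launcher that exports `RANK` or `SLURM_PROCID` (the exact variable names depend on your launcher, so treat the list below as an assumption to verify):

```python
import os

def is_global_rank_zero() -> bool:
    """Best-effort rank check that also works in single-process runs.

    DDP launchers (torchrun, SLURM, Lightning's subprocess launcher)
    export a rank variable for each worker; if none is set, assume a
    single process and return True. LOCAL_RANK is per-node, so it is
    kept last as a fallback for single-node runs only.
    """
    for var in ("RANK", "SLURM_PROCID", "LOCAL_RANK"):
        value = os.environ.get(var)
        if value is not None:
            return value == "0"
    return True
```

With this, `if is_global_rank_zero(): vae.save(...)` has the same effect as the `vae.trainer.is_global_zero` check above, without needing a handle on the trainer.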

alexanderchang1 avatar Apr 28 '25 15:04 alexanderchang1

In the name of scientific purity I tried updating it and re-running the code, and it produced a new error:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/LobularTumorigenesisProject/scvi_processing.py", line 1038, in <module>
[rank3]:     main()
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/LobularTumorigenesisProject/scvi_processing.py", line 1008, in main
[rank3]:     vae.train(max_epochs=max_epochs, accelerator="gpu", devices=-1, strategy="ddp_find_unused_parameters_true")
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/scvi/model/base/_training_mixin.py", line 143, in train
[rank3]:     return runner()
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/scvi/train/_trainrunner.py", line 98, in __call__
[rank3]:     self.trainer.fit(self.training_plan, self.data_splitter)
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/scvi/train/_trainer.py", line 220, in fit
[rank3]:     super().fit(*args, **kwargs)
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 561, in fit
[rank3]:     call._call_and_handle_interrupt(
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
[rank3]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank3]:     return function(*args, **kwargs)
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 599, in _fit_impl
[rank3]:     self._run(model, ckpt_path=ckpt_path)
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 988, in _run
[rank3]:     self.strategy.setup(self)
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 171, in setup
[rank3]:     self.configure_ddp()
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 283, in configure_ddp
[rank3]:     self.model = self._setup_model(self.model)
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 195, in _setup_model
[rank3]:     return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank3]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank3]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_jupyter/lib/python3.9/site-packages/torch/distributed/utils.py", line 294, in _verify_param_shape_across_processes
[rank3]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank3]: RuntimeError: [3]: params[0] in this process with sizes [5004] appears not to match sizes of the same param in process 0.
[rank: 3] Child process with PID 6628 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/var/spool/slurmd/job1134327/slurm_script: line 28: 5712 Killed                  python scvi_processing.py --test
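We can't see the script here, but one classic cause of this DDP error is worth noting: every spawned rank re-runs the whole script, so if any preprocessing step (for example, gene filtering) draws from an unseeded RNG, different ranks build differently sized models and DDP's parameter-shape check fails, just like the `[5004]` vs. process-0 mismatch above. A stdlib sketch of the failure mode and the seeding fix (all names here are invented for illustration):

```python
import random

# Hypothetical gene-selection step: counts how many of n_genes survive
# a random keep/drop decision. Without a seed, each DDP rank would get
# a different count and hence a different first-layer weight shape.
def select_genes(n_genes, keep_fraction, seed=None):
    rng = random.Random(seed)
    return sum(1 for _ in range(n_genes) if rng.random() < keep_fraction)

# With a shared seed, every rank computes the same count, so the model
# built on top of it has identical parameter shapes on all ranks.
rank0_count = select_genes(10_000, 0.5, seed=42)
rank1_count = select_genes(10_000, 0.5, seed=42)
assert rank0_count == rank1_count
```

The same reasoning applies to any randomized step that runs before `vae.train(...)`: fix its seed so all ranks agree on the data shapes.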

alexanderchang1 avatar Apr 28 '25 16:04 alexanderchang1

I'm not sure what that is, but the latest release requires Python 3.10+, so perhaps update that first (and I would recommend starting a fresh env for this).

ori-kron-wis avatar Apr 28 '25 17:04 ori-kron-wis

I upgraded to Python 3.10 and scvi-tools 1.3.0 and the error persists; the fix I mentioned above still works, though.

`Trainer.fit` stopped: `max_epochs=400` reached.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/LobularTumorigenesisProject/scvi_processing.py", line 1132, in <module>
[rank2]:     main()
[rank2]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/LobularTumorigenesisProject/scvi_processing.py", line 1112, in main
[rank2]:     vae.save(dir_path="SCVI_Model_integrated_before_clustering", save_anndata=True, overwrite=True)
[rank2]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu_3_10/lib/python3.10/site-packages/scvi/model/base/_base_model.py", line 627, in save
[rank2]:     self.adata.write(
[rank2]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu_3_10/lib/python3.10/site-packages/anndata/_core/anndata.py", line 1871, in write_h5ad
[rank2]:     write_h5ad(
[rank2]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu_3_10/lib/python3.10/site-packages/anndata/_io/h5ad.py", line 79, in write_h5ad
[rank2]:     with h5py.File(filepath, mode) as f:
[rank2]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu_3_10/lib/python3.10/site-packages/h5py/_hl/files.py", line 564, in __init__
[rank2]:     fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
[rank2]:   File "/ix1/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu_3_10/lib/python3.10/site-packages/h5py/_hl/files.py", line 244, in make_fid
[rank2]:     fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
[rank2]:   File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
[rank2]:   File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
[rank2]:   File "h5py/h5f.pyx", line 122, in h5py.h5f.create
[rank2]: BlockingIOError: [Errno 11] Unable to synchronously create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
[rank1]: (identical traceback, ending in the same BlockingIOError)
[rank: 1] Child process with PID 6257 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/var/spool/slurmd/job1134509/slurm_script: line 38: 5408 Killed                  python scvi_processing.py --test

alexanderchang1 avatar Apr 28 '25 21:04 alexanderchang1

It should be working. I'm attaching a notebook showing the basic tutorial trained with multi-GPU, then saved and loaded again: https://app.reviewnb.com/scverse/scvi-tutorials/blob/Ori-custom-dataloaders-tutos/use_cases/multiGPU.ipynb/ (the same works outside Jupyter as well).

I suspect some environment/infrastructure issue. You mentioned you have problems updating packages; what about other packages used in the process, such as anndata and torch? You mentioned Jax: how is it related? Does saving and loading fail only after multi-GPU training, or also with single-GPU training? Did you start a new env? What is your multi-GPU infrastructure?

ori-kron-wis avatar Apr 29 '25 07:04 ori-kron-wis

Hi,

I started a new environment. Here's my full conda list. Saving and loading only fails after multi-GPU training on 4 A100s; it may be an infrastructure issue with how my institution has set up the multi-GPU cluster. I've also included my Linux distro information in case that's the source, as that has happened with other software due to glibc distribution problems, etc.

NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.9"
PRETTY_NAME="Red Hat Enterprise Linux"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.9"
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
absl-py                   2.2.2                    pypi_0    pypi
aiohappyeyeballs          2.6.1                    pypi_0    pypi
aiohttp                   3.11.18                  pypi_0    pypi
aiosignal                 1.3.2                    pypi_0    pypi
anndata                   0.11.4                   pypi_0    pypi
array-api-compat          1.11.2                   pypi_0    pypi
async-timeout             5.0.1                    pypi_0    pypi
attrs                     25.3.0                   pypi_0    pypi
biomart                   0.9.2                    pypi_0    pypi
bzip2                     1.0.8                h4bc722e_7    conda-forge
ca-certificates           2025.4.26            hbd8a1cb_0    conda-forge
certifi                   2025.4.26                pypi_0    pypi
charset-normalizer        3.4.1                    pypi_0    pypi
chex                      0.1.89                   pypi_0    pypi
contourpy                 1.3.2                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
docrep                    0.3.2                    pypi_0    pypi
etils                     1.12.2                   pypi_0    pypi
exceptiongroup            1.2.2                    pypi_0    pypi
filelock                  3.18.0                   pypi_0    pypi
flax                      0.10.4                   pypi_0    pypi
fonttools                 4.57.0                   pypi_0    pypi
frozenlist                1.6.0                    pypi_0    pypi
fsspec                    2025.3.2                 pypi_0    pypi
grpcio                    1.71.0                   pypi_0    pypi
h5py                      3.13.0                   pypi_0    pypi
humanize                  4.12.2                   pypi_0    pypi
idna                      3.10                     pypi_0    pypi
igraph                    0.11.8                   pypi_0    pypi
importlib-resources       6.5.2                    pypi_0    pypi
jax                       0.4.35                   pypi_0    pypi
jax-cuda12-pjrt           0.5.3                    pypi_0    pypi
jax-cuda12-plugin         0.5.3                    pypi_0    pypi
jaxlib                    0.4.35                   pypi_0    pypi
jinja2                    3.1.6                    pypi_0    pypi
joblib                    1.4.2                    pypi_0    pypi
kiwisolver                1.4.8                    pypi_0    pypi
ld_impl_linux-64          2.43                 h712a8e2_4    conda-forge
legacy-api-wrap           1.4.1                    pypi_0    pypi
leidenalg                 0.10.2                   pypi_0    pypi
libexpat                  2.7.0                h5888daf_0    conda-forge
libffi                    3.4.6                h2dba641_1    conda-forge
libgcc                    14.2.0               h767d61c_2    conda-forge
libgcc-ng                 14.2.0               h69a702a_2    conda-forge
libgomp                   14.2.0               h767d61c_2    conda-forge
liblzma                   5.8.1                hb9d3cd8_0    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.49.1               hee588c1_2    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libzlib                   1.3.1                hb9d3cd8_2    conda-forge
lightning                 2.5.1.post0              pypi_0    pypi
lightning-utilities       0.14.3                   pypi_0    pypi
llvmlite                  0.44.0                   pypi_0    pypi
markdown                  3.8                      pypi_0    pypi
markdown-it-py            3.0.0                    pypi_0    pypi
markupsafe                3.0.2                    pypi_0    pypi
matplotlib                3.10.1                   pypi_0    pypi
mdurl                     0.1.2                    pypi_0    pypi
ml-collections            1.1.0                    pypi_0    pypi
ml-dtypes                 0.5.1                    pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
msgpack                   1.1.0                    pypi_0    pypi
mudata                    0.3.1                    pypi_0    pypi
multidict                 6.4.3                    pypi_0    pypi
multipledispatch          1.0.0                    pypi_0    pypi
natsort                   8.4.0                    pypi_0    pypi
ncurses                   6.5                  h2d0b736_3    conda-forge
nest-asyncio              1.6.0                    pypi_0    pypi
networkx                  3.4.2                    pypi_0    pypi
numba                     0.61.2                   pypi_0    pypi
numpy                     2.2.5                    pypi_0    pypi
numpyro                   0.18.0                   pypi_0    pypi
nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
nvidia-cuda-nvcc-cu12     12.8.93                  pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
openssl                   3.5.0                h7b32b05_0    conda-forge
opt-einsum                3.4.0                    pypi_0    pypi
optax                     0.2.4                    pypi_0    pypi
orbax-checkpoint          0.11.5                   pypi_0    pypi
packaging                 24.2                     pypi_0    pypi
pandas                    2.2.3                    pypi_0    pypi
patsy                     1.0.1                    pypi_0    pypi
pillow                    11.2.1                   pypi_0    pypi
pip                       25.1               pyh8b19718_0    conda-forge
propcache                 0.3.1                    pypi_0    pypi
protobuf                  6.30.2                   pypi_0    pypi
pygments                  2.19.1                   pypi_0    pypi
pynndescent               0.5.13                   pypi_0    pypi
pyparsing                 3.2.3                    pypi_0    pypi
pyro-api                  0.1.2                    pypi_0    pypi
pyro-ppl                  1.9.1                    pypi_0    pypi
python                    3.10.17         hd6af730_0_cpython    conda-forge
python-dateutil           2.9.0.post0              pypi_0    pypi
pytorch-lightning         2.5.1.post0              pypi_0    pypi
pytz                      2025.2                   pypi_0    pypi
pyyaml                    6.0.2                    pypi_0    pypi
readline                  8.2                  h8c095d6_2    conda-forge
requests                  2.32.3                   pypi_0    pypi
rich                      14.0.0                   pypi_0    pypi
scanpy                    1.11.1                   pypi_0    pypi
scikit-learn              1.5.2                    pypi_0    pypi
scikit-misc               0.5.1                    pypi_0    pypi
scipy                     1.15.2                   pypi_0    pypi
scvi-tools                1.3.0                    pypi_0    pypi
seaborn                   0.13.2                   pypi_0    pypi
session-info2             0.1.2                    pypi_0    pypi
setuptools                79.0.1             pyhff2d567_0    conda-forge
simplejson                3.20.1                   pypi_0    pypi
six                       1.17.0                   pypi_0    pypi
sparse                    0.16.0                   pypi_0    pypi
statsmodels               0.14.4                   pypi_0    pypi
sympy                     1.13.1                   pypi_0    pypi
tensorboard               2.19.0                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
tensorstore               0.1.74                   pypi_0    pypi
texttable                 1.7.0                    pypi_0    pypi
threadpoolctl             3.6.0                    pypi_0    pypi
tk                        8.6.13          noxft_h4845f30_101    conda-forge
toolz                     1.0.0                    pypi_0    pypi
torch                     2.6.0                    pypi_0    pypi
torchmetrics              1.7.1                    pypi_0    pypi
tqdm                      4.67.1                   pypi_0    pypi
treescope                 0.1.9                    pypi_0    pypi
triton                    3.2.0                    pypi_0    pypi
typing-extensions         4.13.2                   pypi_0    pypi
tzdata                    2025.2                   pypi_0    pypi
umap-learn                0.5.7                    pypi_0    pypi
urllib3                   2.4.0                    pypi_0    pypi
werkzeug                  3.1.3                    pypi_0    pypi
wheel                     0.45.1             pyhd8ed1ab_1    conda-forge
xarray                    2025.3.1                 pypi_0    pypi
yarl                      1.20.0                   pypi_0    pypi
zipp                      3.21.0                   pypi_0    pypi

alexanderchang1 avatar Apr 29 '25 12:04 alexanderchang1

My only guess right now is the infrastructure. We work with Ubuntu here; we need to make sure it's working on Red Hat as well (although I don't think it's the OS in your case).

ori-kron-wis avatar Apr 29 '25 12:04 ori-kron-wis

To clarify, it runs smoothly with a single GPU. @ilan-gold, to me this looks more like an AnnData issue: it's trying to use multiple file streams to write the file(?). Any recommendation for avoiding this?

canergen avatar Apr 29 '25 14:04 canergen

Without knowing anything about the issue, Can: I don't think h5 is thread-safe (and using it for writing across processes is probably also not safe). But I would need to dig into the issue to know more.

ilan-gold avatar Apr 29 '25 14:04 ilan-gold

Yes, it runs smoothly with a single GPU. I believe that with the work distributed across 4 GPUs the processes don't all exit at the same time, and at some point two different GPUs reach the save code simultaneously; that's where it breaks. I don't know enough about the DDP parallelization to read deeper than that.
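That reading matches the errno in the tracebacks: HDF5 takes an exclusive advisory lock when it creates the file, and whichever rank arrives second can't acquire it. The failure mode can be reproduced in miniature with stdlib file locks, no h5py involved (a sketch; Linux-only because of fcntl):

```python
import errno
import fcntl
import os
import tempfile

# Two independent opens of the same file get separate open-file
# descriptions, so the second non-blocking exclusive flock fails with
# EAGAIN (errno 11), analogous to what HDF5 reports when two DDP ranks
# race to create the same .h5ad file.
fd, path = tempfile.mkstemp()
os.close(fd)

first = open(path, "wb")
fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)

second = open(path, "wb")
try:
    fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
    second_errno = None  # would mean the lock unexpectedly succeeded
except BlockingIOError as exc:
    second_errno = exc.errno  # 11: 'Resource temporarily unavailable'
finally:
    first.close()
    second.close()
    os.unlink(path)

assert second_errno == errno.EAGAIN
```

This is why restricting the writes to a single rank (as in the `is_global_zero` guard above) removes the error: only one process ever tries to take the lock.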

alexanderchang1 avatar Apr 29 '25 14:04 alexanderchang1

BTW, I'm not saying it won't work with processes; it's just that I don't know. We read (not write) h5 in processes by reopening the file, which bypasses the lock (which is meant to prevent threaded usage). But I don't know what happens when you write from multiple processes. I could come up with scenarios where it might work and where it might not.

ilan-gold avatar Apr 29 '25 14:04 ilan-gold

Zarr would be my recommendation for avoiding the issue, but writing into one store from multiple sources is tricky regardless of the file backend.

ilan-gold avatar Apr 29 '25 14:04 ilan-gold

I'm closing this here, but keeping this remark in case such a thing happens again:

Hi, my institutional HPC has issues installing jax and scvi, so I'm loath to mess with it. Wrapping the save block in the following rank check fixed the issue for me.

if vae.trainer.is_global_zero:
    # Save results
    print("Saving results before clustering (rank 0)...")

    vae.save(dir_path="SCVI_Model_integrated_before_clustering", save_anndata=True, overwrite=True)
    adata_combined.write('Total_Dataset_SCVI_integrated_before_clustering.h5ad')

    # Perform clustering
    sc.pp.neighbors(adata_combined, use_rep="X_scVI")
    sc.tl.leiden(adata_combined)

    # Save final results
    print("Saving final results (rank 0)...")
    adata_combined.write('Total_Dataset_SCVI_integrated.h5ad')
    vae.save(dir_path="SCVI_Model_integrated", save_anndata=True, overwrite=True)

    print("Process completed successfully on rank 0.")
else:
    print(f"Process skipping save/clustering on rank {vae.trainer.global_rank}.")


ori-kron-wis avatar Aug 03 '25 11:08 ori-kron-wis