Synchronous dataloader write fails

Open · yugeji opened this issue 1 year ago · 1 comment


Environment: python==3.9.0, with pertpy and snakemake installed.

Within the Snakefile:

rule prepare_data:
        output: TMPDIR / 'prepared_{dataset}.h5ad'
        run:
                import os
                os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
                import pertpy as pt
                dataset = wildcards.dataset

                if dataset in ['sciplex_K562', 'sciplex_A549', 'sciplex_MCF7']:
                        cell_line = dataset.split('_')[1]
                        adata =

Because all three dataset values were run at the same time, the dataloader was run in three different threads. Since the file had not been pre-downloaded, all three threads started downloading it at once, and each tried to lock the file, so none of them could complete the download. Adding the os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE" line does not fix this.

Version information

pertpy 0.5.0 session_info 1.0.0


Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) [GCC 12.3.0] Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.28

Session information updated at 2023-10-24 21:56

yugeji, Oct 24 '23 20:10

I think that a Filelock might help here. If not, one needs to pre-download the datasets.
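
For illustration, here is a minimal sketch of what such a lock could look like. It is not pertpy's implementation, and it uses the standard-library fcntl module (POSIX only) rather than any particular lock package. The names fetch_once, fake_download and demo are hypothetical, and the download callback is only a stand-in for the real download step. Every caller takes an exclusive lock on a sidecar .lock file, and only the first one to get it downloads; the others wait and then find the finished file.

```python
import fcntl  # POSIX only; not available on Windows
import os
import tempfile
import threading
from pathlib import Path


def fetch_once(target, download):
    """Call download(tmp_path) at most once per target, even with concurrent callers.

    `download` is a hypothetical callback that writes the file to the path it is given.
    """
    target = Path(target)
    lock_path = target.with_name(target.name + ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until every other holder of the lock has released it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not target.exists():
                # Download to a temporary name, then rename, so that a
                # partially written file is never visible under `target`.
                tmp_path = target.with_name(target.name + ".part")
                download(tmp_path)
                os.replace(tmp_path, target)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target


def demo():
    """Three threads request the same file; return (download count, file exists)."""
    calls = []

    def fake_download(path):
        # Stand-in for the real download; records each call.
        calls.append(path)
        Path(path).write_bytes(b"data")

    with tempfile.TemporaryDirectory() as tmp_dir:
        target = Path(tmp_dir) / "dataset.h5ad"
        threads = [
            threading.Thread(target=fetch_once, args=(target, fake_download))
            for _ in range(3)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        return len(calls), target.exists()
```

In a Snakefile, the same idea would mean wrapping the dataset download in something like fetch_once before the rules read the file. Whether that is enough here depends on where the lock in the original report is actually taken, so this is a starting point, not a confirmed fix.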

Zethson, Nov 03 '23 13:11