pySCENIC
Error: Unable to allocate 3.11 GiB for an array with shape (17627, 23657) and data type float64
Hello, I am running pySCENIC from the CLI, and it appears to be installed properly. I wanted to run the grn step on a large dataset (17627, 23657), but after loading the expression matrix I ran into the following error. Does anyone have an idea how I can fix it?
I have played a lot with --cpus-per-task and --mem-per-cpu, but the error keeps recurring.
I would appreciate any suggestions.
2022-08-21 00:52:03,871 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.
Traceback (most recent call last):
  File "/opt/venv/bin/pyscenic", line 8, in <module>
    sys.exit(main())
  File "/opt/venv/lib/python3.7/site-packages/pyscenic/cli/pyscenic.py", line 675, in main
    args.func(args)
  File "/opt/venv/lib/python3.7/site-packages/pyscenic/cli/pyscenic.py", line 61, in find_adjacencies_command
    args.gene_attribute,
  File "/opt/venv/lib/python3.7/site-packages/pyscenic/cli/utils.py", line 143, in load_exp_matrix
    df = pd.read_csv(fname, sep=suffixes_to_separator(extension), header=0, index_col=0)
  File "/opt/venv/lib/python3.7/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/venv/lib/python3.7/site-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/opt/venv/lib/python3.7/site-packages/pandas/io/parsers.py", line 1069, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "/opt/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 529, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/opt/venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 287, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/opt/venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 95, in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/opt/venv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1701, in create_block_manager_from_arrays
    blocks = _form_blocks(arrays, names, axes)
  File "/opt/venv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1759, in _form_blocks
    float_blocks = _multi_blockify(items_dict["FloatBlock"])
  File "/opt/venv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1852, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "/opt/venv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1880, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
numpy.core._exceptions.MemoryError: Unable to allocate 3.11 GiB for an array with shape (17627, 23657) and data type float64
Hi @Erfan1369
The error numpy.core._exceptions.MemoryError: Unable to allocate 3.11 GiB for an array with shape (17627, 23657) and data type float64
suggests you don't have enough memory available on your system. What type of system are you using (how much memory do you have available)?
Hi,
I run the pipeline on Linux on the Beluga cluster (CLI). Here is the memory information:

              total        used        free      shared  buff/cache   available
Mem:          187Gi        53Gi       102Gi       6.3Gi        31Gi       125Gi
Swap:            0B          0B          0B
and here are the commands in the submitted .bash job script:
#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=7
#SBATCH --mem-per-cpu=3GB
#SBATCH --account=def-gevrynic
#SBATCH --job-name=pycenic_test1
#SBATCH --output=%x-%j.out
pyscenic grn -o GCs_genie3_singularity1.csv -m genie3 -t --num_workers 7 GCs_all_cluster.csv tfs_fname.txt
I have played with --cpus-per-task and --mem-per-cpu (up to 1T), as well as --num_workers (e.g. 7, 20, 40), but it did not work.
Looking at https://researchcomputing.princeton.edu/support/knowledge-base/memory, it seems that you specified the mem-per-cpu value incorrectly (it should be 3G instead of 3GB).
In the memory allocation error you got, you can also see that you need slightly more than 3 GiB of memory just for that single allocation. For the actual calculation you will still need more.
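To make that concrete, a quick back-of-the-envelope calculation (plain Python, shown only for illustration) reproduces the 3.11 GiB figure: a dense float64 matrix of that shape needs rows x columns x 8 bytes.

# Rough estimate of the memory needed just to hold the dense expression matrix.
# The shape comes from the error message; float64 uses 8 bytes per value.
n_cells, n_genes = 17627, 23657
bytes_needed = n_cells * n_genes * 8
print(f"{bytes_needed / 2**30:.2f} GiB")  # prints "3.11 GiB", matching the error
# The GRN step itself (GENIE3 workers, intermediate copies) needs several times this on top.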
Try something like this:
#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=7
#SBATCH --mem-per-cpu=6G
#SBATCH --account=def-gevrynic
#SBATCH --job-name=pycenic_test1
#SBATCH --output=%x-%j.out
pyscenic grn -o GCs_genie3_singularity1.csv -m genie3 -t --num_workers 7 GCs_all_cluster.csv tfs_fname.txt
Depending on how much memory you are allowed to allocate per CPU (you might be limited to e.g. 4G per CPU), you may need to request more CPUs per task (e.g. 14) and set the number of pySCENIC workers much lower (e.g. 7), so each pySCENIC worker gets double the amount of memory specified by --mem-per-cpu=3G:
#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=14
#SBATCH --mem-per-cpu=3G
#SBATCH --account=def-gevrynic
#SBATCH --job-name=pycenic_test1
#SBATCH --output=%x-%j.out
pyscenic grn -o GCs_genie3_singularity1.csv -m genie3 -t --num_workers 7 GCs_all_cluster.csv tfs_fname.txt
In the job output, you should normally also have a section that tells you how much memory was reserved; it was probably not the amount you expected.
Sorry for the delay and thanks for the response!
- Actually, I tried your recommendation and also different memory settings (up to 256G), using both Singularity and a virtual env, but the problem remains unsolved and I still get the same error.
- About the stochastic nature of pySCENIC, I was wondering whether it only applies to the first step (grn) or to the whole pipeline? To put it another way, would running pyscenic grn multiple times (10-100) and keeping only TF-target links that recur above a reproducible threshold (e.g. 80%) be a reasonable strategy to reduce the stochastic effect?
regards,
Erfan
Hi!
Although I have not received further responses from the pySCENIC developers, I would like to share another problem I ran into while running pySCENIC from the CLI, in case others have the same problem or other researchers who use pySCENIC have already solved it.
I was trying to run pySCENIC from the CLI with the following settings:
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=7
#SBATCH --mem=55G
#SBATCH --output=%x-%j.out
module load singularity
singularity exec ./pySCENIC_0.12.0.sif pyscenic grn -o /GCs_grn/GCs_genie3_singularity1.csv -m genie3 -t --num_workers 7 GCs_all_cluster.csv tfs_fname.txt
After 3 days, the following warnings had piled up in the .out file. I know these are warnings rather than errors, but I don't know whether they affect the result or not (although the job eventually crashed after running out of time).
2022-08-26 10:52:40,868 - pyscenic.cli.pyscenic - INFO - Inferring regulatory networks.
2022-08-28 03:04:58,953 - distributed.worker_memory - WARNING - gc.collect() took 1.446s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.
2022-08-28 03:04:59,513 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.48 GiB -- Worker memory limit: 7.86 GiB
2022-08-28 06:55:44,175 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.51 GiB -- Worker memory limit: 7.86 GiB
[... the same "Unmanaged memory use is high" warning repeated many times ...]
2022-08-28 23:39:04,877 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.50 GiB -- Worker memory limit: 7.86 GiB
As I mentioned in previous posts, the expression matrix I use is quite large (17627 cells and 23657 genes), and I have already experimented a lot with the memory and CPU settings in the bash file.
I would be thankful if anyone who has experienced the same situation could help me navigate the issue.
Which version of pySCENIC do you have installed?
The latter is not a bug (see the dask webpage).
Thanks for the reply,
I use the Singularity image pySCENIC_0.12.0.sif.
Also, do you have any idea about this?
About the stochastic nature of pySCENIC, I was wondering whether it only applies to the first step (grn) or to the whole pipeline? To put it another way, would running pyscenic grn multiple times (10-100) and keeping only TF-target links that recur above a reproducible threshold (e.g. 80%) be a reasonable strategy to reduce the stochastic effect?
About the stochastic nature of pySCENIC, I was wondering whether it only applies to the first step (grn) or to the whole pipeline? To put it another way, would running pyscenic grn multiple times (10-100) and keeping only TF-target links that recur above a reproducible threshold (e.g. 80%) be a reasonable strategy to reduce the stochastic effect?
Yes, running it 100 times and keeping only the links that recur in more than e.g. 10% of the runs is a good strategy.
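For illustration, a minimal pandas sketch of that kind of aggregation could look like the following. The per-run file names, the TF/target column names, and the 10% cut-off are assumptions you would adapt to your own output files.

# Minimal sketch: keep TF-target links that recur in more than 10% of repeated grn runs.
# Assumes each run wrote an adjacency CSV with columns "TF", "target", "importance"
# (column names and the file pattern are assumptions; adjust to your own output).
import glob
import pandas as pd

files = sorted(glob.glob("GCs_genie3_run*.csv"))   # hypothetical per-run output files
n_runs = len(files)

counts = {}
for f in files:
    adj = pd.read_csv(f)
    # Count each TF-target pair at most once per run.
    for tf, target in adj[["TF", "target"]].drop_duplicates().itertuples(index=False):
        counts[(tf, target)] = counts.get((tf, target), 0) + 1

threshold = 0.10 * n_runs                           # keep links seen in >10% of runs
kept = [(tf, tg, c) for (tf, tg), c in counts.items() if c > threshold]
pd.DataFrame(kept, columns=["TF", "target", "n_runs_seen"]).to_csv(
    "GCs_genie3_recurrent_links.csv", index=False
)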