scenicplus

Stuck at GSEA Step

Open li-xuyang28 opened this issue 2 years ago • 10 comments

I am following the 10X Genomics PBMC tutorial and running the wrapper function. Everything was fine until the GSEA step, which has now been stuck for over 40 hours:

2023-04-26 17:32:13,593 GSEA         INFO     Subsetting TF2G adjacencies for TF with motif.
2023-04-26 17:32:19,727	INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
2023-04-26 17:32:20,376 GSEA         INFO     Running GSEA...
initializing:  23%|██▎       | 7094/31183 [23:37<05:21, 74.95it/s]  

The node log shows an error saying the node may be overloaded or terminated, or the network may be slow. However, the memory usage shown for the cluster is well below 10%.

2023-04-27 13:24:41,619	ERROR node_head.py:302 -- Cannot reach the node, c96dd5c6ab1a61bc93d4ee80eff792af1b8762ed22e5afa5eb6cbef5, after timeout 4. This node may have been overloaded, terminated, or the network is slow.
NoneType: None
2023-04-27 13:24:48,627	ERROR node_head.py:302 -- Cannot reach the node, c96dd5c6ab1a61bc93d4ee80eff792af1b8762ed22e5afa5eb6cbef5, after timeout 4. This node may have been overloaded, terminated, or the network is slow.
NoneType: None
2023-04-27 13:24:51,920	INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:24:51 +0000] 'GET /nodes?view=summary HTTP/1.1' 200 9532 bytes 6260 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
2023-04-27 13:24:51,923	INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:24:51 +0000] 'GET /nodes/c96dd5c6ab1a61bc93d4ee80eff792af1b8762ed22e5afa5eb6cbef5 HTTP/1.1' 200 9871 bytes 1948 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
2023-04-27 13:24:54,614	INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:24:54 +0000] 'GET /log_index HTTP/1.1' 200 391 bytes 43230 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
2023-04-27 13:24:55,633	ERROR node_head.py:302 -- Cannot reach the node, c96dd5c6ab1a61bc93d4ee80eff792af1b8762ed22e5afa5eb6cbef5, after timeout 4. This node may have been overloaded, terminated, or the network is slow.
NoneType: None
2023-04-27 13:24:56,226	INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:24:56 +0000] 'GET /log_proxy?url=http%3A%2F%2F127.0.0.1%3A52365%2Flogs HTTP/1.1' 200 3130 bytes 103802 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
2023-04-27 13:24:58,168	INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:24:58 +0000] 'GET /log_proxy?url=http%3A%2F%2F127.0.0.1%3A52365%2Flogs%2Fdashboard.err HTTP/1.1' 200 660 bytes 8014 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
2023-04-27 13:25:01,210	INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:25:01 +0000] 'GET /log_proxy?url=http%3A%2F%2F127.0.0.1%3A52365%2Flogs HTTP/1.1' 200 3130 bytes 5785 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'

According to the Ray dashboard there is still activity in the cluster, but progress has been stuck at 7094/31183 and I can't figure out why it is taking more than 40 hours.

li-xuyang28 avatar Apr 27 '23 17:04 li-xuyang28

Hi @li-xuyang28

Hmm, 40 hours is really very long.

How many cores were you using?

Best,

Seppe

SeppeDeWinter avatar May 08 '23 13:05 SeppeDeWinter

Hi,

I'm running on 8 cores (fewer than the 12 suggested by the tutorial), locally on an iMac, because I had so much trouble getting Ray to work on the cluster I have access to. Somehow the entire thing was extremely slow for me. I restarted the process, but it is still stuck at the same step.

This is the 10X PBMC multiome data, but I did change the cell type annotation a bit (I split the T cells into a few more subtypes).

Best, Yang

li-xuyang28 avatar May 09 '23 02:05 li-xuyang28

Hi again @SeppeDeWinter,

I tried subsetting the object and running the build_grn function several times on the 10X PBMC data, but every run got stuck during initialization (at around 16166/18918). It takes about 4 minutes to get through the items that were processed (consistent with the tutorial), and then it stays stuck indefinitely (>24h). According to the Ray dashboard there was still activity, but the nodes seemed to be idle. Is there any information I could provide to help figure out what is happening?

Best, Yang

li-xuyang28 avatar May 10 '23 17:05 li-xuyang28

It is possible that one of the worker processes crashed and Ray didn't detect it, so it assumes the worker is still running. Try with fewer cores or on a more powerful machine.

ghuls avatar Jun 20 '23 16:06 ghuls
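The failure mode described above, a worker that silently stops while the caller keeps waiting, can be illustrated outside of Ray. The following is only a minimal standard-library sketch, not Ray's or scenicplus's mechanism: it waits for each task with a timeout so that a stalled task is reported instead of blocking forever. The names `find_stalled_tasks` and `fake_module` are hypothetical, and the sleep merely simulates a slow worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def find_stalled_tasks(task, args_list, timeout_s, max_workers=4):
    """Run task on every argument; report arguments whose result did not arrive in time.

    Returns (results, stalled): a dict of finished results and a list of
    arguments for which waiting timed out after timeout_s seconds.
    """
    results, stalled = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {arg: pool.submit(task, arg) for arg in args_list}
        for arg, fut in futures.items():
            try:
                results[arg] = fut.result(timeout=timeout_s)
            except FutureTimeout:
                # The task may still be running; we only stop waiting for it.
                stalled.append(arg)
    # Leaving the "with" block waits for the remaining threads, so this demo
    # uses a task that eventually finishes. Threads cannot be killed.
    return results, stalled


def fake_module(n):
    """Hypothetical stand-in for one unit of work; input 3 simulates a slow worker."""
    time.sleep(1.0 if n == 3 else 0.0)
    return n * n


results, stalled = find_stalled_tasks(fake_module, [1, 2, 3, 4], timeout_s=0.2)
print("finished:", results)
print("stalled:", stalled)
```

In a real pipeline the stalled inputs could be logged and retried with fewer parallel workers, which is in line with the suggestion to lower the core count.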

Hi, has this problem been solved? I have run into the same issue.

CYorick avatar Oct 26 '23 13:10 CYorick

Hi @CYorick

It might be memory related. The code in the development branch is more memory friendly.

See https://github.com/aertslab/scenicplus/discussions/202 on how to use it.

All the best,

Seppe

SeppeDeWinter avatar Oct 26 '23 13:10 SeppeDeWinter

Thanks for your reply. Should I simply download the Snakemake directory without changing anything else and run the whole pipeline automatically? What if I only want to run the build_grn function?

Best, Yorick

CYorick avatar Oct 26 '23 17:10 CYorick

The problem can be solved by setting "ray_n_cpu" to None.

CYorick avatar Oct 26 '23 18:10 CYorick
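For readers wondering what a value of None usually means for a CPU-count parameter: Ray's documented behaviour for `ray.init(num_cpus=None)` is to detect the number of CPUs from the machine rather than to fall back to a single core. Whether scenicplus forwards `ray_n_cpu` to Ray unchanged is an assumption here, not something confirmed in this thread. The sketch below mimics that "None means autodetect" convention with the standard library only; `resolve_cpu_count` is a hypothetical helper, not part of scenicplus or Ray.

```python
import os


def resolve_cpu_count(ray_n_cpu=None):
    """Return the number of CPUs to use.

    None means "use whatever the machine reports", mirroring the common
    autodetect convention. This is an illustrative helper, not a scenicplus
    or Ray function.
    """
    if ray_n_cpu is None:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    if ray_n_cpu < 1:
        raise ValueError("ray_n_cpu must be a positive integer or None")
    return ray_n_cpu


explicit = resolve_cpu_count(8)      # the caller pins the count
detected = resolve_cpu_count(None)   # the machine decides
print(explicit, detected)
```

Under that assumption, None would not force single-core execution; it would let the scheduler size itself to the available hardware.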

What do you mean by setting "ray_n_cpu" to None? Does that mean using a single core?

I tried the fix the message suggests, cleaning the temporary directory and re-running the code, but it has not worked, even though I have 600 GB of disk space. These are the errors I get:

(_ray_run_gsea_for_e_module pid=959428) /home/roger/anaconda3/envs/scenicplus/lib/python3.8/site-packages/gseapy/algorithm.py:87: RuntimeWarning: divide by zero encountered in divide
(_ray_run_gsea_for_e_module pid=959428)   norm_tag = 1.0 / sum_correl_tag
(_ray_run_gsea_for_e_module pid=959428) /home/roger/anaconda3/envs/scenicplus/lib/python3.8/site-packages/gseapy/algorithm.py:91: RuntimeWarning: invalid value encountered in multiply
(_ray_run_gsea_for_e_module pid=959428)   tag_indicator * correl_vector * norm_tag - no_tag_indicator * norm_no_tag,
(raylet) Spilled 5732 MiB, 13998 objects, write throughput 2560 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.

If anybody has encountered the same issue or could help me, that would be great.

Thank you.

rogercasalsfr avatar Dec 18 '23 09:12 rogercasalsfr
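The two RuntimeWarning lines in the log above point at a normalization step of the form `norm_tag = 1.0 / sum_correl_tag`. In a GSEA-style running sum, that denominator is the summed weight of the genes that belong to the set, so it is zero when every gene in the set has a zero ranking weight, and the resulting inf/NaN values then propagate into the multiplication. The following is a simplified pure-Python sketch of that calculation with an explicit guard. It is not gseapy's implementation; the function name `enrichment_score` and the toy data are made up for illustration. Since these are warnings rather than errors, they may be unrelated to the stall itself.

```python
from itertools import accumulate


def enrichment_score(ranked_genes, weights, gene_set):
    """Return the largest-magnitude deviation of a GSEA-style running sum.

    Returns None when the score is undefined: no hits, no misses, or hits
    whose weights sum to zero (the divide-by-zero case).
    """
    tag = [1 if gene in gene_set else 0 for gene in ranked_genes]
    n_hit = sum(tag)
    n_miss = len(ranked_genes) - n_hit
    sum_correl_tag = sum(abs(w) * t for w, t in zip(weights, tag))
    if n_hit == 0 or n_miss == 0 or sum_correl_tag == 0:
        return None  # guard instead of dividing by zero

    norm_tag = 1.0 / sum_correl_tag
    norm_no_tag = 1.0 / n_miss
    # Step up on hits (weighted), step down on misses (uniform).
    steps = [t * abs(w) * norm_tag - (1 - t) * norm_no_tag
             for w, t in zip(weights, tag)]
    running = list(accumulate(steps))
    return max(running, key=abs)


genes = ["a", "b", "c", "d"]
es_ok = enrichment_score(genes, [2.0, 1.0, 0.5, 0.1], {"a", "b"})
es_zero = enrichment_score(genes, [0.0, 0.0, 0.5, 0.1], {"a", "b"})
print(es_ok, es_zero)
```

The second call shows the degenerate case: both genes in the set carry zero weight, so the sketch returns None rather than producing inf or NaN.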

Hi @rogercasalsfr

I would also suggest using the development version of the code. See https://github.com/aertslab/scenicplus/discussions/202 for more info.

All the best,

Seppe

SeppeDeWinter avatar Dec 21 '23 13:12 SeppeDeWinter