
prune2df running for more than 140h

Open JPcerapio opened this issue 5 years ago • 22 comments

Hello, I managed to get to Phase II of your tutorial with your data.

But after it had been running for 145h, I stopped the process. I don't know whether it is normal for it to run that long.

Thanks for your help.

Jp

Here is some info:

dbs [FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-5kb-7species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-10kb-7species.mc9nr"), FeatherRankingDatabase(name="mm9-500bp-upstream-10species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-10kb-10species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr")]

PHASE I

network = grnboost2(expression_data=ex_matrix2, gene_names=gene_names, tf_names=tf_names)  # 6h of running

modules = list(modules_from_adjacencies(network, ex_matrix))

PHASE II

with ProgressBar():
    df = prune2df(dbs, modules, "/home/user/pySCENIC/data_bases/Mm/motifs-v9-nr.mgi-m0.001-o0.0.tbl")

[####################################### ] | 98% Completed | 25min 37.2s
2020-02-12 15:05:46,854 - pyscenic.transform - WARNING - Less than 80% of the genes in Tcf21 could be mapped to mm9-tss-centered-5kb-10species.mc9nr. Skipping this module.
[####################################### ] | 98% Completed | 25min 45.6s
2020-02-12 15:05:55,227 - pyscenic.transform - WARNING - Less than 80% of the genes in Mef2d could be mapped to mm9-tss-centered-5kb-10species.mc9nr. Skipping this module.
[####################################### ] | 98% Completed | 25min 46.4s
2020-02-12 15:05:56,007 - pyscenic.transform - WARNING - Less than 80% of the genes in Meox2 could be mapped to mm9-tss-centered-5kb-10species.mc9nr. Skipping this module.
[####################################### ] | 99% Completed | 18hr 5min 13.5s
[####################################### ] | 99% Completed | 145hr 0min 28.9s
^CProcess ForkPoolWorker-446:

JPcerapio avatar Feb 26 '20 10:02 JPcerapio

I'm having a similar issue. The progress bar climbs relatively quickly to a point and then stalls: no error message, but no output either.

This happened both on Linux and with Anaconda on Windows.

jk86754 avatar Mar 01 '20 14:03 jk86754

Hello @jk86754 , did you let it finish? I had to stop it; I think 145h is quite a lot for a small set of samples.

Jp

JPcerapio avatar Mar 02 '20 07:03 JPcerapio

Hi @JPcerapio , @jk86754 ,

This step should definitely not take 145 hours. This seems to be a bug in the pruning step, similar to #104 . Running this step via the CLI seems to have worked for others, could you try this?
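For reference, a sketch of what the pruning step looks like via the CLI. The file names here are placeholders, not from this thread; the flags mirror the working command posted later in this thread:

```shell
# Hypothetical input paths; substitute your own adjacencies, ranking database,
# motif annotations, and expression matrix.
pyscenic ctx adjacencies.tsv \
    ranking_db.feather \
    --annotations_fname motifs.tbl \
    --expression_mtx_fname expr_mat.csv \
    --output regulons.csv \
    --num_workers 20
```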

cflerin avatar Mar 02 '20 10:03 cflerin

Hey @cflerin , thanks for your answer. I will try it, but the problem with this option is that we do not have access to the intermediate files and results that we would like to have.

I don't know if anyone has figured out whether the error comes from some missing dependency or library.

Jp

JPcerapio avatar Mar 02 '20 12:03 JPcerapio

Hi, @JPcerapio , which intermediate files are you referring to? When you run this step in the CLI, you can still get the motif and regulon information. Although the CLI outputs only one of these, you can convert to the other without re-running, for example: #100

cflerin avatar Mar 03 '20 09:03 cflerin

Hello, I am using the pySCENIC CLI and "creating regulons" has been running for over a week:

2020-04-06 09:15:03,025 - pyscenic.cli.pyscenic - INFO - Calculating regulons.

My data set is quite big (69,000 cells and 27,000 genes), but I am running on a cluster with 64 cores and 1 TB of RAM.

thanks for your help, morgane

morganee261 avatar Apr 14 '20 17:04 morganee261

Hi @morganee261 ,

Have you solved this problem? I'm also running the CLI (pyscenic ctx) and it's taking a long time.

Thanks, Boxun

liboxun avatar May 06 '20 23:05 liboxun

Hi @liboxun,

Unfortunately no, I haven't had any luck. It has been running for a month now (and still is), and I have not heard back from the developers of this package. Thanks,

Morgane

morganee261 avatar May 06 '20 23:05 morganee261

Hi @morganee261 , @liboxun ,

This step should definitely not take this long. If it's been running for a month there's clearly something wrong and I would stop it.

I've seen this issue a few times before, but I haven't been able to reproduce the problem to see where and why this step hangs, so I can't offer you a good solution. A few suggestions:

  • Try the Docker image, which has been working reliably for me recently. This would (hopefully) address any package version conflicts. (See here).
  • Try the CLI version of pyscenic ctx (see here).
  • Restart the process if this step seems to hang. For a dataset of 10k cells and 20k genes, this should run in ~10 minutes using 20 processes and two gene-based feather databases (human).
  • Try running with just a single feather database.

cflerin avatar May 08 '20 19:05 cflerin

Thanks a lot @cflerin ! Since I'm already running the CLI version, I'll try switching to the Docker image or using just a single feather database.

I'll post an update here once I have results.

liboxun avatar May 08 '20 20:05 liboxun

Hello @cflerin,

I have been running the pyscenic ctx CLI, and that is what got stuck for over a month. I stopped it and started running it with a single feather database.

I am also trying to run the Docker image, but I am not very familiar with it and I ran into an error:

docker run -it --rm \
    -v /home/Morgane/mapping/int:/scenicdata \
    aertslab/pyscenic:[version] pyscenic grn \
    --num_workers 20 \
    -o /scenicdata/expr_mat.adjacencies.tsv \
    /scenicdata/ex_matrix.csv \
    /scenicdata/hgnc_tfs.txt

docker: invalid reference format. See 'docker run --help'.
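The "docker: invalid reference format" message is Docker rejecting the literal "[version]" placeholder in the image name; it needs a concrete tag. A sketch of the same command with a real tag substituted (0.10.0 is the tag used in the commands that worked later in this thread):

```shell
# Substitute a concrete tag for the [version] placeholder, e.g. 0.10.0:
docker run -it --rm \
    -v /home/Morgane/mapping/int:/scenicdata \
    aertslab/pyscenic:0.10.0 pyscenic grn \
    --num_workers 20 \
    -o /scenicdata/expr_mat.adjacencies.tsv \
    /scenicdata/ex_matrix.csv \
    /scenicdata/hgnc_tfs.txt
```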

could you please advise?

thanks, for your reply and your help,

Morgane

morganee261 avatar May 11 '20 21:05 morganee261

Hi @cflerin ,

I went back and ran the CLI with a single feather database, and it didn't help. It still got stuck forever at:

2020-05-08 15:14:50,014 - pyscenic.utils - INFO - Creating modules.

2020-05-08 15:16:46,513 - pyscenic.cli.pyscenic - INFO - Loading databases.

2020-05-08 15:16:46,515 - pyscenic.cli.pyscenic - INFO - Calculating regulons.

slurmstepd: error: *** JOB 1697596 ON NucleusA007 CANCELLED AT 2020-05-10T15:14:05 DUE TO TIME LIMIT ***

But when I tried the Singularity image of pySCENIC 0.10.0 (since Docker isn't available on our HPC system), it certainly helped. Now I actually got a progress bar, although it failed at 57%:

[###################### ] | 57% Completed | 3hr 9min 9.2s

It failed because it ran out of memory:

2020-05-08 21:23:27,584 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF165 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.

2020-05-08 21:24:06,771 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF2 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.

2020-05-08 21:47:28,929 - pyscenic.transform - ERROR - Unable to process "Regulon for NFKB1" on database "hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr" because ran out of memory. Stacktrace:

2020-05-08 21:47:31,092 - pyscenic.transform - ERROR - Unable to process "Regulon for ZNF81" on database "hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr" because ran out of memory. Stacktrace:

2020-05-08 21:47:51,126 - pyscenic.transform - ERROR - Traceback (most recent call last):
  File "/opt/venv/lib/python3.7/site-packages/pyscenic/transform.py", line 185, in module2df
    weighted_recovery=weighted_recovery)
  File "/opt/venv/lib/python3.7/site-packages/pyscenic/transform.py", line 159, in module2features_auc1st_impl
    avg2stdrcc = avgrcc + 2.0 * rccs.std(axis=0)
  File "/opt/venv/lib/python3.7/site-packages/numpy/core/_methods.py", line 217, in _std
    keepdims=keepdims)
  File "/opt/venv/lib/python3.7/site-packages/numpy/core/_methods.py", line 193, in _var
    x = asanyarray(arr - arrmean)
MemoryError: Unable to allocate array with shape (24453, 5000) and data type float64

2020-05-08 21:47:51,441 - pyscenic.transform - ERROR - Traceback (most recent call last):
  (same traceback as above)
MemoryError: Unable to allocate array with shape (24453, 5000) and data type float64

Bus error

I used a node with 32 GB of memory, with 32 workers. Is that too little? What would you recommend?
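As a rough sanity check on whether 32 GB is too little: the MemoryError above reports a float64 array of shape (24453, 5000), and with 32 workers each one can materialize such an array at the same time. A back-of-the-envelope sketch (the one-array-per-worker count is an assumption; real usage also includes the ranking databases themselves):

```python
# Size of one float64 array of the shape reported in the MemoryError above.
genes, chunk = 24453, 5000
per_array_gib = genes * chunk * 8 / 2**30   # 8 bytes per float64

workers = 32
peak_gib = per_array_gib * workers          # if every worker allocates one at once

print(f"{per_array_gib:.2f} GiB per array")             # 0.91 GiB
print(f"~{peak_gib:.0f} GiB across {workers} workers")  # ~29 GiB
```

That already nearly saturates a 32 GB node before counting the databases, so fewer workers or more memory both help.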

Thanks! Boxun

liboxun avatar May 12 '20 15:05 liboxun

Hi @liboxun,

I got it to run in less than 14 minutes by using the Docker image. I used 20 cores, so the more the better, I think. Here is my code (note that each whole command is on one line, without line-continuation characters; the multi-line code from the tutorial did not work for me):

sudo docker pull aertslab/pyscenic:0.10.0

sudo docker run -it --rm -v /path/to/data:/scenicdata aertslab/pyscenic:0.10.0 pyscenic grn --num_workers 20 --transpose -o /scenicdata/expr_mat.adjacencies.tsv /scenicdata/ex_matrix.csv /scenicdata/hgnc_tfs.txt

I had to transpose my expression matrix to get it into the right format, but you might not have to:

sudo docker run -it --rm -v /path/to/data:/scenicdata aertslab/pyscenic:0.10.0 pyscenic ctx /scenicdata/expr_mat.adjacencies.tsv /scenicdata/hg19-tss-centered-10kb-7species.mc9nr.feather /scenicdata/hg19-500bp-upstream-7species.mc9nr.feather --annotations_fname /scenicdata/motifs-v9-nr.hgnc-m0.001-o0.0.tbl --expression_mtx_fname /scenicdata/ex_matrix.csv --transpose --mode "dask_multiprocessing" --output /scenicdata/regulons.csv --num_workers 20

# this ran in 14 min on a server with 1 TB of RAM, using 20 of 64 cores

sudo docker run -it --rm -v /path/to/data:/scenicdata aertslab/pyscenic:0.10.0 pyscenic aucell /scenicdata/ex_matrix.csv --transpose /scenicdata/regulons.csv -o /scenicdata/auc_mtx.csv --num_workers 20

# this took less than 10 min

hope this helps!

morgane

morganee261 avatar May 12 '20 16:05 morganee261

Hi @morganee261 ,

Thanks for that tip! Glad to hear it eventually worked for you.

I also got it to run (~23min) when I bumped the task over to a node with 128GB of memory (using 32 out of 32 cores).

Best, Boxun

liboxun avatar May 12 '20 19:05 liboxun

Hi @cflerin

I am trying to import the results of the pySCENIC CLI (3 csv files) into R for further analysis, but I am having a lot of problems.

It seems that having a loom file helps with the import; however, your CLI tutorial exports csv files.

Could you please provide a brief tutorial on how to import them into R, so I can run the rest of the SCENIC script and look at the data?

thanks for your help,

Morgane

morganee261 avatar May 12 '20 23:05 morganee261

Hi @liboxun

I am having issues with the downstream analysis. I was wondering what platform you were using and whether you had any luck with it. I have imported a loom file into R, but the format is very different from the tutorial.

Thanks, Morgane

morganee261 avatar May 13 '20 16:05 morganee261

Hi @morganee261 ,

I use Python. I haven't done any downstream analysis yet. I'll let you know how it goes in the next couple of weeks.

Best of luck, Boxun

liboxun avatar May 13 '20 19:05 liboxun

Hi @morganee261 ,

I was able to run the example Jupyter notebook successfully for the 10x PBMC dataset:

https://github.com/aertslab/SCENICprotocol/blob/master/notebooks/PBMC10k_downstream-analysis.ipynb

This notebook is written in Python and is meant for analysis downstream of pyscenic grn and pyscenic ctx (i.e. after you generate adj.tsv and regulons.csv).

While there were several issues (some due to wrong versions of dependencies, which thankfully were easy enough to fix myself), I could largely run through the notebook smoothly.

Hopefully this helps! I'm not sure if there's an equivalent example in R, but I'd assume there is, since the original SCENIC was written in R.

Best, Boxun

liboxun avatar May 21 '20 16:05 liboxun


Hi @liboxun , I ran into the same problem as you. The progress bar climbs relatively quickly to 97% and then stalls there: no error message, but no output either. I noticed my 64 GB of RAM was used up and none of it was released; it seems a bug was eating all the memory. Could you kindly tell me how you finally worked it out? Did you use the Docker image, use only one feather file, or just move to a more powerful computer? Also, could you tell me the versions you used (Python, CLI, Jupyter, and so on)?

Many thanks.

Weijian

ureyandy2009 avatar Jul 03 '20 07:07 ureyandy2009

Hi @ureyandy2009 ,

For me, a combination of two changes worked:

  1. I switched from CLI to Singularity image (Docker image should work the same way);
  2. I used a computer with 128GB RAM instead of 32GB.

Hopefully this helps!

Best, Boxun

liboxun avatar Jul 03 '20 15:07 liboxun

Thank you very much.

I think RAM may be the main problem. In my case (24 processors at 4.2 GHz and 64 GB of RAM), one feather database costs about 40 GB of RAM, so the computer shut down when I used two feathers at the same time. The problem was solved when I used only one feather, which used 40 GB of the 64 GB, and then prune2df ran in less than 10 minutes.
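A tiny budgeting helper along the lines of these numbers. The 40 GB per database and the headroom figure are illustrative values taken from this thread, not pySCENIC constants:

```python
def fits_in_ram(n_databases, gb_per_db=40, total_gb=64, headroom_gb=8):
    """Rough check: can n_databases ranking databases be loaded at once?"""
    return n_databases * gb_per_db + headroom_gb <= total_gb

print(fits_in_ram(1))  # True  -> one database fits in 64 GB
print(fits_in_ram(2))  # False -> run the databases one at a time instead
```

When two databases don't fit, running prune2df once per database and concatenating the resulting tables should give the same enrichment results, since each database is scored independently.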

Many thanks.

ureyandy2009 avatar Jul 05 '20 01:07 ureyandy2009

I faced the same issue recently and spent three days trying to figure it out. The Singularity build wouldn't run for me on my institute's HPC; I kept getting this error:

ERROR: You must install squashfs-tools to build images
ABORT: Aborting with RETVAL=255

A conda installation of squashfs-tools didn't work, and a system-wide installation would have been a hassle, so I didn't do it. What worked for me is the following:

My data set: 14766 cells × 23011 genes

1. Start an interactive session:

srun --time=20:00:00 --partition=upgrade --nodes=1 --ntasks=1 --mem=128G --cpus-per-task=40 --pty /bin/bash -l

2. Activate the conda environment where pySCENIC is installed.

3. Run this script. Everything is the same as in the tutorial script (https://pyscenic.readthedocs.io/en/latest/tutorial.html);

I just added: from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    adata = ad.read_loom('adata.all.pocessed.loom')
    ex_matrix = adata.to_df()

    tf_names = load_tf_names(MM_TFS_FNAME)
    db_fnames = glob.glob(DATABASES_GLOB)

    def name(fname):
        return os.path.splitext(os.path.basename(fname))[0]

    dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]

    adjacencies = pd.read_csv("net2.tsv", index_col=False, sep='\t')
    modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

    # Calculate a list of enriched motifs and the corresponding target genes for all modules.
    with ProgressBar():
        df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME, client_or_address=Client(LocalCluster()))

    # Create regulons from this table of enriched motifs.
    regulons = df2regulons(df)

    # Save the enriched motifs and the discovered regulons to disk.
    df.to_csv(MOTIFS_FNAME)
    with open(REGULONS_FNAME, "wb") as f:
        pickle.dump(regulons, f)

Total time consumed: 50 minutes.

naila53 avatar Feb 06 '21 17:02 naila53