sleuth

Reporting RAM requirements per sample when using multiple cores

Open teshomem opened this issue 8 years ago • 14 comments

Hi,

sleuth_0.29.0 R 3.3

Although I run sleuth from the command line with num_cores set to an integer, sleuth always uses only 1 core, except at the end, when it throws this error message:

summarizing bootstraps ....................NULL
Error in sleuth_prep(guide, num_cores = 20, target_mapping = ttg, aggregation_column = "ncbi_gene", :
  At least one core from mclapply had an error. See the above error message(s) for more details.
In addition: Warning messages:
1: In sleuth_prep(guide, num_cores = 20, target_mapping = ttg, aggregation_column = "ncbi_gene", :
  26488 target_ids are missing annotations for the aggregation_column: ncbi_gene. These target_ids will be dropped from the gene-level analysis. If you did not expect this, check your 'target_mapping' table for missing values.
2: In parallel::mclapply(x, y, mc.cores = num_cores) :
  all scheduled cores encountered errors in user code
Execution halted

so <- sleuth_prep(guide, target_mapping = ttg, num_cores=20, aggregation_column = 'ncbi_gene', extra_bootstrap_summary = TRUE)
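
(As a side note, the 26488 unannotated target_ids reported in warning 1 can be counted directly from the target_mapping table; a minimal sketch, assuming ttg is a data frame with the ncbi_gene column used above:)

# count target_ids whose ncbi_gene annotation is missing or empty
sum(is.na(ttg$ncbi_gene) | ttg$ncbi_gene == "")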

Why is that?

Best, Teshome

teshomem avatar Sep 20 '17 10:09 teshomem

Hi @teshomem,

It appears that an error occurs during the summarizing step, and it happens on all cores (2: In parallel::mclapply(x, y, mc.cores = num_cores) : all scheduled cores encountered errors in user code; Execution halted). I'm working to improve the error reporting, but for now, what happens if you run with num_cores = 1? Does it work fine, or does it report an error? If it reports an error, you will need to either fix that issue or report back to us before it will run smoothly with all 20 cores.
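
For example, a minimal serial run reusing the call from your post; with a single worker, the underlying error message is printed directly instead of being hidden behind mclapply:

# same call as before, but serial (num_cores = 1) so errors surface directly
so <- sleuth_prep(guide, target_mapping = ttg, num_cores = 1,
                  aggregation_column = 'ncbi_gene', extra_bootstrap_summary = TRUE)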

warrenmcg avatar Sep 20 '17 13:09 warrenmcg

Hi @warrenmcg,

Thank you for improving the error reporting. I started the script, and it will take a while because I have 168 samples with two conditions. I will report back when it is done.

teshomem avatar Sep 20 '17 14:09 teshomem

Hi @teshomem,

If all scheduled cores encountered errors, the script should break on your first sample and should be done pretty quickly. If it doesn't, then there might be issues with pushing the machine too hard (e.g. your machine runs out of RAM in the midst of processing all of the data).

warrenmcg avatar Sep 20 '17 14:09 warrenmcg

Hi @warrenmcg,

I thought the same and ran the script on a machine with 1 TB of RAM. RAM usage was <1%. It hangs on 'summarizing bootstraps' for a long time.

teshomem avatar Sep 20 '17 14:09 teshomem

Hi @warrenmcg,

It is still running with num_cores=1 option. Please check my design here http://dpaste.com/386YKWB if you are interested.

teshomem avatar Sep 20 '17 18:09 teshomem

Hi @teshomem,

Your link doesn't work. Has at least one sample been processed? It would be odd for it to be stuck on just one sample after all this time.

warrenmcg avatar Sep 20 '17 19:09 warrenmcg

Sorry, for some reason the URL redirection is not working here. I think you need to type it into your browser: dpaste.com/386YKWB

teshomem avatar Sep 20 '17 21:09 teshomem

I tried the link http://dpaste.com/386YKWB, and I got your file.

For future reference, you can directly post a file onto GitHub using these instructions: link.

A few questions:

  1. Is your script still processing sample 1, or is it on a different sample? (hopefully the latter)
  2. What exactly is the setup of your environment (OS, etc.)? Is this on a cluster, or a workstation in the lab? Is it possible that there are restrictions on how much RAM and how many nodes/processors you can request?

warrenmcg avatar Sep 20 '17 21:09 warrenmcg

Hi @warrenmcg To answer your questions:

  1. How can I know which sample it was processing? I have now terminated the job. Here is the console output up to that point:

     $> Rscript run-slueth.R
     [1] "prepare guide design"
     [1] "prepare target mapping"
     Read 136077 rows and 5 (of 5) columns from 0.011 GB file in 00:00:03
     [1] "prepare sleuth"
     reading in kallisto results
     dropping unused factor levels
     .......................................................................................................................................................................
     normalizing est_counts
     51043 targets passed the filter
     normalizing tpm
     merging in metadata
     aggregating by column: ncbi_gene
     50163 genes passed the filter
     summarizing bootstraps
     ................................................. ................................

  2. It is a cluster managed by SLURM. The node I am working on is a CentOS 6.9 fat server with 80 cores and 1 TB of RAM. The restriction (which I chose) is 20 cores and 100 GB of RAM (5 GB per core), because I logged in with qlogin -n 20. I can request more RAM and cores if needed.

teshomem avatar Sep 20 '17 22:09 teshomem

FYI, I am now running it without the RAM restriction to see the difference.

teshomem avatar Sep 20 '17 22:09 teshomem

Ah, I see. The dots are the way to keep track of how many samples have been processed. So, before you killed the job, it had processed what looks like around 100 samples. That's about 9-10 samples per hour or 1 sample per 6-7 mins, which isn't unreasonable.

What version of sleuth are you running? If it is older than version 0.29, there was a huge memory footprint, which would explain why the job failed in parallel. If you are using the most up-to-date version, I would recommend using 10 GB per core to be safe. I will be interested to hear whether your script succeeds when there is no memory restriction.
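
(For reference, the installed version can be checked from R with the standard packageVersion() utility; nothing sleuth-specific is needed:)

# print the installed sleuth version
packageVersion("sleuth")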

warrenmcg avatar Sep 21 '17 00:09 warrenmcg

Hi @warrenmcg, the version of sleuth I am running is sleuth_0.29.0.

The job completed successfully with unlimited RAM. I also tried it with 10 GB of RAM per core, and it works as well. I think the problem was the RAM allocation per core.

Is it possible to report progress as the number of samples processed (ideally with the name of each sample) instead of dots? It would also be nice if the minimum RAM requirement per core were printed when sleuth_prep starts, so that users are aware of the situation.

Thank you for the help.

teshomem avatar Sep 21 '17 07:09 teshomem

Hi @teshomem,

I'm glad your problem was solved. I'm changing the title of this issue to reflect the real remaining task, which is the improvements you've requested.

Pinging @pimentel with two user requests:

  1. changing how progress is reported (a rough sketch of one possibility is shown below)
  2. reporting the minimum RAM required per sample. For this user, 5 GB per sample was not enough, but 10 GB was.
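
A rough, hypothetical sketch of how per-sample progress could be reported (process_sample() is a placeholder for the per-sample bootstrap summarization, not sleuth's actual internals; note that messages from parallel workers may interleave):

# report sample index and name instead of printing dots
summarize_with_progress <- function(sample_ids, process_sample, num_cores = 1) {
  parallel::mclapply(seq_along(sample_ids), function(i) {
    message(sprintf("summarizing bootstraps: sample %d/%d (%s)",
                    i, length(sample_ids), sample_ids[i]))
    process_sample(sample_ids[i])
  }, mc.cores = num_cores)
}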

warrenmcg avatar Sep 21 '17 13:09 warrenmcg

@warrenmcg & @pimentel: How about capping the default num_cores at available memory / 10 GB? How much does RAM usage depend on the number of bootstraps? Are there any guidelines on how many bootstraps are feasible with a larger sample size (say 200 human transcriptomes) and what the system requirements would be? IMHO it is kind of misleading to 'advertise' 'minutes on a laptop' without mentioning that this is for a handful of samples (and no/few bootstraps), when realistic applications might require hours on a server with hundreds of GB of RAM. :wink:
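
A rough, Linux-only sketch of what such a cap could look like (a hypothetical helper, not part of sleuth; it assumes a fixed per-core budget and reads MemAvailable from /proc/meminfo):

# cap the default number of cores at available memory / gb_per_core
suggest_num_cores <- function(gb_per_core = 10) {
  meminfo <- readLines("/proc/meminfo")
  avail <- grep("^MemAvailable:", meminfo, value = TRUE)
  mem_gb <- as.numeric(gsub("[^0-9]", "", avail)) / 1024^2  # kB -> GB
  max(1, min(parallel::detectCores(), floor(mem_gb / gb_per_core)))
}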

mschilli87 avatar Dec 02 '21 10:12 mschilli87