
resume/continue job

Open xvazquezc opened this issue 9 years ago • 4 comments

Hi, I've just installed GraftM on our cluster. I got GraftM (graft) to run, but it crashed because it ran out of allocated RAM.

```
02/11/2016 01:16:57 PM INFO: Working on 028-LFA_S1_R1
02/11/2016 01:16:57 PM INFO: Working on forward reads
02/11/2016 01:32:42 PM INFO: Found 573 read(s) that may be eukaryotic
02/11/2016 01:33:31 PM INFO: 10659 read(s) detected
02/11/2016 01:33:31 PM INFO: aligning reads to reference package database
02/11/2016 01:36:12 PM INFO: Filtered 1788 short sequences from the alignment
02/11/2016 01:36:12 PM INFO: 8871 sequences remaining
02/11/2016 01:36:12 PM INFO: Working on reverse reads
02/11/2016 01:50:51 PM INFO: Found 576 read(s) that may be eukaryotic
02/11/2016 01:51:40 PM INFO: 10606 read(s) detected
02/11/2016 01:51:40 PM INFO: aligning reads to reference package database
02/11/2016 01:54:20 PM INFO: Filtered 1782 short sequences from the alignment
02/11/2016 01:54:20 PM INFO: 8824 sequences remaining
02/11/2016 01:54:20 PM INFO: Placing reads into phylogenetic tree
=>> PBS: job killed: vmem 440318926848 exceeded limit 42949672960
```

I realised that there is no option for resuming or continuing a crashed job. I tried to re-run the same command, but it stops to avoid overwriting the existing output directory. This is the message:

```
Traceback (most recent call last):
  File "/home/z3382651/bin/mypythondir/mypythonenv/mypythonenv/bin/graftM", line 345, in <module>
    Run(args).main()
  File "/home/z3382651/bin/mypythondir/mypythonenv/mypythonenv/lib/python2.7/site-packages/graftm/run.py", line 526, in main
    self.graft()
  File "/home/z3382651/bin/mypythondir/mypythonenv/mypythonenv/lib/python2.7/site-packages/graftm/run.py", line 238, in graft
    self.args.force)
  File "/home/z3382651/bin/mypythondir/mypythonenv/mypythonenv/lib/python2.7/site-packages/graftm/housekeeping.py", line 88, in make_working_directory
    raise Exception('Directory %s already exists. Exiting to prevent over-writing'% directory_path)
Exception: Directory graftm already exists. Exiting to prevent over-writing
```

Is there any way to resume, or at least a way to estimate how much memory will be used?

Thank you in advance, Xabier

xvazquezc avatar Feb 11 '16 03:02 xvazquezc

Hey Xabier

Firstly thank you for your interest in GraftM!

The step that uses the most memory in GraftM is pplacer. The more sequences within a GraftM package, the more memory is required: the pplacer publication demonstrates that memory requirements are linear in the number of taxa in the reference tree (Fig. 3). We're currently working on estimating the memory usage you can expect for each of the 16S rRNA GraftM packages and will get back to you on this. May I ask which specific GraftM package you were using for this run?
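In the meantime, you can get a rough number from a linear fit. Here's a minimal sketch of that estimate; the coefficients below are illustrative placeholders, not published constants, so you'd want to fit them from one or two small test runs on your own packages:

```python
def estimate_pplacer_memory_gb(n_taxa, gb_per_taxon=4e-4, base_gb=1.0):
    """Rough linear estimate of pplacer peak memory.

    Assumes memory grows linearly with the number of taxa in the
    reference tree (pplacer paper, Fig. 3). The default coefficients
    are illustrative placeholders -- fit them from a couple of test
    runs before trusting the numbers.
    """
    return gb_per_taxon * n_taxa + base_gb

# With these placeholder coefficients, a GreenGenes-97-sized tree
# (~100,000 taxa) comes out at roughly 41 GB:
print(estimate_pplacer_memory_gb(100_000))
```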

Unfortunately there is currently no way of resuming a failed GraftM run. Hopefully in most instances the time to re-run graftM graft isn't too long. To overwrite the output of the previous run you can use the --force flag, along the lines of the sketch below.
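For example, re-running the crashed sample might look something like this; the read and package file names are placeholders, so substitute your own, and double-check the flags against graftM graft --help for your version:

```python
import subprocess

# Placeholder file names -- substitute your own reads and package.
cmd = [
    "graftM", "graft",
    "--forward", "028-LFA_S1_R1.fq.gz",
    "--reverse", "028-LFA_S1_R2.fq.gz",
    "--graftm_package", "gg_97_otus.gpkg",
    "--output_directory", "graftm",
    "--force",  # overwrite the partial output left by the crashed run
]
subprocess.check_call(cmd)
```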

Thanks again Xabier, you'll hear from us soon.

Joel

geronimp avatar Feb 11 '16 23:02 geronimp

Hi Joel, I'm using GraftM 0.9.4. If it helps, with 12 threads it went over 450 GB of memory with the GreenGenes 97 package. Thanks

PS: I think you have the wrong pplacer indicated in the README

xvazquezc avatar Feb 12 '16 00:02 xvazquezc

Hey Xabier,

So we've traced this down to an issue in the way memory usage is reported for pplacer. The memory usage of the 97% GreenGenes package should range from 33 to 40 GB, depending on the number of threads used by the run. Unfortunately pplacer reports the memory allocated to the whole run as the total memory of each individual thread, meaning that the memory usage measured by PBS (and top) is the real memory usage multiplied by the number of threads. So in your case the actual amount of memory used was likely 450/12 ≈ 37.5 GB.
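To put the correction in code form (a trivial sketch, using the numbers from your run):

```python
def real_pplacer_memory_gb(reported_gb, n_threads):
    """Undo pplacer's per-thread double counting: PBS and top see each
    thread as owning the whole allocation, so the reported figure is
    roughly the real usage multiplied by the thread count."""
    return reported_gb / n_threads

print(real_pplacer_memory_gb(450, 12))  # -> 37.5, inside the expected 33-40 GB
```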

In the short term, a workaround would be to specify fewer threads overall to get past the memory cap. In the longer term we will raise an issue with pplacer and look at implementing a separate --pplacer_threads flag with which you could specify the number of threads used at this step.
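Since the scheduler effectively accounts for real usage multiplied by the thread count, you can back out the largest thread count that stays under your vmem limit. A sketch, reusing the ~37.5 GB estimate from above:

```python
def max_threads_under_cap(cap_gb, real_usage_gb):
    """Largest thread count whose accounted memory
    (real_usage * threads) stays under the scheduler's vmem cap;
    at least one thread is always required."""
    return max(1, int(cap_gb // real_usage_gb))

# With a 40 GB cap and ~37.5 GB of real pplacer usage, only a single
# thread keeps the reported figure under the limit:
print(max_threads_under_cap(40, 37.5))  # -> 1
```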

Apologies for the delayed reply on this one,

Joel

geronimp avatar Feb 15 '16 04:02 geronimp

Hi Joel, just following up on the memory requirements. I have quite a few samples, so I'm doing a bit of testing with the minimum number of cores (i.e. 1), keeping all other parameters the same, so that I can run more jobs in parallel (there is a cap on how many resources I can use at a time on the cluster, and 450 GB is a lot, as it has to be requested across several nodes). In this case, with a single core, the job was requesting over 66 GB at the pplacer step (way more than expected; it crashed because of it). I guess the memory requirements aren't very linear.

xvazquezc avatar Feb 17 '16 23:02 xvazquezc