progressiveCactus
Several months for alignment
I am aligning 5 assembled mammalian genomes, and the runtime is over 2 months at this point. It does not appear to have crashed, as the log file has lines from today. Is there any way to determine where the alignment is at, and how much longer it will take? Do you have recommendations for speeding up the process? Thanks, Juan
That's a very long time, maybe it's crashed or deadlocked somehow. How many cores are you running with? On our systems, with a binary guide tree, it takes roughly 120*(n - 1) CPU days for an alignment, n being the number of species. In wall-clock time it takes us about 0.5 to 1 day per species with a 500-1000 core cluster. If you use a star guide tree it'll take much longer, since that will scale quadratically rather than linearly.
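As a rough illustration of the rule of thumb above, here is a minimal sketch in Python. The function names are hypothetical, and the wall-clock figure assumes perfect scaling across cores, which is optimistic; treat the result as a lower bound rather than a prediction.

```python
def estimate_cpu_days(n_species, cpu_days_per_step=120):
    """CPU days for a binary guide tree: roughly 120 * (n - 1)."""
    if n_species < 2:
        raise ValueError("an alignment needs at least two species")
    return cpu_days_per_step * (n_species - 1)


def estimate_wall_days(n_species, cores):
    """Wall-clock days, assuming (optimistically) perfect scaling across cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return estimate_cpu_days(n_species) / cores


if __name__ == "__main__":
    for cores in (1, 60):
        days = estimate_wall_days(5, cores)
        print(f"5 species on {cores} core(s): about {days:.0f} days")
```

For 5 species this gives 120 * 4 = 480 CPU days, or about 8 days on 60 cores if the work parallelizes perfectly.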
We don't have a very clear way of telling the progress. Every so often the log will say something like:
Ingroup sequences: ['/cluster/home/jcarmstr/progressiveCactus/mammals1/work/jobTree/jobs/gTD4/tmp_VDNOVdId2j/renamedInputs/simMouse.chr6_1', '/cluster/home/jcarmstr/progressiveCactus/mammals1/work/jobTree/jobs/gTD4/tmp_VDNOVdId2j/renamedInputs/simRat.chr6_2']
which tells you that it has begun working on the (mouse,rat); part of the tree (the full tree may look like ((human,chimp),(mouse,rat)); for example). Some of the names may look like "Anc1", "Anc2", etc.--these are automatically-generated names for the ancestral reconstructions. We should probably log the progress in a less inscrutable way.
So hopefully there are at least a few instances of the "Ingroup sequences" log line by now; otherwise something has gone very wrong!
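Checking the log for these lines can be scripted. The sketch below is only an illustration: it assumes each entry sits on a single line and looks like the excerpt above (a bracketed, quoted list of paths after "Ingroup sequences:"). The function name and the short sample path are made up for the example.

```python
import posixpath
import re

# Matches the bracketed path list that follows "Ingroup sequences:".
INGROUP_RE = re.compile(r"Ingroup sequences: \[(.*?)\]")


def ingroup_names(log_text):
    """Return one list of input-file basenames per 'Ingroup sequences' entry."""
    entries = []
    for match in INGROUP_RE.finditer(log_text):
        paths = re.findall(r"'([^']+)'", match.group(1))
        entries.append([posixpath.basename(p) for p in paths])
    return entries


if __name__ == "__main__":
    # Hypothetical, shortened log line in the assumed format.
    sample = (
        "Ingroup sequences: ['/work/renamedInputs/simMouse.chr6_1', "
        "'/work/renamedInputs/simRat.chr6_2']\n"
    )
    for names in ingroup_names(sample):
        print("started subtree with ingroups:", ", ".join(names))
```

Counting the returned entries gives a crude progress indicator: each one marks a subtree of the guide tree that the run has started on.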
Hi Joel, Thank you for the quick response. I am running this alignment on a server with 64 cores; however, I ran it with the default parameters (single core). Based on your formula, the run should take 120*5=600 CPU days or 1.6 years on a single core! If I kill this job and run on 60 cores, should I have results in 10 days?
In terms of crashing, here are some excerpts from the log:
The first time reported in the log is:
Got message from job at time: 1476459103.32... (trimmed)
The last line in the log says:
Got message from job at time: 1482404519.58 : Blasting ingroups vs outgroups to file (trimmed)
I see two lines mentioning "Ingroup sequences":
Got message from job at time: 1477237433.84 : Ingroup sequences:
Got message from job at time: 1482404519.52 : Ingroup sequences:
How do I convert these times to YYYY-MM-DD hh:mm:ss ?
Thanks, Juan
Ah OK, if it is just running with the defaults, it makes sense that it would take so long. (We use low defaults to avoid overloading anyone's machine even if they are running on small data like bacteria/C. elegans). From the log it looks like it made progress, so it looks like the number of cores was the only issue. Yep, I think it should finish in roughly 10 days if you give it 60 cores. There can be some variability, but I certainly think something must be going wrong if it took 20, for example.
The timestamps are in UNIX time (seconds since 1970-01-01 UTC; online converters are available), so the first "Ingroup sequences" line was logged on Oct 23rd and the second on Dec 22nd (!). In our next release the timestamps will be in a more useful time format by default :)
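For converting the timestamps locally, a minimal sketch using only the Python standard library follows. The function name is hypothetical, and the output is in UTC; the local date can differ by a day depending on the time zone.

```python
from datetime import datetime, timezone


def unix_to_utc_string(timestamp):
    """Format a UNIX timestamp (seconds since 1970-01-01 UTC) as YYYY-MM-DD hh:mm:ss."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Timestamps quoted from the log excerpts in this thread.
    for ts in (1476459103.32, 1477237433.84, 1482404519.52):
        print(ts, "->", unix_to_utc_string(ts), "UTC")
```

Run on the values from the log, this places the start of the run on 2016-10-14 and the two "Ingroup sequences" lines on 2016-10-23 and 2016-12-22 (UTC).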
In terms of parameters to change in order to improve performance, can you let me know if I'm doing this right? I have a machine with 60 cores and 200GB RAM. How do I maximize use of this? I'm currently using the parameters: --maxThreads 60 --defaultMemory 200000000000000 Is that correct? Thanks, Juan
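One thing worth checking in the value above is its magnitude. The sketch below assumes the memory value is interpreted in bytes, which the thread does not state; it only does the unit arithmetic and says nothing about what the right setting is. The helper name is hypothetical.

```python
def gb_to_bytes(gigabytes):
    """Convert decimal gigabytes (10**9 bytes each) to bytes."""
    return gigabytes * 10**9


if __name__ == "__main__":
    machine_ram = gb_to_bytes(200)   # 200 GB expressed in bytes
    value_given = 200000000000000    # the value quoted in the post
    print("200 GB in bytes:", machine_ram)
    print("value given / 200 GB:", value_given // machine_ram)
```

If the unit is bytes, 200 GB is 200000000000 (2 followed by 11 zeros), so the quoted value is 1000 times larger, i.e. 200 TB.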