Can CAMSA use multithreading or MPI?

Open AMMMachado opened this issue 6 years ago • 3 comments

Hi aganezov,

First of all, thank you for this promising program.

I processed some assemblies and scaffolds with fasta2camsa to create the points files, but it runs in a single thread and is quite slow: about 24 h so far. I ran the tutorials without problems, so apparently everything is OK.

My command:

fasta2camsa_points.py --nucmer-path /my/path/to/nucmer contigs.fasta scaff_Soap_k61.fasta scaff_Soap_k43.fasta scaff_Soap_k23.fasta scaff_Abyss_k61.fasta scaff_Abyss_k43.fasta scaff_Abyss_k23.fasta -o /test

These are several assemblies of a fish with a genome of roughly 1 Gb that is somewhat heterozygous. Can you recommend any additional parameters to improve the merging of the scaffolds and to speed up the analysis?

Thanks in advance, Andre

AMMMachado avatar May 29 '18 16:05 AMMMachado

Hi Andre,

thank you for your interest in CAMSA.

The core of CAMSA (i.e., the scaffold merging/comparative analysis) does not use any parallelization techniques, as even on large genomes it has demonstrated exceptional speed due to the heuristic nature of the underlying algorithm.

The issue that you seem to be experiencing lies in the preprocessing step that converts the raw FASTA files (i.e., contigs and scaffolds) into CAMSA-suitable input using the MUMmer aligner. When CAMSA was originally developed, I focused on the then-latest version of MUMmer (3.0), but since then, I believe, a newer version (4.0) has been released (refer to this documentation link). I believe the main bottleneck is the nucmer run (please correct me if I'm wrong, i.e., if the logger output shows that nucmer actually finished and later steps started executing). You can therefore try using the --nucmer-path flag to point to the executable of the newer nucmer, and the --nucmer-cli-arguments flag to tell nucmer to run in a multithreaded fashion.

Note that the default value of the --nucmer-cli-arguments flag is -maxmatch -c 100 (i.e., no multithreading/multiprocessing), but you can try to extend it to include thread/core-specific values.

Let me know if this helps, and whether you manage to use the newer version of MUMmer and/or its support for multithreaded execution, as I would love to add that to the description of the fasta2camsa_points.py utility.
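As a rough illustration of extending the default value, here is a minimal Python sketch that assembles such an invocation. The helper name, the nucmer path, the thread count, and the file list are all hypothetical. It also assumes a MUMmer 4 nucmer that expects double-dash long options (--maxmatch, --threads=N), so check nucmer --help for the version you have before relying on these spellings. The positional order (contigs first, then scaffolds) simply mirrors the command in the first post.

```python
import shlex

# Default value of --nucmer-cli-arguments as quoted in this thread.
DEFAULT_NUCMER_ARGS = "-maxmatch -c 100"


def build_fasta2camsa_command(nucmer_path, nucmer_args, contigs, scaffolds, output_dir):
    """Return the argv list for a fasta2camsa_points.py run (hypothetical helper)."""
    return [
        "fasta2camsa_points.py",
        f"--nucmer-path={nucmer_path}",
        # The "=" form keeps the whole nucmer argument string in one token.
        f"--nucmer-cli-arguments={nucmer_args}",
        contigs,
        *scaffolds,
        "-o",
        output_dir,
    ]


# Assumption: a newer nucmer that accepts these double-dash spellings.
threaded_args = "--maxmatch -c 100 --threads=8"

command = build_fasta2camsa_command(
    nucmer_path="/my/path/to/mummer4/bin/nucmer",  # hypothetical path
    nucmer_args=threaded_args,
    contigs="contigs.fasta",
    scaffolds=["scaff_Soap_k61.fasta", "scaff_Abyss_k61.fasta"],
    output_dir="/test",
)

# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(command))
```

Building the command as a list and quoting it with shlex.join avoids hand-written quoting mistakes around the space-separated nucmer arguments.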

Sincerely, Sergey Aganezov.

aganezov avatar May 29 '18 16:05 aganezov

I get this error because the option should be --mum, not -mum:

/home/data/bioinf_resources/programming_tools/mummer-3.9.4alpha/bin/nucmer: invalid option -- 'm'
Use --usage or --help for some help

I've seen this error in FALCON before, and this change would fix it for the latest nucmer versions. Where would I make the correction, please?

Running the first failing command by hand, with the option corrected, works. I just need to change my fasta2camsa_points.py command so that it passes the correct nucmer options. Any ideas?

/home/data/bioinf_resources/programming_tools/mummer-3.9.4alpha/bin/nucmer --mum --threads=30 -p /home/data/pest_genomics/CSFB/CSFB_genome/assembly705/fasta2camsa/assembly705_oneline assembly705_oneline.fasta 705_1000_split.fasta > /home/data/pest_genomics/CSFB/CSFB_genome/assembly705/fasta2camsa/logs/nucmer_assembly705_oneline.stdout.txt 2> /home/data/pest_genomics/CSFB/CSFB_genome/assembly705/fasta2camsa/logs/nucmer_assembly705_oneline.stderr.txt

Running the following:

fasta2camsa_points.py --nucmer-path /home/data/bioinf_resources/programming_tools/mummer-3.9.4alpha/bin/nucmer --nucmer-cli-arguments "--mum --threads=40" 705_1000_split.fasta assembly705_oneline.fasta CSFB560_pseudohap_main.fasta assembly627.fasta -o .

results in the command below. It seems the = is being parsed, which is annoying, so the 40 ends up at the end:

/home/data/bioinf_resources/programming_tools/mummer-3.9.4alpha/bin/nucmer --mum --threads -p /home/data/pest_genomics/CSFB/CSFB_genome/assembly705/fasta2camsa/705_1000_split 705_1000_split.fasta 40 > /home/data/pest_genomics/CSFB/CSFB_genome/assembly705/fasta2camsa/logs/nucmer_705_1000_split.stdout.txt 2> /home/data/pest_genomics/CSFB/CSFB_genome/assembly705/fasta2camsa/logs/nucmer_705_1000_split.stderr.txt

rob123king avatar Jul 18 '18 12:07 rob123king

Hello @rob123king, I'm sorry that you've run into this issue with CAMSA.

Please try using the = sign to assign values to named parameters in the script invocation. Your example should then be rewritten as:

fasta2camsa_points.py --nucmer-path="/home/data/bioinf_resources/programming_tools/mummer-3.9.4alpha/bin/nucmer" --nucmer-cli-arguments="--mum --threads=40" 705_1000_split.fasta assembly705_oneline.fasta CSFB560_pseudohap_main.fasta assembly627.fasta -o .

I believe this should solve the problem. Please let me know if this works!
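To illustrate why the = form is less ambiguous, here is a small sketch using Python's standard argparse. It is only a stand-in: the parser below is hypothetical, CAMSA's own argument handling may differ, and this does not reproduce the exact symptom reported above (the 40 moving to the end of the nucmer command). It only shows the general point that a value which itself looks like an option is safest when glued to its flag with =.

```python
import argparse
import contextlib
import io


def make_parser():
    # Hypothetical stand-in parser with only the two options discussed in this thread.
    parser = argparse.ArgumentParser(prog="fasta2camsa_points_sketch")
    parser.add_argument("--nucmer-path", default="nucmer")
    parser.add_argument("--nucmer-cli-arguments", default="-maxmatch -c 100")
    parser.add_argument("fasta", nargs="+")
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects the arguments."""
    try:
        # Silence the usage message argparse writes to stderr on failure.
        with contextlib.redirect_stderr(io.StringIO()):
            return make_parser().parse_args(argv)
    except (SystemExit, argparse.ArgumentError):
        return None


# Value glued to the flag: everything after the first "=" is taken as the value.
joined = try_parse(["--nucmer-cli-arguments=--threads=40", "contigs.fasta"])

# Value passed as a separate token that looks like an option itself.
separate = try_parse(["--nucmer-cli-arguments", "--threads=40", "contigs.fasta"])

print(joined.nucmer_cli_arguments if joined else "rejected")
print(separate.nucmer_cli_arguments if separate else "rejected")
```

With the standard library parser, the glued form should be accepted with the value intact, while the separate-token form should be rejected because the value is mistaken for another option.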

aganezov avatar Jul 18 '18 15:07 aganezov