Ragout
Ragout copied to clipboard
Alternative alignment tool
Good day, I'm currently working on the assembly of a large mammalian genome. I would like to perform the reference-assisted scaffolding using ragout. However, I cannot run progressive cactus on the genomes involved due to computational constraints, and the assembly are simply too large to be efficiently processed with sibelia.
My question is: is there a way to create alignments in a way that is suitable for Ragout2?
For example: pairwise lastz alignments -> combined multi-species MAF file -> maf2hal -> ragout2
Thank you in advance, Andrea
Hi Andrea,
It is possible to use .maf as input (I was planning to retire this functionality, but it is still in the code) - but the problem is that Ragout relies on properties of the Cactus alignments. In particular, alignment blocks should be non-intersecting - which is not true for most of the other aligners (since it is in fact a very hard problem to generate such alignments).
What are your constraints? Cactus should take ~120 CPU days per mammalian genome, which translates into a few actual days on a single cluster node.
There is a recent Sibelia update (https://github.com/medvedevgroup/SibeliaZ) that in fact should handle large genomes and is capable of producing the right maf alignment. I have not tested it yet - but you might try to use it to generate alignment and run Ragout with it.
Best, Mikhail
Hi Mikhail, Thank you for your reply. My problem with cactus is not related to computational power/time, but to incompatibilities with toil and the cluster, that prevent it from running properly. I think that using sibeliaZ seems a good solution. I guess that you would proceed as follow: sibeliaZ > maf > hal > ragout
Am I right? Thank you in advance, Andrea
You can use maf directly by adding an extra recipe parameter .maf = path_to_maf
, and using -s maf
switch.
I've just tried to run Ragout on the alignments produced using SibeliaZ (maf format) using the genomes provided in example/E.Coli folder. I've changed the header of the respective fasta so that it shows the organism in the format "dh1.gi", "ms1655.seq1", etc. The maf file is properly parsed, even if raising errors of overlapping blocks, and finally crashes while refining the assembly graph. I though that the problem was due to the presence of self-alignments within the same assembly. But even after filtering those out from the MAF file, the result doesn't change, still crashing while detecting chimeric adjacencies. Below, you can find the log produced, the error, the configuration file, the two fai with the names used and the top 100 lines of the sibeliaZ maf file. I hope these information can be of use to understand the problem. Thank you in advance for your help, Andrea
[16:09:31] DEBUG: >>With block size: 5000
[16:09:31] DEBUG: Checking 2 target adjacencies
[16:09:31] DEBUG: Found 1 target-specific adjacencies, 1 broken as chimeric
[16:09:31] DEBUG: >>With block size: 500
[16:09:31] DEBUG: Checking 16 target adjacencies
[16:09:31] DEBUG: Found 3 target-specific adjacencies, 1 broken as chimeric
[16:09:31] DEBUG: >>With block size: 100
[16:09:31] DEBUG: Checking 105 target adjacencies
[16:09:31] DEBUG: Found 11 target-specific adjacencies, 3 broken as chimeric
[16:09:31] DEBUG: >>With block size: 100
[16:09:31] DEBUG: Checking 105 target adjacencies
[16:09:31] DEBUG: Found 11 target-specific adjacencies, 3 broken as chimeric
Traceback (most recent call last):
File "/exports/applications/apps/community/roslin/ragout/2.1.1/bin/ragout", line 35, in
ragout.log testRagout.rcp.txt dh1.fasta.fai.txt mg1655.fasta.fai.txt top100.maf.txt
Looks like there are some caveats with using SibeliaZ. I will be testing it in the near future.
Hi, The error found was due to the input fasta files provided. Basically, the input fasta that I provided included the species name in each sequence header (e.g. >spp1.chr1), but ragout needed them as simple sequence name (e.g. >seq1). After using this fasta file, the program worked without issues.
The only thing that concerns me is this warning:
Too few overlaps (18) between contigs were detected -- refine procedure will be useless
Is there a way to improve this? Thank you in advance,
Andrea
Great, I'm glad that you got it working! Did it complain on the synteny block coverage?
Too few overlaps (18) between contigs were detected -- refine procedure will be useless
This step is optional - it only works good with certain assemblers (like SPAdes) that output contigs with overlapping ends. We usually don't use --refine
option for large genomes.
hi, I have similar issues. I was using sibeliaz as well for the alignment of large genomes and now I am trying to run ragout with the maf file. I was following the instructions above, adding the recipe parameter .maf and the -s maf switch, and I also changed the headers in a simple format like that
scaffold4594 but ragout always breaks after some time with the following error ERROR: permutation ids do not follow naming convention: 'genome.chromosome'
then I tried to change the headers in a format like
WB42.scaffold1 WB42.scaffold6 WB42.scaffold9 WB42.scaffold13
but then that error appeared:KeyError: 'scaffold830'
Do I name the headers still in a wrong way or does anybody have any ideas how to solve that problem? Thank you in advance
if helpful or necessary the code I used for ragout: ragout -o ragout_out -s maf -t 10 recipe_file_for_ragout.txt
Hi,
It's a bit tricky - the MAF alignment should have the genome prefixes, but the FASTA files - should not. I think in your case it will be enough to remove the prefixes from fasta, also make sure that genome names are matching with ones provided in the recipe.
Let me know if this helps. Hopefully in the near future I will properly integrate SibeliaZ into the pipeline.
Hi Mikhail,
thank you so much for your reply! Ragout finished successfully after removing the prefixes from the fasta files.
Best regards
Dear Mikhail!
I'm dealing with number of green algae genome. Ragout w/ Sibelia (single-threaded) is running for days. I tried to use SibeliaZ and got MAF alignments. My refences are from NCBI (GCF* GCA* FASTA files), target assembly made with SPAdes.
What I need to do with input files (FASTA/alignment) to use them with Ragout?
How to understand following error (debug don't give a hint)? ERROR: No sequences read for genome c2. Check recipe for correctness. File exists, recipe looks like:
.references = c2
.target = s68
.maf = sz_395/alignment.maf
s68.fasta = algs68.fasta
c2.fasta = GCA_001021125.1_ASM102112v1_genomic.fna
Ok, it seems that you have found the .maf
option :)
The sequence headers in MAF should be formatted as genome_name.contig_name
, and I believe SibeliaZ just outputs contig_name
. So you'd need to add genome names to MAF (according to the naming in the recipe).
Sorry for the inconvenience! SibeliaZ support is in the plans, but I didn't get a chance to work on that yet.