Final chromosome assembly 5 times shorter than expected

Open agroppi opened this issue 7 years ago • 10 comments

I have used Chromosomer following the brief guide: https://github.com/gtamazian/chromosomer/wiki/Brief-guide-to-Chromosomer-assembly-process

But at the end, in assembled_chromosomes.fa:

  • one chromosome is missing
  • the total size is about 5 times shorter than expected (55 Mb, whereas the reference genome is 220 Mb and the fragments total 224 Mb)
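
For anyone who wants to reproduce these two numbers, here is a minimal sketch that counts the sequences in a FASTA file and sums their lengths. It uses only the Python standard library; the function name `fasta_lengths` and the in-memory demo sequences are illustrative assumptions, not part of Chromosomer.

```python
import io


def fasta_lengths(handle):
    """Return {sequence_name: length} for a FASTA file-like object."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Tiny in-memory demo; for a real file, use open("assembled_chromosomes.fa").
example = io.StringIO(">chr1 demo\nACGTNN\nACG\n>chr2\nTTTT\n")
lengths = fasta_lengths(example)
print(len(lengths), "sequences,", sum(lengths.values()), "bp in total")
```

Running the same function on the reference and on the fragments gives the figures to compare against.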

Thanks for your help

agroppi avatar Oct 23 '17 10:10 agroppi

I would advise checking the alignments between the fragments and the reference chromosomes. If there are multiple short alignments, you may try to combine them into longer ones by modifying the aligner's parameters or by chaining the alignments. Another option is to reduce the ratio threshold of chromosomer fragmentmap from its default value of 1.2 to 1.1 or 1.05.
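
To get a feeling for how the threshold affects the number of usable fragments, a rough sketch follows. It assumes that the alignments are in the default 12-column BLAST tabular format (-outfmt 6, bit score in the last column) and that the ratio in question is the best alignment score of a fragment divided by its second-best score. Both points are assumptions to check against the output of chromosomer fragmentmap -h; the function names and the demo rows are hypothetical, and whether the real comparison is strict or not is not known here.

```python
from collections import defaultdict


def top_score_ratios(rows):
    """Map each fragment to (best bit score) / (second-best bit score).

    Fragments with a single alignment get an infinite ratio.
    """
    scores = defaultdict(list)
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue
        fields = row.rstrip("\n").split("\t")
        scores[fields[0]].append(float(fields[11]))  # column 12: bit score
    ratios = {}
    for fragment, values in scores.items():
        values.sort(reverse=True)
        if len(values) == 1 or values[1] == 0:
            ratios[fragment] = float("inf")
        else:
            ratios[fragment] = values[0] / values[1]
    return ratios


def count_passing(ratios, threshold):
    """Count fragments whose ratio reaches the threshold (assumed >=)."""
    return sum(1 for ratio in ratios.values() if ratio >= threshold)


def make_row(fragment, chromosome, bitscore):
    """Build a demo outfmt-6 row; only the fields used above matter."""
    return "\t".join([fragment, chromosome, "95.0", "1000", "50", "0",
                      "1", "1000", "1", "1000", "0.0", str(bitscore)])


demo = [
    make_row("fragA", "chr1", 1000), make_row("fragA", "chr2", 900),
    make_row("fragB", "chr1", 500), make_row("fragB", "chr3", 480),
    make_row("fragC", "chr2", 300),
]
ratios = top_score_ratios(demo)
for threshold in (1.2, 1.1, 1.05, 1.025):
    print(threshold, count_passing(ratios, threshold))
```

If the count rises sharply as the threshold is lowered, many fragments have two alignments with nearly equal scores, which is worth inspecting before trusting their placement.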

gtamazian avatar Oct 31 '17 16:10 gtamazian

Thanks for your advice. I'll try and will tell you how it goes

agroppi avatar Nov 08 '17 15:11 agroppi

@gtamazian Using a chromosomer fragmentmap ratio threshold of 1.05 greatly improved the final assembly: all chromosomes are present and the size is ~178 Mb.

The reference genome (a close species) contains 8 chromosomes and 183 unplaced scaffolds. Should I use the whole reference genome or only the 8 chromosomes (which is what I have done so far)?

Thanks again

agroppi avatar Nov 09 '17 11:11 agroppi

@agroppi To assess your current assembly, you may align it to the reference genome for each pair of chromosomes and visualize the obtained alignments. LASTZ can be used for performing whole-chromosome alignments: http://www.bx.psu.edu/~rsharris/lastz/. For alignment visualization, you may use filtered dot plots: https://gtamazian.com/2016/05/16/filtering-noise-in-lastz-dot-plots/.

Regarding your question on using the whole reference genome vs only the assembled reference chromosomes: using the whole genome is more appropriate because Chromosomer aims at locating fragments on the reference genome based on the highest-score homologies. Excluding the unplaced scaffolds might result in missing high-score alignments of the fragments to the scaffolds and wrongly placing the fragments on the chromosomes.

gtamazian avatar Nov 18 '17 15:11 gtamazian

Actually, after many tries, changing the parameters of blastn (from the defaults to -perc_identity 40 -word_size 28 -evalue 50) doesn't change anything. The only parameter that influences the final assembly is the ratio threshold of chromosomer fragmentmap (I tried 1.05 and even 1.025).

agroppi avatar Jul 02 '18 15:07 agroppi

The current version of Chromosomer uses a single best-score alignment as an anchor between a fragment and a reference chromosome to map the fragment to the chromosome. This approach proved to work well for closely related genomes but might fail for distant genomes. In your case, it seems that Chromosomer fails to establish reliable anchors between the fragments and the assembly.
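
To illustrate the idea of a single anchor, here is a simplified sketch of how one alignment can determine where a whole fragment lands on a chromosome. This is not Chromosomer's actual code: the function `place_fragment` is hypothetical, coordinates are assumed to be 1-based and inclusive as in BLAST tabular output (a subject start greater than the subject end meaning the minus strand), and gaps inside the alignment are ignored.

```python
def place_fragment(frag_len, q_start, q_end, s_start, s_end):
    """Infer the reference interval occupied by a fragment from one anchor.

    q_start/q_end are the anchor coordinates on the fragment,
    s_start/s_end the anchor coordinates on the reference chromosome.
    Returns (ref_start, ref_end, strand).
    """
    if s_start <= s_end:
        # Plus strand: extend left by the fragment bases before the anchor.
        ref_start = s_start - (q_start - 1)
        strand = "+"
    else:
        # Minus strand: the fragment is reverse-complemented, so the bases
        # after the anchor on the fragment lie to the left on the reference.
        ref_start = s_end - (frag_len - q_end)
        strand = "-"
    return ref_start, ref_start + frag_len - 1, strand


plus = place_fragment(1000, 101, 200, 5101, 5200)
minus = place_fragment(1000, 101, 200, 5200, 5101)
print(plus, minus)
```

The sketch shows why a short or misleading best alignment is costly in such a scheme: the position and orientation of the entire fragment follow from that one anchor.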

I would advise you to try the Satsuma package: http://satsuma.sourceforge.net/. It includes the Chromosembler tool that maps scaffolds or contigs onto chromosome coordinates using the syntenic alignment obtained with Satsuma.

gtamazian avatar Jul 03 '18 12:07 gtamazian

Thank you for your answer and this advice. It is strange to me, because I'm using very closely related genomes (peach vs apricot). I will try Satsuma. Thanks again

agroppi avatar Jul 03 '18 12:07 agroppi

One more consideration for your case: since you use plant genomes, unmasked genomic repeats within them might have split the alignments, leading to poor assembly by Chromosomer. The repeats can cause problems for Satsuma too, so I would suggest checking the alignments before launching Satsuma's Chromosembler. If you get numerous short-range alignments and few long-range alignments, some repeats have probably escaped masking before the alignment step.

gtamazian avatar Jul 03 '18 12:07 gtamazian

I have masked both (the reference and my draft assembly) with RepeatMasker 4.0.7: RepeatMasker -species arabidopsis -pa 20

How do you check the alignments (whether they are split into short-range alignments)?

agroppi avatar Jul 03 '18 12:07 agroppi

Although I am not a specialist in plant genomics, I suppose that RepeatMasker's library of arabidopsis repeats might miss repeats that are specific to your genomes. You might try WindowMasker and compare its repeats to the ones found by RepeatMasker.
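
One possible way to do such a comparison is sketched below. It assumes that both tools were run on the same sequence and that their results are available as masked FASTA sequences, with masked bases written either as N (hard masking) or in lowercase (soft masking). The function names are hypothetical, and N runs that were already present in the unmasked sequence would be counted as masked by both tools.

```python
def is_masked(base):
    """Treat N and lowercase letters as masked bases."""
    return base == "N" or base.islower()


def compare_masks(seq_a, seq_b):
    """Count positions masked in both, in only one, or in neither sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be masked versions of the same input")
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for base_a, base_b in zip(seq_a, seq_b):
        masked_a, masked_b = is_masked(base_a), is_masked(base_b)
        if masked_a and masked_b:
            counts["both"] += 1
        elif masked_a:
            counts["only_a"] += 1
        elif masked_b:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


# Made-up demo: seq_a as one tool's output, seq_b as the other's.
seq_a = "ACGTNNNNACGTacgt"
seq_b = "ACGTacgtacgtACGT"
counts = compare_masks(seq_a, seq_b)
print(counts)
```

A large share of positions masked by only one of the two tools would suggest that the repeat library is missing repeats specific to these genomes.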

To check the alignments, you may analyze the distribution of alignment lengths. Another option is to sort the alignments by their coordinates on the reference genome and plot their lengths versus their start positions.
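
A rough sketch of both checks follows, again assuming the default 12-column BLAST tabular format (-outfmt 6), where columns 9 and 10 hold the start and end on the reference. The function names and the 1000 bp `short_cutoff` are arbitrary illustrative choices, not values recommended by Chromosomer.

```python
import statistics


def alignment_spans(rows):
    """Return (reference, start, length) tuples sorted by reference position."""
    spans = []
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue
        fields = row.rstrip("\n").split("\t")
        s_start, s_end = int(fields[8]), int(fields[9])
        # Minus-strand hits have start > end, so normalize the coordinates.
        spans.append((fields[1], min(s_start, s_end), abs(s_end - s_start) + 1))
    spans.sort()
    return spans


def length_summary(spans, short_cutoff=1000):
    """Summarize alignment lengths; short_cutoff is an arbitrary example value."""
    lengths = [length for _, _, length in spans]
    return {
        "count": len(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "short_fraction": sum(l < short_cutoff for l in lengths) / len(lengths),
    }


def make_row(fragment, chromosome, s_start, s_end):
    """Build a demo outfmt-6 row; only the reference coordinates matter here."""
    return "\t".join([fragment, chromosome, "95.0", "500", "10", "0",
                      "1", "500", str(s_start), str(s_end), "0.0", "800"])


demo = [
    make_row("fragA", "chr1", 5000, 5499),
    make_row("fragB", "chr1", 100, 2099),
    make_row("fragC", "chr1", 9300, 9001),
]
spans = alignment_spans(demo)
summary = length_summary(spans)
print(spans)
print(summary)
```

The sorted (start, length) pairs can be passed to any plotting tool to draw the lengths against the start positions for each chromosome.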

gtamazian avatar Jul 03 '18 13:07 gtamazian