Different types of errors in scaffolding

Open dcopetti opened this issue 3 years ago • 3 comments

Hello,

I am scaffolding a Hifiasm assembly of a selfed tetraploid plant (~400 Mb assembly, contig N50 11.7 Mb, N90 2.5 Mb) against a reference created by merging the assemblies of the two diploid progenitors (not the actual parental plants, but the same species). Each contig should therefore map to the reference at a 1:1 ratio.

I ran the command with default settings: ragtag.py scaffold -t 3 -o MUR_H2_ragtag ref_tetraploid.fa contigs.fa. I notice several different behaviors in the output (I am sending you the files separately).

  • hal_chr1: the second contig (ptg000024l) is placed reverse-complemented (RC) on the pseudomolecules
  • hal_chr1: towards the 3' end of it, there is a contig that should go on lyr_chr1
  • hal_chr1 and hal_chr2: two contigs are split, half on chr1 and half on chr2. We decided to trust the new assembly, so no action is needed here
  • hal_chr5, 6, 7: the inversions/shuffling are fine
  • hal_chr8: concatenated with lyr_chr8 (too much background noise from the homeolog?)

Is there an explanation for placing a contig RC when not needed, or for swapping a contig to the other homeolog? I understand that the cases with a contig going to two chromosomes will confuse the scaffolding as well.

I could fix these issues by manually editing the AGP file, but that would defeat the purpose of RagTag. Is there a way to overcome such clear misplacements? Thanks,

Dario
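For reference, the manual AGP edit mentioned above could be scripted. The following is a minimal sketch, assuming a standard 9-column, tab-delimited AGP file in which column 6 holds the component (contig) name and column 9 holds its orientation. The function name `flip_component` and the toy coordinates are hypothetical; only the contig name ptg000024l comes from this thread.

```python
def flip_component(agp_lines, component_id):
    """Return AGP lines with the orientation of one component toggled.

    Assumes 9 tab-separated columns per record; gap rows (type N or U)
    and comment lines are passed through unchanged.
    """
    flip = {"+": "-", "-": "+"}
    out = []
    for line in agp_lines:
        fields = line.rstrip("\n").split("\t")
        is_comment = line.startswith("#")
        # Column 5 is the component type; column 6 is the component name.
        if (not is_comment and len(fields) == 9
                and fields[4] not in ("N", "U")
                and fields[5] == component_id):
            # Column 9 is the orientation; values other than +/- are kept.
            fields[8] = flip.get(fields[8], fields[8])
        out.append("\t".join(fields))
    return out


# Toy records with made-up coordinates, for illustration only.
agp = [
    "hal_chr1_RagTag\t1\t1000\t1\tW\tptg000001l\t1\t1000\t+",
    "hal_chr1_RagTag\t1001\t1100\t2\tU\t100\tscaffold\tyes\talign_genus",
    "hal_chr1_RagTag\t1101\t3100\t3\tW\tptg000024l\t1\t2000\t-",
]
edited = flip_component(agp, "ptg000024l")
print("\n".join(edited))
```

Toggling an orientation leaves all coordinates valid, so it is the simplest edit. Moving a contig to a different pseudomolecule would also require recomputing the object coordinates and part numbers of both affected scaffolds, which this sketch does not attempt.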

dcopetti, Apr 30 '21 08:04

Hi Dario,

I think the first thing to try is to use Nucmer. I suggest following the instructions in #48. If you think you need more specificity to distinguish between the homeologs, you can increase the values for -l and -c.
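As a rough illustration of this suggestion (not a reproduction of the instructions in #48), the sketch below assembles a scaffold command that switches the aligner to Nucmer and raises -l and -c. It assumes that `ragtag.py scaffold` accepts `--aligner` and `--nucmer-params`, which should be checked against `ragtag.py scaffold --help` for the installed version. The helper name, the output directory name, and the values 200 and 1000 are hypothetical and are not recommendations.

```python
import shlex


def ragtag_nucmer_cmd(ref, query, out_dir, min_match=100, min_cluster=500):
    """Build a `ragtag.py scaffold` command line that aligns with Nucmer.

    Assumption: --aligner and --nucmer-params are available in the
    installed RagTag version. -l and -c are Nucmer's minimum match
    length and minimum cluster length.
    """
    nucmer_params = f"--maxmatch -l {min_match} -c {min_cluster}"
    return [
        "ragtag.py", "scaffold",
        "--aligner", "nucmer",
        "--nucmer-params", nucmer_params,
        "-o", out_dir,
        ref, query,
    ]


# Illustrative values only; tune -l and -c for the data at hand.
cmd = ragtag_nucmer_cmd(
    "ref_tetraploid.fa", "contigs.fa", "MUR_H2_ragtag_nucmer",
    min_match=200, min_cluster=1000,
)
print(shlex.join(cmd))
```

The command is built as an argument list so that it can be handed directly to `subprocess.run(cmd, check=True)`, and the Nucmer parameter string stays a single argument without manual quoting.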

That might do the trick. If not, I would be happy to look at the data.

Thanks, Mike

malonge, Apr 30 '21 12:04

Hi Mike, I can't run RagTag with Nucmer right now because other jobs don't leave enough memory on the machine. In the meantime, I realized that I had many haplotigs (even though hifiasm did not label them as such), so I ran Purge Haplotigs. It removed many of the very short sequences that you can see aligning inside a larger contig (here some of them are added to the scaffold by RagTag): [image: Capture]

Re-running RagTag with only the Purge Haplotigs primary contigs and minimap2, I get a clearer scaffolding pattern [image: TSK_H2pchr_to_synref_minimap], where I would just move the last contig from hal_chr1 to hal_chr2 and reverse-complement hal_chr6 (quite arbitrary, though).

Could the many hits from the short contigs be confusing RagTag's algorithm?

dcopetti, Apr 30 '21 14:04

Hi there,

Good to see that purging haplotigs helped out. It's hard to say how these smaller contigs affected things without digging into the data a bit. Usually, alignment is the biggest factor for RagTag, which is why I usually recommend trying both Nucmer and Minimap2 for tougher applications.

Theoretically, I would expect haplotigs to appear as false duplications in the ragtag output, but I wouldn't expect any false translocations or inversions, as you described. So that makes me guess that improved alignments account for these improvements. But I could be wrong.
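One way to check whether alignments are pulling a contig toward the wrong homeolog is to tally, for each contig, how many query bases align to each reference sequence. The sketch below assumes a PAF file of contig-to-reference alignments (for example, one written by minimap2) with the standard 12 leading columns. The function name and the toy records are hypothetical, and overlapping alignments are counted more than once, so the totals are only a rough guide.

```python
from collections import defaultdict


def aligned_bases_by_target(paf_lines, min_mapq=0):
    """Sum aligned query bases per (contig, reference sequence) pair.

    Assumes PAF columns: 1 query name, 3-4 query start/end,
    6 target name, 12 mapping quality. Overlaps are not merged.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # skip malformed or truncated records
        query, target = f[0], f[5]
        q_start, q_end, mapq = int(f[2]), int(f[3]), int(f[11])
        if mapq >= min_mapq:
            totals[query][target] += q_end - q_start
    return {q: dict(t) for q, t in totals.items()}


# Toy PAF records with made-up coordinates, for illustration only.
paf = [
    "ptg000024l\t5000\t0\t3000\t+\thal_chr1\t100000\t1000\t4000\t2900\t3000\t60",
    "ptg000024l\t5000\t3000\t4000\t-\tlyr_chr1\t90000\t500\t1500\t900\t1000\t20",
    "ptg000099l\t800\t0\t800\t+\thal_chr1\t100000\t2000\t2800\t780\t800\t60",
]
summary = aligned_bases_by_target(paf)
for contig, per_target in summary.items():
    print(contig, per_target)
```

A contig whose aligned bases are split between hal and lyr sequences is a candidate for homeolog ambiguity. Raising `min_mapq` shows how much of that split survives once low-confidence alignments are dropped.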

Thanks, Mike

malonge, Apr 30 '21 16:04