How does Chromosome scaffolder work?

Open NTNguyen13 opened this issue 4 years ago • 14 comments

Hi, the script has proven useful for putting the assembled contigs together and fixing their orientation, but I'm not clear about how it works. I understand the principle: after splitting the contigs, it aligns them to the reference and then uses the alignment to build the new assembly. However:

  • How does it treat clipped reads? Is the full read included as an insertion relative to the reference genome, or is only the aligned (clipped) portion kept?
  • When I disable the -M (fill) option, as I understand it the gap between two contigs is kept as Ns, so my new assembly and the reference genome should be similar in size. However, my original assembly is 2.8G, and after scaffolding it is still 2.8G (it grew by around 50-60Mb), which is not close to the 3.0G of the primary reference genome. So I think I have misunderstood this part. Could you please clarify?

Thank you very much.

NTNguyen13 avatar Dec 16 '20 10:12 NTNguyen13

Hi,

The reads are used to identify misassemblies in the assembly before scaffolding. First, the assembly is aligned to the reference and all alignment breakpoints are identified. Then the reads are aligned to the assembly and the coverage around the breakpoints is examined. Very low coverage (<3) or high coverage (>Gcov/ln(2)) around a breakpoint indicates an apparent misassembly. The contigs are then broken at the misassemblies and re-aligned to the reference for final placement.
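
The coverage test described above can be sketched roughly as follows. This is only an illustration, not the scaffolder's actual code: the function name, the breakpoint records, and the genome coverage value are hypothetical, and it assumes "very low" means coverage under 3 and "high" means coverage above Gcov/ln(2).

```python
import math


def is_suspect_breakpoint(local_cov, genome_cov, low_cutoff=3.0):
    """Flag a breakpoint whose local read coverage looks abnormal.

    The low cutoff (3) and the high cutoff (genome_cov / ln 2) follow the
    description in the thread; the function itself is a hypothetical sketch.
    """
    high_cutoff = genome_cov / math.log(2)
    return local_cov < low_cutoff or local_cov > high_cutoff


# Hypothetical data: (contig, position, mean read coverage near the breakpoint)
breakpoints = [
    ("ctg1", 120_000, 1.5),
    ("ctg1", 480_000, 31.0),
    ("ctg2", 75_000, 52.0),
]
genome_cov = 30.0  # assumed genome-wide mean coverage (Gcov)

# Contigs would be split only at the breakpoints collected here.
suspects = [bp for bp in breakpoints if is_suspect_breakpoint(bp[2], genome_cov)]
```

With a genome-wide coverage of 30, the high cutoff is about 43.3, so only the first and third breakpoints are flagged as candidate split points.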

The scaffolded assembly size, excluding gaps, should be close to the reference size if the original assembly size is close to the size of the reference excluding gaps. Try comparing the sizes after removing all N's, for example by running:

tr -d 'N' < reference.fa | ufasta n50 -a
tr -d 'N' < scafolded_genome.fa | ufasta n50 -a
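
If ufasta is not at hand, the same comparison (sequence length with N's excluded) can be approximated with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and it assumes plain uncompressed FASTA input and treats both upper- and lower-case N as gap characters.

```python
def non_gap_length(fasta_lines):
    """Count bases in FASTA lines, ignoring header lines and N/n gap characters."""
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip sequence headers
        seq = line.strip()
        total += len(seq) - seq.count("N") - seq.count("n")
    return total


def non_gap_length_of_file(path):
    """Same count for a FASTA file on disk, read line by line."""
    with open(path) as handle:
        return non_gap_length(handle)


# Small in-memory example: 4 + 2 real bases, 6 gap characters
example = [">chr1\n", "ACGTNNNN\n", "nnAC\n"]
example_length = non_gap_length(example)
```

Running non_gap_length_of_file on both the reference and the scaffolded assembly gives two numbers that can be compared directly.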

alekseyzimin avatar Dec 17 '20 17:12 alekseyzimin

Are the MaSuRCA results repeatable?

chiu-shenpo avatar Dec 18 '20 06:12 chiu-shenpo

Yes, the result should be deterministic.

-- Dr. Alexey V. Zimin Associate Research Scientist Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA (301)-437-6260 website http://ccb.jhu.edu/people/alekseyz/ blog http://masurca.blogspot.com

alekseyzimin avatar Dec 18 '20 13:12 alekseyzimin

Yes, the result should be deterministic.

No, I have tried using exactly the same config file for two runs (same config file, same command), but the resulting final.genome.fasta files were slightly different.

chiu-shenpo avatar Dec 18 '20 13:12 chiu-shenpo

I thought you were talking about chromosome scaffolder. The MaSuRCA runs are not deterministic, especially if you use Flye for contigging. The assemblies will differ from run to run, but not significantly.

alekseyzimin avatar Dec 18 '20 13:12 alekseyzimin

Thanks for your reply. May I ask how to show that the difference is not significant, and why MaSuRCA is designed to be non-deterministic?

chiu-shenpo avatar Dec 19 '20 01:12 chiu-shenpo

Hi, I have tried that and also ran the scaffolder with -M; the result is that the two have similar sizes. Thank you for clarifying.

And how about the soft/hard clipped reads? Are the clipped parts included in the final assembly?

NTNguyen13 avatar Dec 19 '20 01:12 NTNguyen13

The non-deterministic nature of MaSuRCA likely comes from the random thread execution order in the Flye assembler, which is used as a contigging engine in MaSuRCA. If you use the default CABOG assembler, the result should be deterministic up to overlap order. In contig building (the bog unitigger) there may be two different overlaps that have the same score, and whichever thread gets to an overlap first determines which overlap is chosen to continue the contig.
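
The tie situation described above can be illustrated with a toy example. This is not CABOG or Flye code: the overlap records and both functions are hypothetical. The first function mimics "whoever arrives first wins", and the second shows, purely for contrast, how a secondary sort key would make the choice independent of arrival order.

```python
def pick_first_best(overlaps):
    """Keep the first overlap seen with the highest score (arrival order matters)."""
    best = None
    for ov in overlaps:
        if best is None or ov["score"] > best["score"]:
            best = ov
    return best


def pick_stable_best(overlaps):
    """Break score ties by read id, so arrival order no longer matters."""
    return max(overlaps, key=lambda ov: (ov["score"], ov["read"]))


# Two hypothetical overlaps with identical scores
a = {"read": "r17", "score": 950}
b = {"read": "r42", "score": 950}

# Arrival order stands in for whichever thread reports its overlap first.
first_ab = pick_first_best([a, b])["read"]
first_ba = pick_first_best([b, a])["read"]

stable_ab = pick_stable_best([a, b])["read"]
stable_ba = pick_stable_best([b, a])["read"]
```

Here first_ab and first_ba differ even though the inputs are the same set of overlaps, which is the kind of run-to-run variation discussed in this thread, while the tie-broken version returns the same read in both orders.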

alekseyzimin avatar Dec 19 '20 02:12 alekseyzimin

To be honest, I have tried both multiple CPUs and a single CPU with Flye and with CABOG, and none of the results were deterministic. They were slightly different; if you can explain and show that the difference is not statistically significant, that would make things easier for me, lol. From what I see, this assembler is not repeatable, and determinism is important for experiments. I really hope the author can solve this problem, which has bothered me for several years.

chiu-shenpo avatar Dec 19 '20 02:12 chiu-shenpo

So, I guess there's no answer for this non-deterministic assembler.

chiu-shenpo avatar Dec 22 '20 05:12 chiu-shenpo

Hi, I'm still waiting for an answer to my question about the scaffolder. @chiu-shenpo I think your topic is totally different from what I'm asking; you should have opened another issue to avoid off-topic comments.

NTNguyen13 avatar Jan 14 '21 02:01 NTNguyen13

First, this author will avoid the questions he can't answer, so what are you expecting, lol. By the way, think about it: you can't even get deterministic results using this tool.

chiu-shenpo avatar Jan 14 '21 03:01 chiu-shenpo

Of course it matters, and I thank you for pointing that out. However, let's be clear here: if you open another issue, more people will become aware of the situation, rather than just me.

NTNguyen13 avatar Jan 14 '21 07:01 NTNguyen13

I confirm, MaSuRCA will not give deterministic results in most cases, because of the underlying assembly/scaffolding engine used. MaSuRCA uses either CABOG assembler or Flye assembler for final contigging/scaffolding, and neither of these assemblers is deterministic if run with more than 1 thread.

alekseyzimin avatar Jan 14 '21 15:01 alekseyzimin