shasta icon indicating copy to clipboard operation
shasta copied to clipboard

how to collapse trio-binned ONT reads

Open COMInterop opened this issue 1 year ago • 3 comments

Hello,

I have a question which could continue from #200

I have low coverage of a repetitive plant genome and I would like to optimize the parameters.

My situation is similar to the Lolium because I have only 34x ONT coverage with N50 23.6kb.

And, it is different because I do trio-binning with short reads from parents. Therefore I have half coverage, yes much less than optimum, but I hope to have nice result.

Therefore, since reads are phased, I want to ask how to tell Shasta that I want haploid output.

The program makes --MarkerGraph.minCoverage to 5. Is it realistic to make this lower, like 3?

Also I have question about --ReadGraph.strandSeparationMethod. Is this relevant for my case?

Thank you.

COMInterop avatar Jul 26 '22 00:07 COMInterop

If I understand correctly you want to assemble the two haplotypes separately, and you have approximately 17x for each haplotype. Can you confirm?

Also, what Guppy version and accuracy option did you use to base call these reads? Ideally, it would be Guppy 5.0.7 or later with "super" accuracy. If this is the case, see here for some suggestions about selecting assembly configurations. In the assembly configurations for haploid assembly summarized there, --MarkerGraph.minCoverage is set to 0, which means that it gets selected automatically. But it is certainly plausible to manually set it to a small value like 3 if you have low coverage.

If you have older reads, please let me know what Guppy version you used and confirm the coverage per haplotype. With that information I should be able to make suggestions.

As an additional option, you could also consider Shasta phased assembly, passing the entire set of reads as input. The page I linked above contains some suggestions for phased assembly too. I can give you more information on that once you confirm the Guppy version and coverage.

paoloczi avatar Jul 26 '22 01:07 paoloczi

Yes, I want to assemble the separate haplotypes and I have 17x for each.

I use Guppy 5.1.12, so I set consensus caller to 5.0.7-b.

When I assemble all reads with haploid option I get genome that is 99% size of reference. But when I assemble with phased option I get 65%. At present with binned reads I get 77% of reference size, the data are very similar for the both haplotypes. Then I think it is usable already but also I want more contiguity.

When I compare Nanopore-Human-SingleFlowcell-May2022 with Nanopore-Plants-Apr2021 I see these are different:

#Nanopore-Plants maxSkip = 60 maxDrift = 20 maxTrim = 60 minAlignedMarkerCount = 200 minAlignedFraction = 0.3

#Nanopore-Human-Singleflowcell maxSkip = 30 maxDrift = 15 maxTrim = 30 minAlignedMarkerCount = 200 minAlignedFraction = 0.6

So Plants is less stringent already? Thank you very much, it is very nice to have a fast tool like this one.

El lun, 25 jul 2022 a la(s) 20:23, paoloczi @.***) escribió:

If I understand correctly you want to assemble the two haplotypes separately, and you have approximately 17x for each haplotype. Can you confirm?

Also, what Guppy version and accuracy option did you use to base call these reads? Ideally, it would be Guppy 5.0.7 or later with "super" accuracy. If this is the case, see here https://chanzuckerberg.github.io/shasta/Configurations.html for some suggestions about selecting assembly configurations. In the assembly configurations for haploid assembly summarized there, --MarkerGraph.minCoverage is set to 0, which means that it gets selected automatically. But it is certainly plausible to manually set it to a small value like 3 if you have low coverage.

If you have older reads, please let me know what Guppy version you used and confirm the coverage per haplotype. With that information I should be able to make suggestions.

As an additional option, you could also consider Shasta phased assembly, passing the entire set of reads as input. The page I linked above contains some suggestions for phased assembly too. I can give you more information on that once you confirm the Guppy version and coverage.

— Reply to this email directly, view it on GitHub https://github.com/chanzuckerberg/shasta/issues/293#issuecomment-1194862011, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABV6OWSOR2GTSYFJ6BIVWUDVV442XANCNFSM54UD4MCA . You are receiving this because you authored the thread.Message ID: @.***>

COMInterop avatar Jul 26 '22 13:07 COMInterop

Yes, I want to assemble the separate haplotypes and I have 17x for each.

Thank you for confirming that. That is low coverage but, with some tweaking of assembly parameters, should still be sufficient to generate a good quality assembly.

I use Guppy 5.1.12, so I set consensus caller to 5.0.7-b.

Did you specify "super" accuracy when running Guppy? The reads generated that way are much better. The 5.0.7-b Bayesian caller and all of the May2022 configurations assume "super" accuracy. If you ran the base caller without "super" accuracy I suggest regenerating the reads. I know it is a pain, but it will be worth the trouble. I don't know what the option for that is, but you should be able to find it in Guppy documentation.

When I assemble all reads with haploid option I get genome that is 99% size of reference. But when I assemble with phased option I get 65%.

Yes, phased assembly really only works well with Ultra-Long (UL) reads. Any chance that you might be able to switch to a UL protocol? That gives read N50s between 50 and 100 Kb and allows phased assembly to do a much better job.

At present with binned reads I get 77% of reference size, the data are very similar for the both haplotypes. Then I think it is usable already but also I want more contiguity.

You can probably improve contiguity and the amount of sequence assembled with some tweaking of assembly parameters. But it will probably require some iteration because we don't have a preexisting configuration for your situation. I can help you with this process.

When I compare Nanopore-Human-SingleFlowcell-May2022 with Nanopore-Plants-Apr2021 I see these are different:

#Nanopore-Plants maxSkip = 60 maxDrift = 20 maxTrim = 60 minAlignedMarkerCount = 200 minAlignedFraction = 0.3

#Nanopore-Human-Singleflowcell maxSkip = 30 maxDrift = 15 maxTrim = 30 minAlignedMarkerCount = 200 minAlignedFraction = 0.6

So Plants is less stringent already?

As you can read in the comments for Nanopore-Plants-Apr2021, that configuration was optimized for reads generated with Guppy 4, while the May2022 configurations are optimized for Guppy 5 with "super" accuracy. And the May2022 configurations use longer k-mers (--Kmers.k 14 versus the default 10), so those assembly parameters are not really comparable.

Given all of the above, I suggest the following. First, redo the base calling using "super" accuracy if you haven't already. I know it is a painful and expensive step, but it will be totally worth it give the huge improvement in read quality. Once you have "super" accuracy reads, using the Nanopore-Human-SingleFlowcell-May2022 configuration as your starting point is a good idea, given that is was designed for low coverage. See what you get with that, and then post here AssemblySummary.html and we can go from there.

Thank you very much, it is very nice to have a fast tool like this one.

This should make it easier to iterate assembly parameters.

paoloczi avatar Jul 26 '22 15:07 paoloczi

Hello,

Yes I have super-accuracy by guppy 5.1.14.

I find that Nanopore-Plants-May2021 very outperforms Nanopore-Human-SingleFlowcell and also Nanopore-May2022.

These data show that the automatic parameters in Nanopore-May2022 are close to Plants, except not minAlignedMarkerCount. If I insert its derived value into Plants, the N50 suffers:

base config Np-May2022 Plants Plants
       
Length threshold 5000 5000 5000
       
k 10 10 10
       
minAlignedMarkerCount 85 200 85
minAlignedFraction 0.335 0.3 0.3
maxSkip 41 60 60
maxDrift 15 20 20
maxTrim 59 60 60
       
longest segment 2.9 Mb 3.0 Mb 3.3 Mb
segment N50 203 kb 422 kb 380 kb
       
       
Elapsed time (minutes) 10.4 29.4 29.8

I think with small introns 85 markers is insufficient to resolve abundant repeats.

BUSCO is 90% before polishing so I think is good enough. Thank you for time, now I have question about DeepVariant for separate.

COMInterop avatar Aug 26 '22 19:08 COMInterop