Updated output documentation for yak triobin

Open williamrowell opened this issue 4 years ago • 10 comments

https://github.com/lh3/yak/blob/6de3affe265bb0508ed6d78b36f121bdaf796f71/triobin.c#L176

Do you have any updated documentation for the output of yak triobin? I'm looking at the output of version r43, which has 13 columns as opposed to the 10 columns documented in the help text. I'm especially trying to understand column 2, which has values m, p, a, and 0.

williamrowell avatar Mar 24 '20 17:03 williamrowell

  • m=mother
  • p=father
  • a=ambiguous
  • 0=noCount/ambiguous
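As a quick sanity check on a run, the column 2 labels can be tallied. The following is a minimal sketch, assuming whitespace-separated output with the label in the second field; the function name `tally_labels` and the sample lines are made up for illustration and are not real yak output.

```python
from collections import Counter

# Meaning of the column 2 labels, as listed above.
LABELS = {"m": "mother", "p": "father", "a": "ambiguous", "0": "noCount/ambiguous"}

def tally_labels(lines):
    """Count how often each column 2 label occurs (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:  # skip blank or truncated lines
            continue
        counts[fields[1]] += 1
    return counts

# Made-up lines; real triobin output has more columns than shown here.
sample = ["read1 m 0 12", "read2 p 15 0", "read3 a 4 5", "read4 0 0 0", "read5 m 1 30"]
counts = tally_labels(sample)
for label, n in sorted(counts.items()):
    print(label, LABELS.get(label, "unknown"), n)
```

For a real file, an open file handle can be passed instead of the list, since the function only iterates over lines.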

lh3 avatar Mar 25 '20 16:03 lh3

Thanks for the quick answer! That's what I guessed, but wanted to make sure before proceeding. Thanks for the tool!

williamrowell avatar Mar 25 '20 16:03 williamrowell

Forgot to say that you can ignore most of the other columns. Those are mostly for debugging purposes.

lh3 avatar Mar 25 '20 16:03 lh3

Dear @lh3,

We are testing out trio binning and it looks like our binned assemblies are more fragmented than the non-binned assemblies. Both haplotypes have good coverage. Is there a way to adjust the trio binning step to be more specific, i.e., to require more p/m markers?

What is the meaning of these options?:

  -c INT     min occurrence [2]
  -d INT     mid occurrence [5]

Do you have any suggestions for improving binning at the counting stage?

zeeev avatar Dec 02 '20 14:12 zeeev

By default, if a k-mer occurs 5 times or more in the mother but occurs twice or less in the father, the k-mer is considered to be a mother-specific k-mer. The label in the 2nd column is determined from the rest of the columns under complex rules coded in the function tb_classify(). You can't tune these rules on the command line.
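The default threshold rule for a single k-mer can be sketched as follows. This is only an illustration of the sentence above, not yak's code: `classify_kmer` is a hypothetical name, the father-specific case is assumed to be the mirror image of the mother-specific one, and the exact boundary comparisons in yak may differ. It does not reproduce the read-level rules in tb_classify().

```python
def classify_kmer(mother_count, father_count, min_occ=2, mid_occ=5):
    """Label one k-mer from its parental counts (illustrative sketch).

    min_occ and mid_occ stand in for the -c and -d options, with their
    default values of 2 and 5.
    """
    # Frequent in the mother, rare in the father: mother-specific.
    if mother_count >= mid_occ and father_count <= min_occ:
        return "mother-specific"
    # Assumed symmetric rule for the father.
    if father_count >= mid_occ and mother_count <= min_occ:
        return "father-specific"
    return "uninformative"

print(classify_kmer(5, 2))
print(classify_kmer(0, 7))
print(classify_kmer(5, 3))
```

Under these assumptions, raising `mid_occ` or lowering `min_occ` would make fewer k-mers count as parent-specific.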

It is hard to get perfect trio binning. Hifiasm effectively uses the HiFi assembly graph to fix binning errors. Without doing that, hifiasm would only get ~10Mb N50, comparable to trio HiCanu.

lh3 avatar Dec 02 '20 15:12 lh3

For a simple way to increase specificity:

awk '$3>=21&&$4<=2&&$2=="p"' triobin.txt > paternal.txt
awk '$4>=21&&$3<=2&&$2=="m"' triobin.txt > maternal.txt
# the rest are ambiguous
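The same filter can be written so that the ambiguous remainder is collected explicitly. This is a rough Python equivalent of the two awk commands, assuming whitespace-separated fields and numeric values in columns 3 and 4; `bin_reads` and the sample lines are hypothetical, and the thresholds 21 and 2 are simply the ones used above.

```python
def bin_reads(lines, min_support=21, max_other=2):
    """Split triobin lines into paternal, maternal and ambiguous bins."""
    bins = {"paternal": [], "maternal": [], "ambiguous": []}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:  # skip blank or truncated lines
            continue
        label = fields[1]
        col3, col4 = float(fields[2]), float(fields[3])  # awk's $3 and $4
        if label == "p" and col3 >= min_support and col4 <= max_other:
            bins["paternal"].append(line.rstrip("\n"))
        elif label == "m" and col4 >= min_support and col3 <= max_other:
            bins["maternal"].append(line.rstrip("\n"))
        else:
            # Everything that fails the stricter test is left unassigned.
            bins["ambiguous"].append(line.rstrip("\n"))
    return bins

# Made-up lines for illustration only.
sample = ["r1 p 30 1", "r2 m 0 25", "r3 p 10 1", "r4 a 3 3"]
bins = bin_reads(sample)
print({name: len(rows) for name, rows in bins.items()})
```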

lh3 avatar Dec 02 '20 15:12 lh3

Hi @lh3,

Thank you for sharing these ideas. Just to confirm: you think trio binning isn't as effective as just assembling and phasing in a single genome? That has been my experience, at least using yak and hifiasm/IPA.

zeeev avatar Dec 02 '20 15:12 zeeev

Yes, when HiFi phasing and trio phasing are inconsistent, HiFi phasing is often the correct one.

lh3 avatar Dec 02 '20 15:12 lh3

In the early days, we tried HiCanu trio binning. I manually inspected many differences between HiCanu and yak binning. I think yak is generally more accurate. Nonetheless, the assembly with HiCanu binning is similar to the assembly with yak binning.

lh3 avatar Dec 02 '20 15:12 lh3

Also, hifiasm applies trio binning to error-corrected reads. This noticeably improves the binning accuracy: there are far fewer inconsistencies between trio phasing and HiFi read phasing.

lh3 avatar Dec 02 '20 15:12 lh3