
So many duplications.

Open paul-bio opened this issue 2 years ago • 20 comments

Hello, I recently performed a de novo genome assembly from HiFi sequencing data using hifiasm.

First, thank you for letting us use this wonderful tool. When I ran it for the first time, I got a lot of duplicated genes in dip.p_ctg.fa (which seems to be the diploid assembly).

Here are the results I got from the BUSCO analysis.

For hap1.p_ctg.fa (22,298 contigs): image

For hap2.p_ctg.fa (16,918 contigs): image

For dip.p_ctg.fa (39,216 contigs): image

It seems that genes are lost in hap1.p_ctg.fa, but when I look at dip.p_ctg.fa, I get a reliable result.

So I ran purge_haplotigs on the dip.p_ctg.fa file and got this: image

(P.S. Since my sample was small, I pooled tissue from 6 individuals for HiFi sequencing.)

Do you have any suggestions? It would be best if I could use the result from dip.p_ctg.fa...

Since a related species had a heterozygosity rate of 2.41%, should I rerun with -l3 and lower the -s value from 0.75 to 0.55?

Thanks anyway.

paul-bio avatar Jul 01 '22 12:07 paul-bio

Are these assemblies much larger than the estimated genome size?
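One way to answer this question is to total the contig lengths in each FASTA file and divide by the expected haploid genome size. The following is a minimal sketch, not part of hifiasm; the function names and the toy sequences are illustrative assumptions.

```python
from typing import Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def size_ratio(lengths: List[int], expected_genome_size: int) -> float:
    """Assembly size divided by the expected haploid genome size."""
    return sum(lengths) / expected_genome_size


# Toy input standing in for a file such as hap1.p_ctg.fa (hypothetical data).
toy = [">ctg1", "ACGTACGT", "ACGT", ">ctg2", "ACGT"]
lengths = contig_lengths(toy)
print(len(lengths), sum(lengths), size_ratio(lengths, 8))
```

For a real assembly, pass an open file handle (for example `open("hap1.p_ctg.fa")`) instead of the toy list. A haplotype assembly whose ratio is well above 1 may still contain unpurged duplications.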

chhylp123 avatar Jul 01 '22 22:07 chhylp123

Hi @chhylp123

It seems this species has 18Gb of genome size.

Here is a summary of the assembly results.

For hap1.p_ctg.fa image

For hap2.p_ctg.fa image

And for dip.p_ctg.fa image

paul-bio avatar Jul 02 '22 01:07 paul-bio

Looks like the coverage information was inferred incorrectly by hifiasm. Could you please have a try with the methods listed here (https://hifiasm.readthedocs.io/en/latest/faq.html#why-the-size-of-primary-assembly-or-partially-phased-assembly-is-much-larger-than-the-estimated-genome-size)?

chhylp123 avatar Jul 02 '22 01:07 chhylp123

Thanks @chhylp123 .

I will rerun hifiasm two ways (-s 0.5 and -s 0.45).

But since this species has a highly heterozygous genome, should I change the -l parameter as well?

And can you tell me what the default value of -l is?

Thanks again. From Paul.

paul-bio avatar Jul 04 '22 00:07 paul-bio

Sorry for the late reply. The default value is 1. You can try adjusting it, since rerunning hifiasm with the bin files should be very fast.
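A rerun over several -s values could be scripted roughly as below. This is a sketch under stated assumptions: the prefix `asm`, the read file `reads.fq.gz`, and the thread count are placeholders, and the helper only builds and prints the command lines rather than running them. The flags used (-o, -t, -l, -s) are hifiasm options.

```python
import shlex
from typing import List


def hifiasm_command(prefix: str, reads: str, s: float,
                    purge_level: int = 3, threads: int = 32) -> List[str]:
    """Build a hifiasm command line for one -s value (nothing is executed)."""
    return [
        "hifiasm",
        "-o", prefix,            # same prefix, so existing *.bin files can be reused
        "-t", str(threads),
        "-l", str(purge_level),  # purge level
        "-s", str(s),            # similarity threshold for duplicate haplotigs
        reads,
    ]


# Placeholder prefix and read file; replace with real paths.
for s in (0.5, 0.45):
    print(shlex.join(hifiasm_command("asm", "reads.fq.gz", s)))
```

To execute a command instead of printing it, pass the list to `subprocess.run(cmd, check=True)`. Because every run shares one prefix, it is probably safest to copy the output of each run aside before starting the next.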

chhylp123 avatar Jul 07 '22 17:07 chhylp123

In the meantime, I tried twice, with -s 0.5 and -s 0.45, and here is the result:

image

When running with -s 0.45, I got haplotigs of relatively similar size. But this species is expected to have a genome size of 1.8~1.9Gb, and comparing the diploid stats with the haploid stats, it seems that not enough duplications have been purged.

In this case, should I run with -l 3?

I have also uploaded a log file, just in case: nohup.out.txt.

paul-bio avatar Jul 08 '22 05:07 paul-bio

Sorry for the late reply. What if you have a try with https://github.com/dfguan/purge_dups? The default value for -l is -l3.

chhylp123 avatar Jul 13 '22 20:07 chhylp123

Previously you mentioned that the default value for -l is 1: image

It is a bit confusing. Is the default value for -l -l1 or -l3?

Thanks for the reply @chhylp123

paul-bio avatar Jul 14 '22 00:07 paul-bio

It's my bad. The default value for -l is 3.

chhylp123 avatar Jul 14 '22 00:07 chhylp123

Thanks,

image

In the meantime, I ran it 11 times with different parameters. It seems it would be best to purge duplications from the p_ctg.fa file (diploid type).

In this case, can I run purge_dups on p_ctg.fa (diploid) rather than hap1.p_ctg.fa (haploid 1)?

paul-bio avatar Jul 14 '22 01:07 paul-bio

Yes. Please note that when you align HiFi reads, it would be better to utilize both p_ctg.gfa and a_ctg.gfa.

chhylp123 avatar Jul 14 '22 01:07 chhylp123

Thank you so much for your suggestions.

I will try it and let you know what the results are.

paul-bio avatar Jul 14 '22 01:07 paul-bio

Hello @chhylp123 ,

When using purge_dups, I need an alternative assembly (hap_asm).

However, the alternative files are only produced when the --primary option is given. In question #243, you said .p_utg.gfa is an alternative contig file.

Can I use .p_utg.fa as a_ctg.fa (the alternative assembly, hap_asm)?

Thanks.

paul-bio avatar Jul 14 '22 06:07 paul-bio

No, p_utg.gfa is the assembly graph. You should use a_ctg.gfa as the alternative assembly.
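hifiasm writes its contigs as GFA, while downstream tools such as purge_dups work on FASTA, so the sequences need to be extracted from the GFA segment (S) lines first. The following is a minimal sketch of that conversion; the function names and the toy records are illustrative assumptions.

```python
from typing import Iterable, Iterator, Tuple


def gfa_segments(gfa_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for every segment (S) line of a GFA file."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "S":
            yield fields[1], fields[2]


def gfa_to_fasta(gfa_lines: Iterable[str]) -> str:
    """Render all GFA segments as FASTA text; other record types are skipped."""
    return "".join(f">{name}\n{seq}\n" for name, seq in gfa_segments(gfa_lines))


# Toy records standing in for a file such as a_ctg.gfa (hypothetical data).
toy_gfa = [
    "S\tptg000001l\tACGTACGT\tLN:i:8\n",
    "A\tptg000001l\t0\t+\tread1\t0\t8\n",
    "S\tptg000002l\tGGCC\tLN:i:4\n",
]
print(gfa_to_fasta(toy_gfa), end="")
```

For large assemblies it would be better to stream each record to an output file rather than build one string in memory.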

chhylp123 avatar Jul 14 '22 18:07 chhylp123

Hi @chhylp123, I got different results.

When I ran hifiasm using the command below: $hifiasm -l1 -s 0.5 --hg-size 1.9g --primary

I got *a_ctg.fa (alternative contigs) and *p_ctg.fa (primary contigs), both of which were used for purge_dups.

And I got BUSCO values of C 76.1, S 71.4, D 4.7, F 8.2, M 15.7.

However, I then reran hifiasm without the --primary option and used *p_ctg.fa and *p_utg.fa for purge_dups (just for test purposes).

And I got C 93.3, S 78.3, D 15.0, F 3.8, M 2.9.
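The two BUSCO summaries can be sanity-checked with simple arithmetic: complete (C) should equal single-copy (S) plus duplicated (D), and C + F + M should total 100%. The sketch below uses the percentages quoted in this comment; the helper name and the rounding tolerance are assumptions.

```python
from typing import Dict


def busco_consistent(v: Dict[str, float], tol: float = 0.15) -> bool:
    """Check C == S + D and C + F + M == 100, within a rounding tolerance."""
    return (abs(v["S"] + v["D"] - v["C"]) <= tol
            and abs(v["C"] + v["F"] + v["M"] - 100.0) <= tol)


# Percentages quoted in the thread for the two purge_dups runs.
with_primary = {"C": 76.1, "S": 71.4, "D": 4.7, "F": 8.2, "M": 15.7}
without_primary = {"C": 93.3, "S": 78.3, "D": 15.0, "F": 3.8, "M": 2.9}

for name, run in (("--primary", with_primary), ("no --primary", without_primary)):
    duplicated_share = run["D"] / run["C"]  # fraction of complete BUSCOs that are duplicated
    print(name, busco_consistent(run), round(duplicated_share, 3))
```

Both runs pass the consistency check. The second run recovers more complete BUSCOs, but a larger share of them is duplicated.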

Is it okay for me to use *p_utg.fa as an alternative contig?

paul-bio avatar Jul 18 '22 13:07 paul-bio

Purge_dups is a little bit tricky to run. There are a lot of issues in the Purge_dups repo discussing how to select appropriate parameters for it, so you need to take care with that. In addition, could you check the coordinates of those duplicated genes and manually filter out some wrong duplications? BUSCO is not that accurate in some cases.
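Checking the coordinates of duplicated genes could start from BUSCO's per-gene table. The sketch below assumes a tab-separated table whose first five columns are BUSCO id, status, sequence name, start, and end, with duplicated entries marked `Duplicated`. That layout is an assumption to verify against the BUSCO version in use, and the ids and contig names in the toy data are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, int, int]  # (sequence name, start, end)


def duplicated_hits(table_lines: Iterable[str]) -> Dict[str, List[Hit]]:
    """Group the coordinates of 'Duplicated' entries by BUSCO id."""
    hits: Dict[str, List[Hit]] = defaultdict(list)
    for line in table_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[1] != "Duplicated":
            continue
        busco_id, _, seq, start, end = fields[:5]
        hits[busco_id].append((seq, int(start), int(end)))
    return dict(hits)


# Toy table with made-up ids and contig names (hypothetical data).
toy_table = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\n",
    "1001at0000\tDuplicated\tptg000001l\t100\t900\n",
    "1001at0000\tDuplicated\tptg000007l\t150\t950\n",
    "1002at0000\tComplete\tptg000002l\t10\t500\n",
]
for busco_id, copies in duplicated_hits(toy_table).items():
    print(busco_id, copies)
```

Copies that fall on different contigs may point to retained haplotigs, while copies on the same contig may be genuine duplications; either case still needs manual inspection.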

chhylp123 avatar Jul 25 '22 20:07 chhylp123

Hello Paul,

I'm also about to assemble a species with a very large genome. This species has a 17G genome. It is being sequenced to 30×, but so far the sequencing company has returned only 2 cells of data, which is 4×. I want to try assembling the genome now, but I am worried that my server is not up to it.
The server has 1T of RAM and 80 threads. hifiasm is running now, and with only this 4× data it is using 626G of memory; I am not sure whether it will keep using more. I am worried about whether the server will cope with the 30× HiFi data. Could you tell me how much memory your assembly used?

Many thx.

zhang144999 avatar Aug 15 '22 04:08 zhang144999

4x coverage might take more RAM than 30x. 1TB should be fine.

chhylp123 avatar Aug 15 '22 13:08 chhylp123

Hi, chhylp123

Does this mean that our server can assemble this 17G species once all the 30x data are returned? What about adding Hi-C data? Our Hi-C data is 100x.

Many thx.

zhang144999 avatar Aug 17 '22 07:08 zhang144999

I think it should be fine. Low coverage confuses hifiasm so that it cannot identify the right parameters, making the memory requirement extremely large.

chhylp123 avatar Aug 17 '22 13:08 chhylp123