hifiasm icon indicating copy to clipboard operation
hifiasm copied to clipboard

The purging step (-l 3) combined with the purge_dups tool

Open pailloufat-stack opened this issue 2 years ago • 18 comments

Hello,

Is it suitable to run the purge_dups pipeline on my assembly obtained with the 0.15.1 version ?

I assembled a low heterozygous genome (~1.1/1.2Gb expected size) , which gives :

# number of contigs:     854
# total contigs length:  1440196919
# mean contig size:      1686413.25
# contig size first quartile: 34235
# median contig size:         51496
# contig size third quartile: 302436
# longest contig:             74995019
# shortest contig:            15598
# contigs > 500 nt:           854 (100.00 %)
# contigs > 1K nt:            854 (100.00 %)
# contigs > 10K nt:           854 (100.00 %)
# contigs > 100K nt:          319 (37.35 %)
# contigs > 1M nt:            109 (12.76 %)
# N50:                   29396218
# L50:                   15
# N80:                   8003566
# L80:                   43

I was curious to run purge_dups on it , because the size is larger than expected. As output, I obtained :

# number of contigs:     109
# total contigs length:  1121833970
# mean contig size:      10292054.77
# contig size first quartile: 430703
# median contig size:         3327151
# contig size third quartile: 11643196
# longest contig:             74995019
# shortest contig:            25305
# contigs > 500 nt:           109 (100.00 %)
# contigs > 1K nt:            109 (100.00 %)
# contigs > 10K nt:           109 (100.00 %)
# contigs > 100K nt:          99 (90.83 %)
# contigs > 1M nt:            68 (62.39 %)
# N50:                   34583325
# L50:                   11
# N80:                   11643196
# L80:                   27

The results look better after that.

So, I got a bit confused because in the doc, it is said it is not necessary to run purge_dups since there is already a purging step in the 0.15 version.

What do you think about that?

Bests.

pailloufat-stack avatar Jul 28 '21 06:07 pailloufat-stack

Is it a diploid sample? What the heterozygosity rate of it? How do you estimate expected size? The standalone purge_dups often makes primary assembly too small.

chhylp123 avatar Jul 28 '21 07:07 chhylp123

It is a diploid sample yes. I do that assembly for another team, and they know some closed species which are around 1.2Gb and so, they expect that size for that genome.

I used jellyfish to get an idea of the rate of the heterozygosity. In my opinion, that rate is pretty low.

Genomescope_E762_bis plot

pailloufat-stack avatar Jul 28 '21 08:07 pailloufat-stack

Do you have the size of bp.hap1.gfa and bp.hap2.gfa? How do you estimate the genome size of closed species? Is it estimated by the CLR assemblies? CLR assemblies tend to be smaller. For example, CLR assemblies of human are usually 2.8Gb or 2.9Gb in size, which are too small so that many segdups are collapsed. The hifiasm assemblies are around 3.09Gb for human and we believe it should be right as there are not too many duplicated genes. I guess probably you can run BUSCO in assemblies with purge_dups and without purge_dups to see the BUSCO scores.

chhylp123 avatar Jul 28 '21 08:07 chhylp123

If you really want to dedup hifiasm outputs aggressively, you can set --n-hap 3. But I personally think it may make primary assembly too small.

chhylp123 avatar Jul 28 '21 08:07 chhylp123

For both haplytopes, I have :

# number of contigs:     1045
# total contigs length:  1325235316
# mean contig size:      1268167.77
# contig size first quartile: 35642
# median contig size:         68791
# contig size third quartile: 379581
# longest contig:             74782138
# shortest contig:            15598
# contigs > 500 nt:           1045 (100.00 %)
# contigs > 1K nt:            1045 (100.00 %)
# contigs > 10K nt:           1045 (100.00 %)
# contigs > 100K nt:          449 (42.97 %)
# contigs > 1M nt:            139 (13.30 %)
# N50:                   28690572
# L50:                   14
# N80:                   2579315
# L80:                   57

and

# number of contigs:     593
# total contigs length:  1383303835
# mean contig size:      2332721.48
# contig size first quartile: 70276
# median contig size:         212759
# contig size third quartile: 681696
# longest contig:             74946436
# shortest contig:            23687
# contigs > 500 nt:           593 (100.00 %)
# contigs > 1K nt:            593 (100.00 %)
# contigs > 10K nt:           593 (100.00 %)
# contigs > 100K nt:          385 (64.92 %)
# contigs > 1M nt:            125 (21.08 %)
# N50:                   28308572
# L50:                   15
# N80:                   5565258
# L80:                   49

The expected size comes from an article (an assembly with a combination of long reads CLR Pacbio 21X and short reads Illumina 135X). It is 1.26Gb.

I'll run BUSCO on both assemblies (purged and not purged) and I will let you know.

pailloufat-stack avatar Jul 28 '21 09:07 pailloufat-stack

I got my BUSCO results :

busco_figure_E762_Psec

As you said, using purge_dups is too aggressive and reduce the contiguity of my primary assembly. Of course, the number of duplicated genes is reduced with purge_dups (609 -> 311) but I miss too much genes.

pailloufat-stack avatar Jul 29 '21 06:07 pailloufat-stack

Yean, I also guess segdups are collapsed by purge_dups but BUSCO score does not demonstrate that.

chhylp123 avatar Jul 29 '21 11:07 chhylp123

Maybe you can set smaller value for -s, like -s 0.5 or -s 0.45 to see if hifiasm can reduce the genome size without losing genes.

chhylp123 avatar Jul 29 '21 12:07 chhylp123

As you suggested to set -s 0.45 , I tried it. I nearly got the same assembly metrics :

# number of contigs:     853
# total contigs length:  1440131087
# mean contig size:      1688313.11
# contig size first quartile: 34235
# median contig size:         51496
# contig size third quartile: 302436
# longest contig:             74995019
# shortest contig:            15598
# contigs > 500 nt:           853 (100.00 %)
# contigs > 1K nt:            853 (100.00 %)
# contigs > 10K nt:           853 (100.00 %)
# contigs > 100K nt:          319 (37.40 %)
# contigs > 1M nt:            109 (12.78 %)
# N50:                   29396218
# L50:                   15
# N80:                   8003566
# L80:                   43

pailloufat-stack avatar Aug 02 '21 06:08 pailloufat-stack

As the heterozygosity rate of your sample is low, I think hifiasm shouldn't have issue. Probably the default results are not too bad.

chhylp123 avatar Aug 02 '21 12:08 chhylp123

Probably you can also run BUSCO on *bp.hap1* and *bp.hap2* to see the results.

chhylp123 avatar Aug 02 '21 12:08 chhylp123

I agree, I should stay with the defaults v15 results. That's the BUSCO results on both haplotypes :

busco_figure_haplotypes_v15_Psec

pailloufat-stack avatar Aug 03 '21 11:08 pailloufat-stack

I also used the v0.12 version of hifiasm , and I got these metrics on the primary assembly :

# number of contigs:     653
# total contigs length:  1333035279
# mean contig size:      2041401.65
# N50:                   35486345
# L50:                   12
# N80:                   14755458
# L80:                   28

They are better than the primary assembly of v0.15 , is that normal?

I also get the same problem if I use purge_dups on the primary assembly obtained with 0.15 (reducing the size contiguity) :

busco_figure_v12_assembly

pailloufat-stack avatar Aug 03 '21 11:08 pailloufat-stack

Well, I don't think v0.12 is much better than v0.15. If you want to make v0.15 outputs primary assembly like v0.12, you can use --n-hap 3. This will disable purging optimization with diploid assumption.

chhylp123 avatar Aug 03 '21 14:08 chhylp123

When I set up --n-hap 3 with v0.15, I got :

# number of contigs:     828
# total contigs length:  1431860996
# mean contig size:      1729300.72
# contig size first quartile: 34036
# median contig size:         50048
# contig size third quartile: 280332
# longest contig:             74995019
# shortest contig:            15598
# contigs > 500 nt:           828 (100.00 %)
# contigs > 1K nt:            828 (100.00 %)
# contigs > 10K nt:           828 (100.00 %)
# contigs > 100K nt:          302 (36.47 %)
# contigs > 1M nt:            104 (12.56 %)
# N50:                   31566718
# L50:                   15
# N80:                   8251318
# L80:                   41

What it looks pretty close of the v0.15 default paramaters results. even better (854 -> 828 contigs and N50 29.3Mb -> 31.5Mb). The size remains similar (1.440Gb -> 1.431Mb) . That's the BUSCO score : C:97.4%[S:87.7%,D:9.7%],F:0.5%,M:2.1%,n:5950

But if I compared to the v0.12 default paramaters primary assembly, even if I use --n-hap 3 , it does not seem better. Whether it be in number of contigs or genome size or N50... And the v0.12 assembly size looks closer to the reality. That's the metrics of the v0.12 assembly :

# number of contigs:     653
# total contigs length:  1333035279
# mean contig size:      2041401.65
# contig size first quartile: 31810
# median contig size:         40830
# contig size third quartile: 75017
# longest contig:             74995018
# shortest contig:            15598
# contigs > 500 nt:           653 (100.00 %)
# contigs > 1K nt:            653 (100.00 %)
# contigs > 10K nt:           653 (100.00 %)
# contigs > 100K nt:          140 (21.44 %)
# contigs > 1M nt:            68 (10.41 %)
# N50:                   35486345
# L50:                   12
# N80:                   14755458
# L80:                   28

When I look at the v0.12 BUSCO score , I have : C:91.6%[S:86.6%,D:5.0%],F:0.5%,M:7.9%,n:5950 .

Would you rather be confident in the metrics (N50 , number of contigs ...) or in the BUSCO scores?

pailloufat-stack avatar Aug 04 '21 07:08 pailloufat-stack

Probably can you have a look what are these duplicated genes that only appear in v0.15 assembly? Another possibility is that BUSCO is not such reliable. Based on our experiments, I personally think v0.15 is definitely better. But maybe your sample is special. All these things are case-by-case. By the way, I guess version number less than 0.14.2 should be able to generate similar assembly as v0.12.

chhylp123 avatar Aug 04 '21 12:08 chhylp123

@pailloufat-stack does this sample have different sex chromosomes (like chrX/chrY for human males)?

lh3 avatar Aug 04 '21 13:08 lh3

@chhylp123 I'm gonna have a look on these duplicated genes, you're right! I will let you know.

So, many of the duplicated genes (393) found by BUSCO are the same between the v0.12 and the v0.15.

216 are unique for the v0.15 and I had a quick look on some of them. They are involved into resistance processes, what is not really surprising to spot duplicated genes into that kind of process as far as I know. Also, I have a publication which talks about duplication events for my genome, and the high rate for my v0.15 assembly could be close to the reality.

Does the small bump at 50X in my genomescope histogram correspond to repeated regions?

@lh3 The sample doesn't have sexual chromosomes.

pailloufat-stack avatar Aug 04 '21 15:08 pailloufat-stack