Assembly metrics have gotten worse than hifiasm-only assembly

Open andreaschavez opened this issue 1 year ago • 5 comments

Hi Zhou: After running yahs, I have gotten worse assembly metrics than with the hifiasm-only assembly. The number of genome scaffolds has increased from 1,200 to 9,500, the scaffold N50 has decreased from 32 Mb to 9 Mb, and the max scaffold length has declined from 200 Mb to 100 Mb. My assembly metrics also became worse when using SALSA2 and when running hifiasm with Hi-C integration. I am using Hi-C data from Dovetail Genomics that was developed using their Chicago Library approach, and I used the Arima pipeline to generate BAM files. I have wondered if there is an issue with my Hi-C data, but my stats file from the Arima pipeline suggests the Hi-C data is good. We have 30X coverage with HiFi data. We have a mammal species with a challenging genome to assemble because of its relatively large genome size (6 Gb), high repeat content (~50%), and high levels of heterozygosity. We are studying a diploid species with 24 chromosomes.

Attached log: slurm-23990094.out.txt

Do you have thoughts on why our HiFi assembly is getting worse when we scaffold with Hi-C data?

Thank you in advance. cheers, Andreas

andreaschavez avatar Mar 21 '23 13:03 andreaschavez

Hello @andreaschavez,

It seems like YaHS/SALSA2 made too many contig breaks. The first thing you could try is to run YaHS with the option --no-contig-ec which will suppress contig breaks. But with this option, you will likely see a lot of oddness in your HiC maps - either for contigs (you can check those big ones) or for scaffolds after scaffolding.

I am not sure about the problem. Most likely, your HiC data quality is poor. Have you checked the HiC mapping results, such as the mapping rate, mapping quality etc.? Also, is it possible the HiC data was from a different sample or species?
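For readers who want to run the check suggested here, the mapping rate and mapping quality can be tallied directly from SAM records (for example, the text output of `samtools view`). The following is a minimal sketch and not part of YaHS: the function name `summarize_sam` and the toy records are made up for illustration, and the MAPQ cutoff of 10 is only an example value.

```python
def summarize_sam(lines, min_mapq=10):
    """Tally mapping rate and MAPQ pass rate from SAM text records."""
    total = mapped = passing = 0
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if flag & 0x4:
            continue  # 0x4 means the read is unmapped
        mapped += 1
        if mapq >= min_mapq:
            passing += 1
    return {
        "reads": total,
        "mapping_rate": mapped / total if total else 0.0,
        "mapq_pass_rate": passing / mapped if mapped else 0.0,
    }


# Toy records (hypothetical). Columns: QNAME, FLAG, RNAME, POS, MAPQ,
# CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL.
toy = [
    "@SQ\tSN:ctg1\tLN:1000",
    "r1\t65\tctg1\t10\t60\t50M\t=\t500\t0\t*\t*",
    "r1\t129\tctg1\t500\t3\t50M\t=\t10\t0\t*\t*",
    "r2\t69\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r2\t137\tctg1\t700\t20\t50M\t*\t0\t0\t*\t*",
    "r3\t2113\tctg1\t800\t60\t20M30S\t=\t10\t0\t*\t*",
]

stats = summarize_sam(toy)
print(stats)
```

On real data, the same function could be fed from standard input, for instance by piping `samtools view` output into a script that passes `sys.stdin` as `lines`.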

Best, Chenxi

c-zhou avatar Apr 03 '23 14:04 c-zhou

Hi Chenxi: I will give the --no-contig-ec option a try.

According to the stats file generated with the Arima pipeline, I believe our Hi-C data is pretty good, with 95% of the intra data being >20kb "long-cis interactions." The Hi-C data were from the same individual sample as the HiFi data. I'll report back. Thanks. Andreas

| Arima Stats | Reads | % reads | Description |
|---|---|---|---|
| All | 312,571,918 | | |
| All inter "trans interactions" | 32,156,505 | 10% | inter/all |
| All intra "short and long cis interactions" | 280,415,413 | 90% | intra/all |
| All intra 1kb | 10,214,550 | | |
| All intra 10kb | 1,875,255 | | |
| All intra 15kb | 948,391 | | |
| All intra 20kb "short-cis interactions" | 591,696 | 5% | all <20kb/intra total |
| All intra >20kb "long-cis interactions" | 266,785,521 | 95% | all >20kb/intra total |
| **All intra "short and long cis interactions"** | **280,415,413** | **85%** | **all >20kb/all** |
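The percentages in the table above can be recomputed from the raw read counts. The sketch below is only a sanity check on the posted numbers. It assumes, based on the arithmetic, that the 5% figure covers the sum of the four short-range rows (1kb, 10kb, 15kb and 20kb) rather than the 20kb row alone; the variable names are illustrative.

```python
# Read counts copied from the Arima stats table above.
total = 312_571_918
inter = 32_156_505
intra = 280_415_413
short_bins = [10_214_550, 1_875_255, 948_391, 591_696]  # 1kb, 10kb, 15kb, 20kb rows
long_cis = 266_785_521


def pct(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Assumption: "short-cis" means all four <20kb rows added together.
short_cis = sum(short_bins)

# The counts are internally consistent.
assert inter + intra == total
assert short_cis + long_cis == intra

print("inter/all:            ", pct(inter, total))      # 10
print("intra/all:            ", pct(intra, total))      # 90
print("<20kb/intra total:    ", pct(short_cis, intra))  # 5
print(">20kb/intra total:    ", pct(long_cis, intra))   # 95
print(">20kb/all:            ", pct(long_cis, total))   # 85
```

Under that assumption, the 85% in the last row is the share of long-cis pairs among all reads, even though the read count shown in that row is the intra total.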

andreaschavez avatar Apr 05 '23 16:04 andreaschavez

Hi Andreas,

I am facing a similar problem and was wondering if you were able to solve the issue. I have used "--no-contig-ec" in my existing command but am still facing the same problem: the stats are better for the contig-level assembly.

Thanks, Afiya

afiyachida avatar Apr 08 '24 21:04 afiyachida

Does the Hi-C come from the same individual as the PacBio?

richarddurbin avatar Apr 09 '24 23:04 richarddurbin

Hello,

Yes. The HiC is from the same individual. Either the metrics are better at the contig level or, in certain cases, they stay the same as for the contig-level assembly and do not improve.

Thanks, Afiya

afiyachida avatar Apr 11 '24 17:04 afiyachida

I have the same problem. YaHS is breaking full-length chromosomes into smaller pieces and --no-scaffold-ec seems to have no effect.

Please provide a fix.

gunjanpandey avatar Oct 22 '24 10:10 gunjanpandey

Hi @gunjanpandey,

There are several possibilities. It could simply be that your HiC data is not good enough, or that there are a lot of misassemblies. Instead of '--no-scaffold-ec', you need '--no-contig-ec', which asks YaHS not to make any contig error corrections before scaffolding. While this will prevent YaHS from producing a worse assembly, it is very likely that you will not see much improvement in assembly contiguity after scaffolding.

A more likely reason is that your genome is very repetitive. YaHS uses a mapping quality threshold of 10 for alignment filtering by default. With this filtering, many regions of your genome will have no HiC links. You can make a HiC plot for a quick check. If that is the case, you could try running YaHS with -q 0, which forces YaHS to use all HiC links irrespective of mapping quality, though this is somewhat risky.
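A HiC plot is essentially a binned count matrix of read-pair positions, so the quick check described here can be roughed out in a few lines. The sketch below bins links along a single sequence and reports bins that receive no links once a MAPQ cutoff is applied. It is an illustration only, not how YaHS works internally: the function names `bin_links` and `empty_bins` and the toy links are assumptions, and real plots are normally made with dedicated tools.

```python
def bin_links(links, seq_len, bin_size, min_mapq=10):
    """Build a symmetric contact matrix from (pos1, pos2, mapq) links on one sequence."""
    n_bins = -(-seq_len // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2, mapq in links:
        if mapq < min_mapq:
            continue  # link removed by the mapping quality filter
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1
    return matrix


def empty_bins(matrix):
    """Indices of bins with no links at all (blank stripes in a HiC plot)."""
    return [i for i, row in enumerate(matrix) if sum(row) == 0]


# Toy data (hypothetical): a 400 bp sequence in 100 bp bins, 0-based positions.
# Every link touching bin 2 (positions 200-299) has MAPQ 0, as in a repeat.
links = [
    (10, 150, 60),
    (20, 350, 40),
    (120, 320, 30),
    (210, 30, 0),
    (250, 390, 0),
]

print(empty_bins(bin_links(links, 400, 100, min_mapq=10)))  # [2]
print(empty_bins(bin_links(links, 400, 100, min_mapq=0)))   # []
```

In the toy data, bin 2 is blank at a cutoff of 10 and only gains links at a cutoff of 0, which is the pattern that would suggest trying `-q 0`.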

Best, Chenxi

c-zhou avatar Oct 22 '24 11:10 c-zhou

Also, if you want to run YaHS with -q 0, you need to make sure the low mapping quality reads were not filtered out of your input alignment file. You can use the BAM file input, which usually includes all alignments.
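Putting the suggestions in this thread together, a command line could be assembled as in the sketch below. The helper `build_yahs_cmd` is hypothetical, and `contigs.fa` and `hic-to-contigs.bam` are placeholder file names. The flags `--no-contig-ec` and `-q` are the ones named in this thread; placing the contig FASTA before the alignment file is assumed from the usual YaHS usage and should be checked against `yahs --help`.

```python
import shlex


def build_yahs_cmd(contigs, alignments, no_contig_ec=False, min_mapq=None):
    """Assemble a YaHS argument list; nothing is executed here."""
    cmd = ["yahs"]
    if no_contig_ec:
        cmd.append("--no-contig-ec")  # skip contig error correction, so no contig breaks
    if min_mapq is not None:
        cmd += ["-q", str(min_mapq)]  # MAPQ threshold; the thread gives 10 as the default
    cmd += [contigs, alignments]  # placeholder inputs: contig FASTA, then Hi-C BAM
    return cmd


cmd = build_yahs_cmd("contigs.fa", "hic-to-contigs.bam", no_contig_ec=True, min_mapq=0)
print(shlex.join(cmd))  # yahs --no-contig-ec -q 0 contigs.fa hic-to-contigs.bam
```

To actually launch the run, the list could be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems with file paths.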

Chenxi

c-zhou avatar Oct 22 '24 11:10 c-zhou