yahs
Assembly metrics have gotten worse than hifiasm-only assembly
Hi Zhou: After running yahs, I have gotten worse assembly metrics than with the hifiasm-only assembly. The number of genome scaffolds has increased from 1,200 to 9,500, the scaffold N50s have decreased from 32MB to 9MB, and the max scaffold length has declined from 200 MB to 100 MB. My assembly metrics also became worse when using SALSA2 and when running HiFiasm with Hi-C integration. I am using Hi-C data from Dovetail Genomics that was developed using their Chicago Library approach, and I used the Arima pipeline to generate BAM files. I have wondered if there is an issue with my Hi-C data, but my stats file from the Arima pipeline suggests the Hi-C data is good. We have 30X coverage with HiFi data. We have a mammal species with a challenging genome to assemble because of its relatively large genome size (6GB), high repeat content (~50%), and high levels of heterozygosity. We are studying a diploid species with 24 chromosomes. slurm-23990094.out.txt
Do you have thoughts on why our HiFi assembly is getting worse when we scaffold with Hi-C data?
Thank you in advance. cheers, Andreas
Hello @andreaschavez,
It seems like YaHS/SALSA2 made too many contig breaks. The first thing you could try is to run YaHS with the option `--no-contig-ec`, which will suppress contig breaks. But with this option, you will likely see a lot of oddness in your HiC maps, either for contigs (you can check the big ones) or for scaffolds after scaffolding.
I am not sure about the problem. Most likely, your HiC data quality is poor. Have you checked the HiC mapping results, such as the mapping rate, mapping quality etc.? Also, is it possible the HiC data was from a different sample or species?
Best, Chenxi
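One way to check the mapping rate and mapping quality asked about above is to tally them from SAM text records. The following is a minimal sketch, not part of YaHS or the Arima pipeline: the function name `summarize_mapq` is hypothetical, the threshold of 10 is an illustrative default, and skipping secondary and supplementary records is a simplifying assumption that may not match how a given pipeline counts Hi-C reads.

```python
def summarize_mapq(sam_lines, threshold=10):
    """Count primary records, mapped records, and records with MAPQ >= threshold.

    `sam_lines` is any iterable of SAM text lines (header lines are ignored).
    Assumption: secondary (0x100) and supplementary (0x800) records are skipped.
    """
    total = mapped = passing = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or SAM header
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: MAPQ
        if flag & 0x900:
            continue  # secondary or supplementary alignment
        total += 1
        if flag & 0x4:
            continue  # read is unmapped
        mapped += 1
        if mapq >= threshold:
            passing += 1
    return {"reads": total, "mapped": mapped, "mapq_pass": passing}
```

The function could be fed the text output of `samtools view` on the Hi-C alignment file. A large gap between `mapped` and `mapq_pass` would suggest that many Hi-C links are being discarded by mapping quality filtering.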
Hi Chenxi: I will give the `--no-contig-ec` option a try.
According to the stats file generated with the Arima pipeline, I believe our Hi-C data is pretty good, with 95% of the intra data being >20kb "long-cis interactions." The Hi-C data were from the same individual sample as the HiFi data. I'll report back. Thanks. Andreas
| Arima Stats | Reads | % reads | description |
|---|---|---|---|
| All | 312,571,918 | | |
| All inter "trans interactions" | 32,156,505 | 10% | inter/all |
| All intra "short and long cis interactions" | 280,415,413 | 90% | intra/all |
| All intra 1kb | 10,214,550 | | |
| All intra 10kb | 1,875,255 | | |
| All intra 15kb | 948,391 | | |
| All intra 20kb "short-cis interactions" | 591,696 | 5% | all <20kb/intra total |
| All intra >20kb "long-cis interactions" | 266,785,521 | 95% | all >20kb/intra total |
| **All intra "short and long cis interactions"** | **280,415,413** | **85%** | **all >20kb/all** |
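The percentages in the Arima stats can be re-derived from the read counts. The sketch below uses only the numbers reported in the table; the variable names are illustrative, and treating the 1kb, 10kb, 15kb and 20kb rows as disjoint distance bins is an assumption, although the counts are consistent with it because the four bins plus the >20kb row add up to the intra total.

```python
# Read counts copied from the Arima stats table.
total = 312_571_918
inter = 32_156_505
intra = 280_415_413
short_bins = {"1kb": 10_214_550, "10kb": 1_875_255, "15kb": 948_391, "20kb": 591_696}
long_cis = 266_785_521

# Assumption: the four short-range rows are disjoint bins, so their sum is all <20kb.
short_cis = sum(short_bins.values())


def pct(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


print(f"inter/all:          {pct(inter, total):.1f}%")
print(f"intra/all:          {pct(intra, total):.1f}%")
print(f"<20kb/intra total:  {pct(short_cis, intra):.1f}%")
print(f">20kb/intra total:  {pct(long_cis, intra):.1f}%")
print(f">20kb/all:          {pct(long_cis, total):.1f}%")
```

Under this reading, the 5% figure corresponds to the sum of the four short-range bins rather than to the 20kb row alone, and the 85% figure is the >20kb count divided by all read pairs.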
Hi Andreas,
I am facing a similar problem and was wondering if you were able to solve the issue. I have used `--no-contig-ec` in my existing command but am still facing the same problem: the stats are better for the contig-level assembly than after scaffolding.
Thanks, Afiya
Does the Hi-C come from the same individual as the PacBio?
Hello,
Yes. The HiC is from the same individual. Either the metrics are better at the contig level than after scaffolding or, in certain cases, they stay the same as in the contig-level assembly and do not improve.
Thanks, Afiya
I have the same problem. YaHS is breaking full-length chromosomes into smaller pieces, and `--no-scaffold-ec` seems to have no effect.
Please provide a fix.
Hi @gunjanpandey,
There are several possibilities. It could simply be because your HiC data is not good enough, or because there are a lot of misassemblies. Instead of `--no-scaffold-ec`, you need `--no-contig-ec`, which asks YaHS not to make any contig error corrections before scaffolding. While this will prevent YaHS from producing a worse assembly, it is very likely you will not see much improvement in assembly contiguity after scaffolding.
A more likely reason is that your genome is very repetitive. YaHS uses a mapping quality threshold of 10 for alignment filtering by default. With this filtering, many regions of your genome will have no HiC links. You can make a HiC plot for a quick check. If that is the case, you could try running YaHS with `-q 0`, which will force YaHS to use all HiC links irrespective of mapping quality. This is kind of risky, though.
Best, Chenxi
Also, if you want to run YaHS with `-q 0`, you need to make sure the low mapping quality reads were not filtered out of your input alignment file. You can use the BAM file input, which usually includes all alignments.
Chenxi
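The two suggestions in this thread (`--no-contig-ec` and `-q 0`) can be combined into a command line as in the sketch below. This is an illustration, not an official wrapper: the helper `build_yahs_command` and the file names `contigs.fa`, `hic_to_contigs.bam`, `yahs_noec` and `yahs_q0` are hypothetical, and the helper only assembles the argument list without running anything.

```python
import shlex


def build_yahs_command(contigs_fa, hic_bam, prefix="yahs.out",
                       no_contig_ec=False, min_mapq=None):
    """Assemble a yahs argument list (hypothetical helper, runs nothing)."""
    cmd = ["yahs", "-o", prefix]
    if no_contig_ec:
        cmd.append("--no-contig-ec")     # skip contig error correction (no contig breaks)
    if min_mapq is not None:
        cmd += ["-q", str(min_mapq)]     # minimum mapping quality; the default is 10
    cmd += [contigs_fa, hic_bam]         # positional inputs: contigs, then Hi-C alignments
    return cmd


# Placeholder file names; substitute your own inputs.
conservative = build_yahs_command("contigs.fa", "hic_to_contigs.bam",
                                  prefix="yahs_noec", no_contig_ec=True)
permissive = build_yahs_command("contigs.fa", "hic_to_contigs.bam",
                                prefix="yahs_q0", no_contig_ec=True, min_mapq=0)

print(shlex.join(conservative))
print(shlex.join(permissive))
```

The resulting lists could be passed to `subprocess.run(cmd, check=True)`. Using a separate output prefix for each run makes it easier to compare the scaffold statistics and HiC maps of the two settings side by side.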