paraphase icon indicating copy to clipboard operation
paraphase copied to clipboard

Understanding CNV Distribution and Counts Across Haplotypes

Open sloth-eat-pudding opened this issue 1 year ago • 5 comments

I have recently begun studying CNVs (Copy Number Variations) and am utilizing the hg002 dataset for paraphase analysis. Upon comparing my paraphase results with the data available on your GitHub page, I noticed that the phase results for SMN1 vary, showing 1 to 3 haplotypes (haps). Specifically, in the case of HG00673, I am intrigued to understand whether the presence of only one haplotype corresponds to a situation of "(0+1)", which I interpret as one copy of SMN1 on one chromosome and none on the other. I am somewhat uncertain about this interpretation and would greatly appreciate your insights to clarify my understanding.

  HG002 HG00673 HG01175 HG01884 HG02300 NA19238
Smn1hap1
Smn1hap2  
Smn1hap3        
Copies 1+1 1+0 1+1 1+1 1+2 1+2
             
Smn2del78hap1          
Smn2hap1
Smn2hap2      
Copies 1+1 1+1 1+0 1+1 1+0 1+0

If my understanding is not accurate, could you please explain how I should interpret the results to determine the corresponding number of copies and the positions associated with the haplotypes? For instance, how can I distinguish between a 0+2 and a 1+1 situation?

Thank you very much for your support and guidance.

sloth-eat-pudding avatar Nov 13 '23 14:11 sloth-eat-pudding

Hi @sloth-eat-pudding the copy numbers are reported in the smn1_cn, smn2_cn and smn2_del78_cn fields in the json file. Even if there is only SMN1 haplotype found, as in the case of HG00673, the SMN1 copy number is 2, because there are two identical SMN1 haplotypes based on depth (you can find this info in the two_copy_haplotypes field). So HG00673 is not a 0+1 carrier. It's only when smn1_cn is called as one that a sample is a carrier.

Please also refer to this page for interpreting SMN1/SMN2.

Paraphase reports summed copy numbers of two alleles, that is, it does not phase SMN1/2 copies into two alleles at the moment. So it does not directly distinguish between 1+1 vs. 0+2. In our paper, we showed that SMN1/2 copies can be assigned to haplogroups (reported in the json file under the haplotype_details field), which can be used for statistical phasing when more population data are summarized into a database in the future. For example, S1-8 and S1-9d are very likely to be on the same allele and they could suggest a silent carrier when a sample only has these two SMN1 copies. Our paper provides an estimate of the probabilities using a limited set of samples, and these calculations can be improved as more and more population samples are sequenced and analyzed.

xiao-chen-xc avatar Nov 13 '23 18:11 xiao-chen-xc

After running the Pacbio HG00673, I carefully reviewed the output JSON file and noticed that the smn1_cn value is listed as 1. Additionally, I observed that the two_copy_haplotypes field is empty. I am concerned about these findings and wonder if I might have made an error in the process.

If providing additional information would facilitate a more comprehensive understanding of the situation, please let me know. I am more than willing to supply any necessary details to resolve this issue.

sloth-eat-pudding avatar Nov 23 '23 10:11 sloth-eat-pudding

Hi @sloth-eat-pudding, what is the input bam file you were using for Paraphase, and could you share with me the command you ran?

xiao-chen-xc avatar Nov 28 '23 22:11 xiao-chen-xc

Could you share with me the json output by Paraphase?

xiao-chen-xc avatar Nov 29 '23 19:11 xiao-chen-xc

I have rechecked the downloaded unmapped BAM file and found that it is corrupted. This has caused issues with my results post realignment. The discrepancy in the results was due to my oversight, and I apologize for any inconvenience caused. I will now restart using …

This is my output JSON. You might not need it anymore.

sloth-eat-pudding avatar Dec 01 '23 08:12 sloth-eat-pudding