Allele-specific analysis for human without phased genome data

Open JFF1594032292 opened this issue 3 years ago • 4 comments

Hello Nicholas,

First, thanks for developing this tool, it's really helpful! I have Hi-C data from one individual, and I want to compare the contacts between the two alleles at ~3000 heterozygous SNPs (to see whether allelic heterogeneity affects the interaction between two segments). The allele-specific mode in HiC-Pro may solve this; I think it is a similar task to the "Allele-specific contact maps" section of your article "HiC-Pro: an optimized and flexible pipeline for Hi-C data processing".

In your paper, the phased GM12878 data could be obtained directly, whereas I only have sequencing data from the Hi-C library. So I can't generate the .vcf file needed to mask the genome, or the input file for ALLELE_SPECIFIC_SNP in config.txt. Maybe I could generate a phased genome from the Hi-C sequencing data? But I am worried about the quality of the phasing because of the low depth at many loci.

How can I handle this, or have I misunderstood the pipeline?

Thanks

JFF1594032292 avatar May 08 '21 09:05 JFF1594032292

Hi,

So you do have a list of 3000 SNPs, but they are not phased, is that correct? Once you have this list of SNPs, you should be able to mask your genome at these loci and to generate a VCF-like file. However, you will not be able to extract real allele-specific contacts ...

I'm not an expert, but I guess that some assembly tools should be able to infer phased genotypes from Hi-C. However, I do not know how much data they require, nor how efficient they are.

Sorry,
N
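To make the masking step concrete, here is a minimal Python sketch that replaces the reference base with `N` at every SNP position listed in a VCF-like file. It is only an illustration on toy data, not HiC-Pro's own code: the function names `snp_positions_from_vcf_lines` and `mask_sequence`, the toy sequence, and the toy VCF lines are all made up for this example. On a real genome, an existing tool such as `bedtools maskfasta` is probably a better choice.

```python
from collections import defaultdict


def snp_positions_from_vcf_lines(lines):
    """Collect 1-based SNP positions per chromosome from VCF-like lines."""
    positions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos = line.split("\t")[:2]
        positions[chrom].append(int(pos))
    return dict(positions)


def mask_sequence(seq, positions):
    """Return seq with every 1-based position in `positions` set to 'N'."""
    chars = list(seq)
    for pos in positions:
        if not 1 <= pos <= len(chars):
            raise ValueError(f"position {pos} is outside the sequence")
        chars[pos - 1] = "N"  # VCF positions are 1-based
    return "".join(chars)


# Toy inputs (hypothetical, for illustration only).
genome = {"chr1": "ACGTACGTAC"}
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t3\t.\tG\tA\t.\tPASS\t.\tGT\t0/1",
    "chr1\t8\t.\tT\tC\t.\tPASS\t.\tGT\t0/1",
]

positions = snp_positions_from_vcf_lines(vcf_lines)
masked = {
    chrom: mask_sequence(seq, positions.get(chrom, []))
    for chrom, seq in genome.items()
}
print(masked["chr1"])  # ACNTACGNAC
```

Masking only removes the mapping bias toward the reference allele. Assigning reads to one parental genome or the other still requires knowing which allele sits on which haplotype, which is the part that unphased SNPs cannot provide.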

nservant avatar May 10 '21 12:05 nservant

Thanks for your reply! It's really helpful.

But I am also confused about which step needs the phasing data. I downloaded the example data from https://zerkalo.curie.fr/partage/HiC-Pro/HiCPro_testdata_as.tar.gz . The snps_CAST_129S1.vcf file in this folder only contains SNPs with 0/1 rather than 0|1, and 0/1 means that they are not phased yet. But I can still run the pipeline and get a matrix labeled with G1 and G2. Is there something wrong with my understanding?

Actually at the beginning, I thought this vcf file should contain the phased genotype data (and only heterozygous SNPs) for the Hi-C sample, but this test data confused me....

Thanks,

Jiang

JFF1594032292 avatar May 11 '21 07:05 JFF1594032292

Hi Jiang,

Indeed, sorry for that. The SNPs are phased but written as "0/1". I agree that it should be "0|1" ... Btw, I may have to check the code, because I'm not sure the genotype information will be parsed correctly with a "|" ... I'll keep the issue open to double check that.

So you are absolutely right, the vcf should indeed contain phased genotypes and only heterozygous SNPs. Sorry for the confusion.

N
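As a rough illustration of "phased genotype and only heterozygous SNPs", the Python sketch below keeps the VCF records whose `GT` field is both phased and heterozygous. It follows the VCF convention (`|` for phased, `/` for unphased), which is an assumption about the input and not about HiC-Pro: as noted above, the test file writes phased SNPs as `0/1`, and whether HiC-Pro parses `|` correctly is still to be checked. The helper names and the toy records are hypothetical.

```python
def parse_gt(gt):
    """Split a VCF GT string into (alleles, phased)."""
    if "|" in gt:
        return gt.split("|"), True
    return gt.split("/"), False


def is_heterozygous(alleles):
    """True if the alleles are all called and not all identical."""
    return "." not in alleles and len(set(alleles)) > 1


def keep_phased_het(lines):
    """Keep header lines plus records that are phased and heterozygous."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        format_keys = fields[8].split(":")    # FORMAT column
        sample_values = fields[9].split(":")  # first sample column
        gt = sample_values[format_keys.index("GT")]
        alleles, phased = parse_gt(gt)
        if phased and is_heterozygous(alleles):
            kept.append(line)
    return kept


# Toy records (hypothetical, for illustration only).
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1",       # phased het: kept
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/1",       # unphased: dropped
    "chr1\t300\t.\tG\tA\t.\tPASS\t.\tGT\t1|1",       # homozygous: dropped
    "chr1\t400\t.\tT\tC\t.\tPASS\t.\tGT:DP\t1|0:12",  # phased het: kept
]

filtered = keep_phased_het(records)
kept_positions = [
    int(line.split("\t")[1]) for line in filtered if not line.startswith("#")
]
print(kept_positions)  # [100, 400]
```

If the downstream parser only accepts `/`, the separator in the kept records would have to be rewritten before use; that conversion is left out here because the expected format has not been confirmed.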

nservant avatar May 11 '21 07:05 nservant

Thanks for your quick reply! I understand better now~ : )

Jiang

JFF1594032292 avatar May 11 '21 07:05 JFF1594032292