using phased-vcf file as input for MSMC2

Open Niloofar-Alaei opened this issue 2 years ago • 2 comments
Hi, I want to run MSMC2 on my dataset, which consists of phased VCF files: one multi-sample VCF file with 26 samples for each chromosome (e.g. Chr10.vcf.gz).

To use my VCF files as input for MSMC2, I did the following:

First, I used bcftools to produce a separate VCF file for each sample (e.g. sample1.Chr10.vcf.gz). Second, I used vcfAllSiteParser.py to produce the .bed files. Third, I ran generate_multihetsep.py to merge the VCF and mask files together. Note that I did not run the phasing step, because I assumed my dataset was already phased.

However, I received an error in the last step, when I ran msmc2 to estimate the effective population size. I also noticed that the resulting multihetsep.txt files (e.g. Chr10.multihetsep.txt) are very large.

My question is, should I run the phasing step too?

I would really appreciate your help in identifying the problem.

With best regards, Niloo

Niloofar-Alaei avatar Jun 15 '23 14:06 Niloofar-Alaei

As discussed via email, I think the issue is that your phased VCF is not recognised as being phased. Phased genotypes require a notation like 0|1 or 1|0. If you have 0/1 instead, the genotype is treated as unphased, which leads to combinatorially many possible combinations and makes the resulting file far too large.
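A quick way to check this before running generate_multihetsep.py is to count how the heterozygous genotypes in the VCF are written. The following is only a minimal sketch, not part of msmc-tools: the function names `classify_genotype` and `count_phasing` are made up for illustration, and it assumes a standard tab-separated VCF with a `GT` entry in the FORMAT column.

```python
from collections import Counter


def classify_genotype(gt: str) -> str:
    """Classify one VCF GT string (hypothetical helper, not from msmc-tools)."""
    if "|" in gt:
        sep = "|"
    elif "/" in gt:
        sep = "/"
    else:
        return "other"  # haploid or bare missing call such as "." or "1"
    alleles = gt.split(sep)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        return "hom"  # phasing does not matter for homozygous calls
    return "phased_het" if sep == "|" else "unphased_het"


def count_phasing(vcf_lines) -> Counter:
    """Count genotype classes over all samples in an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        for sample in fields[9:]:
            parts = sample.split(":")
            gt = parts[gt_idx] if gt_idx < len(parts) else "."
            counts[classify_genotype(gt)] += 1
    return counts
```

To try it on a real file, the lines can be read with `gzip.open("sample1.Chr10.vcf.gz", "rt")` and passed to `count_phasing`. If `unphased_het` is large while `phased_het` is zero or close to it, the file is most likely being treated as unphased.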

stschiff avatar Jun 22 '23 11:06 stschiff

Yes, we discussed it, and I also checked my phased VCF files. The problem comes from their format, and I am working on fixing it.

Many thanks for your help

Niloo



Niloofar-Alaei avatar Jun 22 '23 11:06 Niloofar-Alaei