Unable to infer the A and B alleles while parsing the site...

Open AkshajD opened this issue 9 months ago • 11 comments

I have a batch of VCF files from an array to which I am trying to add ALLELE_A and ALLELE_B so that I can run them through MoChA. I used the mochatools command shown below to do so:

bcftools +mochatools $input -- -t ALLELE_A,ALLELE_B,GC -f $reference > $output

I am getting the error "Unable to infer the A and B alleles while parsing the site:" for all non-0/0 sites. Can you please offer some advice on why this might be the case and how to fix it?

P.S. Not sure if this is relevant, but the VCFs were pre-generated by the org that provides our dataset, and we had to add the LRR and BAF fields manually afterwards.

AkshajD avatar Nov 11 '23 17:11 AkshajD

The BCFtools/mochatools plugin will infer which allele is the A allele and which is the B allele as long as at least one homozygous AA or one homozygous BB genotype is observed. Sites at which all samples are heterozygous cannot be inferred. It is simply not possible to do so. If you have enough samples in the VCF, this should not be a problem. Are you running the tool on a single-sample VCF? My advice is to go back to the org that provides your dataset and tell them to do the right thing and give you the IDAT files (or CEL files if it is Affymetrix data).
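
A quick way to check how many samples a VCF contains is to count the lines printed by `bcftools query -l`. The helper below is only a sketch: the function name `count_samples` is hypothetical, and it assumes `bcftools` is on the PATH.

```shell
# Hypothetical helper: print the number of samples in a VCF/BCF file.
# `bcftools query -l` lists one sample name per line.
count_samples() {
  bcftools query -l "$1" | wc -l | tr -d ' '
}

# Usage, with $input set to the VCF being annotated:
#   count_samples "$input"
```

A result of 1 means the file is a single-sample VCF, in which case heterozygous sites cannot be resolved.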

freeseek avatar Nov 13 '23 15:11 freeseek

I have attempted running it with both a single VCF (which I now understand why it would give an error), and then with a test VCF with 10 samples. The error persists.

Is this just a result of still not having a sufficient number of samples? Just want to resolve the issue before we run the algo to add LRR and BAF to tons of different VCFs.

Will try to follow up with the org but have had low success with them about this issue in the past.

AkshajD avatar Nov 14 '23 03:11 AkshajD

With 10 samples in a VCF, for a very common variant with minor allele frequency close to 0.5, there is still a ~1/1,000 chance that all samples will be heterozygous. So it is still possible that you will not be able to infer which one is ALLELE_A and which one is ALLELE_B for a few markers. To be safe, I think you need a VCF with at least ~30 samples from independent participants. Otherwise it is just not possible to retrieve this information. Remember that the root of the issue here is that the org that provides your dataset tossed that information away. This is not a limitation of MoChA.
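
The ~1/1,000 figure can be reproduced with a short calculation. The sketch below assumes Hardy-Weinberg equilibrium and independent samples (both are assumptions made here for illustration), so the chance that one sample is heterozygous is 2p(1-p) and the chance that all n are heterozygous is that value raised to the power n. The function name `all_het_prob` is hypothetical.

```shell
# Probability that all n independent samples are heterozygous at a biallelic
# site with allele frequency p, assuming Hardy-Weinberg equilibrium.
all_het_prob() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3g\n", (2 * p * (1 - p)) ^ n }'
}

all_het_prob 0.5 10   # prints 0.000977 (about 1 in 1,000)
all_het_prob 0.5 30   # prints 9.31e-10
```

With 30 samples the probability becomes negligible even in the worst case of an allele frequency of 0.5.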

freeseek avatar Nov 14 '23 16:11 freeseek

Hello, Figure 1 shows the VCF produced by converting a .gtc file, which itself comes from an .idat file. It is different from the basic VCF format. Could you tell me how to add the ALLELE A/ALLELE B/GC/LRR/BAF fields shown in Figure 2?

Tianwen-lab-star avatar Dec 31 '23 16:12 Tianwen-lab-star

BCFtools/gtc2vcf can automatically add ALLELE A/ALLELE B/GC/LRR/BAF when you convert a .gtc file. I have no idea what you are referring to when you say basic VCF format. One thing is for sure: if a VCF does not have LRR/BAF information, then there is no way to "add" this information.
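
To see whether a VCF carries LRR/BAF at all, one option is to look for the corresponding FORMAT declarations in the header. This is a minimal sketch: the function name `has_format_field` is hypothetical, and it assumes `bcftools` is on the PATH and that the fields are declared as FORMAT fields named `LRR` and `BAF`.

```shell
# Hypothetical helper: succeed if the header of file $1 declares FORMAT field $2.
has_format_field() {
  bcftools view -h "$1" | grep -q "^##FORMAT=<ID=$2,"
}

# Usage, with $input set to the VCF in question:
#   if has_format_field "$input" LRR && has_format_field "$input" BAF; then
#     echo "intensity fields declared"
#   else
#     echo "no LRR/BAF in this VCF"
#   fi
```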

freeseek avatar Jan 02 '24 00:01 freeseek

Hello, sorry to bother you. I have another problem. When I run the SHAPEIT step, it says that there is no AC field. But my VCF file was converted from GTC files, so how should I get past this step?

Tianwen-lab-star avatar Jan 03 '24 09:01 Tianwen-lab-star

SHAPEIT5, unlike SHAPEIT4, requires the AC and AN fields to be filled. You can quickly fill them with either of the following BCFtools commands:

bcftools view -c 0
bcftools +fill-AN-AC
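
As a fuller sketch of the plugin variant, with input and output files spelled out: the function names and file arguments below are hypothetical placeholders, and `bcftools` with its plugins is assumed to be installed.

```shell
# Hypothetical wrapper: write a copy of $1 with INFO/AN and INFO/AC filled to $2 (BCF).
fill_ac_an() {
  bcftools +fill-AN-AC "$1" -Ob -o "$2"
}

# Hypothetical check: succeed if the header of $1 declares both INFO/AC and INFO/AN.
has_ac_an() {
  hdr=$(bcftools view -h "$1")
  printf '%s\n' "$hdr" | grep -q '^##INFO=<ID=AC,' &&
    printf '%s\n' "$hdr" | grep -q '^##INFO=<ID=AN,'
}

# Usage:
#   fill_ac_an input.bcf output.bcf && has_ac_an output.bcf
```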

freeseek avatar Jan 03 '24 14:01 freeseek

Thank you. Sounds like 5 is a bit more complicated than 4. I've tried a lot of methods found online to build SHAPEIT4, but none of them succeeded. Could you provide a SHAPEIT4 binary that has already been compiled?

Tianwen-lab-star avatar Jan 03 '24 16:01 Tianwen-lab-star

SHAPEIT4 and phase_common from SHAPEIT5 are identical other than the latter requiring the AC and AN fields, with the advantage that SHAPEIT5 can handle trios. You can find binaries for SHAPEIT5 here. In the past, to generate binaries for SHAPEIT4, I used the following Dockerfile:

FROM debian:testing-slim
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get -qqy update --fix-missing && \
    apt-get -qqy install --no-install-recommends \
                 wget \
                 g++ \
                 make \
                 libboost-iostreams-dev \
                 libboost-program-options-dev \
                 libhts-dev \
                 libbz2-dev \
                 libssl-dev \
                 libboost-iostreams1.74.0 \
                 libboost-program-options1.74.0 \
                 bcftools && \
    wget --no-check-certificate https://github.com/odelaneau/shapeit4/archive/v4.2.2.tar.gz && \
    tar xzf v4.2.2.tar.gz && \
    cd shapeit4-4.2.2 && \
    sed -i 's/^HTSLIB_INC=\$(HOME)\/Tools\/htslib-1.11$/HTSLIB_INC=-Ihtslib/' makefile && \
    sed -i 's/^HTSLIB_LIB=\$(HOME)\/Tools\/htslib-1.11\/libhts.a$/HTSLIB_LIB=-lhts/' makefile && \
    sed -i 's/^BOOST_LIB_IO=\/usr\/lib\/x86_64-linux-gnu\/libboost_iostreams.a$/BOOST_LIB_IO=-lboost_iostreams/' makefile && \
    sed -i 's/^BOOST_LIB_PO=\/usr\/lib\/x86_64-linux-gnu\/libboost_program_options.a$/BOOST_LIB_PO=-lboost_program_options/' makefile && \
    make && \
    mv bin/shapeit4.2 /usr/bin/ && \
    cd .. && \
    apt-get -qqy purge --auto-remove --option APT::AutoRemove::RecommendsImportant=false \
                 wget \
                 g++ \
                 make \
                 libboost-iostreams-dev \
                 libboost-program-options-dev \
                 libhts-dev \
                 libbz2-dev \
                 libssl-dev && \
    apt-get -qqy clean && \
    rm -rf v4.2.2.tar.gz \
           shapeit4-4.2.2 \
           /var/lib/apt/lists/*

freeseek avatar Jan 03 '24 17:01 freeseek

Hello, when I try to add ALLELE_A or ALLELE_B to a VCF file using the above code, I get an error: "Error: BAF format field is not present, cannot infer ALLELE_A or ALLELE_B". The VCF files were genotyped and exported by Axiom™ Analysis Suite.

Tianwen-lab-star avatar Mar 15 '24 12:03 Tianwen-lab-star

Your VCF does not include intensity data so it would be pointless to identify which one is the A allele and which one is the B allele. I would advise you to go back to the table data generated by the Affymetrix Power Tools when you genotyped your samples and then use BCFtools/affy2vcf to generate a VCF with BAF, LRR, ALLELE_A, and ALLELE_B. Then you don't have to worry about file formatting issues

freeseek avatar Mar 15 '24 13:03 freeseek