shapeit5 icon indicating copy to clipboard operation
shapeit5 copied to clipboard

Phase_common: AC field is needed in file

Open AkshajD opened this issue 2 years ago • 4 comments

I am running phase_common on a BCF file (as well as variants to exclude in another BCF) acquired from an Axiom array.

While running, it gives an error stating that AC field is needed in file for each of the chromosomes. A sample of this is included below.

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected]
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 03/07/2023 - 15:32:31

Files:
  * Input         : [PATH/output/CANPREDDICT.FINAL_GENOTYPING_QC_2021_03_25.AxiomGT1.MoChA.chrX.bcf]
  * Reference     : [PATH/GRCh37/ALL.chrX.phase3_integrated.20130502.genotypes.bcf]
  * Genetic Map   : [PATH/output/genetic_map.chrX.txt]
  * Output        : [PATH/output/CANPREDDICT.FINAL_GENOTYPING_QC_2021_03_25.AxiomGT1.MoChA.chrX.pgt.bcf]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 4 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Recombination rates given by genetic map]

Reading genotype data:

ESC[31mERROR: ESC[0mAC field is needed in file

Since I am still fairly new to genotyping pipelines, I am unfamiliar with what the AC field is and how I can make sure it is present in my file (as well as which file (aka the input BCF or one of the references/genetic maps) needs it).

Would appreciate any advice, thank you!

AkshajD avatar Jul 03 '23 20:07 AkshajD

Hi,

AC field is required because it is the Allele Count field, which is used to compute allele frequency and keep only common variants above the selected threshold.

If your input data does not have this AC field, you can compute it using the bcftools fill-tags plugin (https://samtools.github.io/bcftools/howtos/plugin.fill-tags.html)

Best, Robin

RJHFMSTR avatar Jul 10 '23 09:07 RJHFMSTR

Notice that +fill-tags can be up to 10x slower than +fill-AN-AC, so the latter is recommended. You can also use bcftools view -c 0 to fill the AN and AC fields though this recomputes AN and AC only if they are not already present. Why is it that you have to have AN and AC though? Couldn't SHAPEIT5 use the bcf_calc_ac() function from htslib/htslib/vcfutils.h to compute AN and AC? This is a function that would allow to use existing AN and AC values if present and compute them from genotypes otherwise:

/**
 *  bcf_calc_ac() - calculate the number of REF and ALT alleles
 *  @header:  for access to BCF_DT_ID dictionary
 *  @line:    VCF line obtained from vcf_parse1
 *  @ac:      array of length line->n_allele
 *  @which:   determine if INFO/AN,AC and indv fields be used
 *
 *  Returns 1 if the call succeeded, or 0 if the value could not
 *  be determined.
 *
 *  The value of @which determines if existing INFO/AC,AN can be
 *  used (BCF_UN_INFO) and and if indv fields can be split (BCF_UN_FMT).
 */
HTSLIB_EXPORT
int bcf_calc_ac(const bcf_hdr_t *header, bcf1_t *line, int *ac, int which);

And this is how the function is used in bcftools/plugins/fill-AN-AC.c:

bcf1_t *process(bcf1_t *rec)
{
    hts_expand(int,rec->n_allele,marr,arr);
    int ret = bcf_calc_ac(in_hdr,rec,arr,BCF_UN_FMT);
    if ( ret>0 )
    {
        int i, an = 0;
        for (i=0; i<rec->n_allele; i++) an += arr[i];
        bcf_update_info_int32(out_hdr, rec, "AN", &an, 1);
        bcf_update_info_int32(out_hdr, rec, "AC", arr+1, rec->n_allele-1);
    }
    return rec;
}

freeseek avatar Jul 18 '23 17:07 freeseek

I am relatively new to bioinformatics. I would like to do haplotype phasing using phase_common (shapeit5) before carrying out genotype imputation using Beagle 5.4 but I am getting an ERROR: AC field is needed in file. The code I ran is shown below: `[SHAPEIT5] phase_common (jointly phase multiple common markers)

  • Author : Olivier DELANEAU, University of Lausanne

  • Contact : [email protected]

  • Version : 5.1.1 / commit = 5.1.1 / release = 2023-05-16

  • Run date : 17/09/2023 - 11:50:19

Files:

  • Input : [/home/mubita/bioinformatics/input_shapeit/pk_hg19_chr6_tags.bcf]

  • Reference : [/home/mubita/bioinformatics/input_shapeit/chr6.1kg.phase3.v5a.bcf.gz]

  • Genetic Map : [/home/mubita/bioinformatics/input_shapeit/plink.chr6.GRCh37.map.txt]

  • Output : [/home/mubita/bioinformatics/ouput_shapeit/pk_hg19_chr6_tags.phased.bcf]

  • Output format : [bcf]

Parameters:

  • Seed : 15052011

  • Threads : 8 threads

  • MCMC : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]

  • PBWT : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]

  • HMM : [window = 4cM / Ne = 15000 / Recombination rates given by genetic map]

Reading genotype data: ERROR: AC field is needed in file`

I do not understand why this error happened because I filled tags to my input file using the code: bcftools +fill-tags pk_hg19_chr6.bcf -Ob -o pk_hg19_chr6_tags.bcf -- -t all

The header of the vcf file has all the necessary tags e.g. ##fileformat=VCFv4.3

##Contig=<ID=6,length=160443727>

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele">

##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency for each ALT allele in the same order as listed (estimated from primary data, not called genotypes)">

##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

##INFO=<ID=ReverseComplementedAlleles,Number=0,Type=Flag,Description="The REF and the ALT alleles have been reverse complemented in liftover since the mapping from the previous reference to the current one was on the negative strand.">

##INFO=<ID=SwappedAlleles,Number=0,Type=Flag,Description="The REF and the ALT alleles have been swapped in liftover due to changes in the reference. It is possible that not all INFO annotations reflect this swap, and in the genotypes, only the GT, PL, and AD fields have been modified. You should check the TAGS_TO_REVERSE parameter that was used during the LiftOver to be sure.">

##contig=<ID=6,length=171115067>

##filedate=20230825

##reference=file:/home/mubita/bioinformatics/input_liftover/Homo_sapiens.GRCh37.dna.chromosome.6.fa

##source=VCF-simplify

In addition, an excerpt of the vcf file before filling the tags looks like this: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 101_Ov 102_Ov 103_Ov 104_Ov 107_Ov 110_Ov 113_Ov 114_Ov 115_Ov 117_Ov 118_Ov 121_Ov 122_Ov 124_Ov 127_Ov 128_Ov 129_Ov 130_Ov 132_Ov 133_Ov 134_Ov 136_Ov 138_Ov 139_Ov 141_Ov 144_Ov 147_Ov 148_Ov 150_Ov 152_Ov 153_Ov 155_Ov 156_Ov 157_Ov 159_Ov 162_Ov 163_Ov 165_Ov 166_Ov 167_Ov 168_Ov 169_Ov 170_Ov 171_Ov 172_Ov 173_Ov 177_Ov 178_Ov 179_Ov 181_Ov 185_Ov 187_NO 188_NO 189_NO 192_NO 194_NO 196_NO 198_NO 199_NO 201_NO 206_NO 207_NO 210_NO 211_NO 213_NO 214_NO 215_NO 216_NO 217_NO 219_NO 223_NO 224_NO 225_NO 226_NO 227_NO 231_NO 234_NO 235_NO 238_NO 239_NO 240_NO 241_NO 244_NO 245_NO 246_NO 248_NO 249_NO 250_NO 254_NO 255_NO

6 160557643 rs2282143 C T . PASS AA=C;AC=33;AF=0.1138;AN=290;NS=145 GT 0/1 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/1 0/0 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/1 0/0 0/1 0/0 0/0 1/1 0/0 0/1 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 ./. 0/1 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0

6 160560881 rs72552763 ATGA A . PASS AA=ATGA;AC=9;AF=0.03061;AN=294;NS=147 GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/1 0/1 0/0 0/0 0/0

Given this information, how can I get rid of this error so that shapeit (phase_common) runs successfully?

mubita3 avatar Sep 17 '23 10:09 mubita3

@mubita3 Not sure if your issue is resolved, but for those still looking for an answer, both your input and reference panel files need the AC field.

ndokmai avatar Dec 15 '23 17:12 ndokmai