Unnecesary warning during liftover of each variant?
When runing a liftover+ from 37 to 38, a warning is displayed for each variant (N=17697744/). If I ignore the warnings and let it continue, the liftover runs to completion and creates a b38 vcf. The command line is pointing to 2 references with -s and -f so the warning is not expected. The liftover does seem to be swapping alleles and rejecting variants. Any ideas on how to eliminate this warning? Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position chr22:40552657 Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position chr22:40554777 Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position chr22:40559560 Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position chr22:40568384 Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position chr22:40611472 Lines total/swapped/reference added/rejected: 17697744/10897/8803/2421 Merging 9 temporary files Cleaning Done
You need both the b37 (-s) and the b38 (-f) references as input, or BCFtools/liftover will not be able to handle some variants properly. For SNP this might not be too important (though some will be lost because of this) but it is still highly recommended to provide both options. I could add a fix to make sure the warning is not displayed too many times
But I am using both options (-s -f) .
bcftools +liftover --no-version $VCF37 -- \
-s /restricted/projectnb/casa/ref/human_g1k_v37_decoy.fasta \
-f $REF_DIR/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-c $REF_DIR/GRCh37_to_GRCh38.chain.gz \
--reject $VCF_reject\
--reject-type b \
--write-src | \
bcftools sort -Oz -o $VCF38
There warning is generated in the code below in liftover.c. What is this code trying to do? What are maximally extended Alleles?
static int find_reference(bcf1_t *rec, const bcf_hdr_t *hdr, const char *ref, int is_snp, kstring_t *str) {
int i, swap = -1;
int ref_len = strlen(ref);
for (i = 0; i < rec->n_allele; i++) {
if (rec->d.allele[i][0] == '*') continue;
int len = strlen(rec->d.allele[i]);
if (len != ref_len) continue;
if (ref_len == 1 ? (ref[0] == rec->d.allele[i][0]) : !strncasecmp(ref + 1, rec->d.allele[i] + 1, len - 2)) {
if (swap >= 0) {
// two matches can only happen if alleles were not maximally extended
fprintf(stderr,
"Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the "
"reference allele at position %s:%" PRIhts_pos "\n",
bcf_seqname(hdr, rec), rec->pos + 1);
break;
} else {
swap = i;
}
}
}
Can you show me some examples of a variants that are giving you this warning?
I ran into the same problem.
bcftools +liftover -vv -Ou snps_hg38.vcf.gz -- -s GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -f human_g1k_v37.fasta -c GRCh38_to_GRCh37.chain.gz --reject reject.vcf --reject-type u --write-src | bcftools sort -o output.vcf -Oz plugin directory /usr/libexec/bcftools .. ok /usr/libexec/bcftools/liftover.so: plugin open .. ok init .. ok Writing to /tmp/bcftools.0Xiau4 INFO/QS is handled by AGR rule FORMAT/PL is handled by AGR rule Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position Y:6134697
chrY 6134697 MF174079 G A 0 PASS DP=100;NS=2;QS=0,1;MQ0F=1;AA=G;HG="O";ISOGG="O" GT:PL 0/0:0,0,255 1/1:255,0,0
I have tried to liftover that variant with BCFtools/liftover 2024-09-27 and it works fine:
echo -e "##fileformat=VCFv4.2\n##contig=<ID=chrY>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchrY\t6134697\t.\tG\tA\t.\t.\t." | \
bcftools +liftover -- -s GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
-f human_g1k_v37.fasta \
-c GRCh38_to_GRCh37.chain.gz | \
bcftools view -H
Y 6002738 . G A . . .
What version of BCFtools/liftover are you using?
bcftools +liftover -V bcftools 1.21 using htslib 1.21 plugin at 1.21 using htslib 1.21
The original VCF is here:
https://ybrowse.org/gbrowse2/gff/snps_hg38.vcf.gz
It is because you have miscoded monomorphic variants:
bcftools annotate -x ID,PASS,INFO -t chrY:6266656-6266656 snps_hg38.vcf.gz | bcftools view -GH
chrY 6266656 . G G 0 PASS .
chrY 6266656 . G T 0 PASS .
Indeed if you try to use BCFtools/norm it will tell you:
bcftools norm -f GCA_000001405.15_GRCh38_no_alt_analysis_set.fna snps_hg38.vcf.gz -t chrY:6266656-6266656 -o /dev/null
Duplicate alleles at chrY:6266656; run with -cw to turn the error into warning or with -cs to fix.
I will add a better error message in a future release, but for now you should first fix your VCF until BCFtools/norm finds it compliant
Thank you!