score icon indicating copy to clipboard operation
score copied to clipboard

Unnecesary warning during liftover of each variant?

Open jjfarrell opened this issue 1 year ago • 8 comments

When runing a liftover+ from 37 to 38, a warning is displayed for each variant (N=17697744/). If I ignore the warnings and let it continue, the liftover runs to completion and creates a b38 vcf. The command line is pointing to 2 references with -s and -f so the warning is not expected. The liftover does seem to be swapping alleles and rejecting variants. Any ideas on how to eliminate this warning? Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position chr22:40552657 Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position chr22:40554777 Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position chr22:40559560 Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position chr22:40568384 Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position chr22:40611472 Lines total/swapped/reference added/rejected: 17697744/10897/8803/2421 Merging 9 temporary files Cleaning Done

jjfarrell avatar Dec 21 '24 13:12 jjfarrell

You need both the b37 (-s) and the b38 (-f) references as input, or BCFtools/liftover will not be able to handle some variants properly. For SNP this might not be too important (though some will be lost because of this) but it is still highly recommended to provide both options. I could add a fix to make sure the warning is not displayed too many times

freeseek avatar Dec 21 '24 14:12 freeseek

But I am using both options (-s -f) .

bcftools +liftover --no-version  $VCF37  -- \
  -s  /restricted/projectnb/casa/ref/human_g1k_v37_decoy.fasta \
  -f $REF_DIR/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -c $REF_DIR/GRCh37_to_GRCh38.chain.gz \
  --reject  $VCF_reject\
  --reject-type b \
  --write-src | \
  bcftools sort -Oz -o $VCF38

There warning is generated in the code below in liftover.c. What is this code trying to do? What are maximally extended Alleles?

static int find_reference(bcf1_t *rec, const bcf_hdr_t *hdr, const char *ref, int is_snp, kstring_t *str) {
    int i, swap = -1;
    int ref_len = strlen(ref);
    for (i = 0; i < rec->n_allele; i++) {
        if (rec->d.allele[i][0] == '*') continue;
        int len = strlen(rec->d.allele[i]);
        if (len != ref_len) continue;
        if (ref_len == 1 ? (ref[0] == rec->d.allele[i][0]) : !strncasecmp(ref + 1, rec->d.allele[i] + 1, len - 2)) {
            if (swap >= 0) {
                // two matches can only happen if alleles were not maximally extended
                fprintf(stderr,
                        "Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the "
                        "reference allele at position %s:%" PRIhts_pos "\n",
                        bcf_seqname(hdr, rec), rec->pos + 1);
                break;
            } else {
                swap = i;
            }
        }
    }

jjfarrell avatar Dec 21 '24 18:12 jjfarrell

Can you show me some examples of a variants that are giving you this warning?

freeseek avatar Jan 06 '25 18:01 freeseek

I ran into the same problem.

bcftools +liftover -vv -Ou snps_hg38.vcf.gz -- -s GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -f human_g1k_v37.fasta -c GRCh38_to_GRCh37.chain.gz --reject reject.vcf --reject-type u --write-src | bcftools sort -o output.vcf -Oz plugin directory /usr/libexec/bcftools .. ok /usr/libexec/bcftools/liftover.so: plugin open .. ok init .. ok Writing to /tmp/bcftools.0Xiau4 INFO/QS is handled by AGR rule FORMAT/PL is handled by AGR rule Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position Y:6134697

chrY 6134697 MF174079 G A 0 PASS DP=100;NS=2;QS=0,1;MQ0F=1;AA=G;HG="O";ISOGG="O" GT:PL 0/0:0,0,255 1/1:255,0,0

teepean avatar Mar 07 '25 12:03 teepean

I have tried to liftover that variant with BCFtools/liftover 2024-09-27 and it works fine:

echo -e "##fileformat=VCFv4.2\n##contig=<ID=chrY>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchrY\t6134697\t.\tG\tA\t.\t.\t." | \
  bcftools +liftover -- -s GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -f human_g1k_v37.fasta \
  -c GRCh38_to_GRCh37.chain.gz | \
  bcftools view -H
Y	6002738	.	G	A	.	.	.

What version of BCFtools/liftover are you using?

freeseek avatar Mar 07 '25 17:03 freeseek

bcftools +liftover -V bcftools 1.21 using htslib 1.21 plugin at 1.21 using htslib 1.21

The original VCF is here:

https://ybrowse.org/gbrowse2/gff/snps_hg38.vcf.gz

teepean avatar Mar 07 '25 17:03 teepean

It is because you have miscoded monomorphic variants:

bcftools annotate -x ID,PASS,INFO -t chrY:6266656-6266656 snps_hg38.vcf.gz | bcftools view -GH
chrY	6266656	.	G	G	0	PASS	.
chrY	6266656	.	G	T	0	PASS	.

Indeed if you try to use BCFtools/norm it will tell you:

bcftools norm -f GCA_000001405.15_GRCh38_no_alt_analysis_set.fna snps_hg38.vcf.gz -t chrY:6266656-6266656 -o /dev/null
Duplicate alleles at chrY:6266656; run with -cw to turn the error into warning or with -cs to fix.

I will add a better error message in a future release, but for now you should first fix your VCF until BCFtools/norm finds it compliant

freeseek avatar Mar 07 '25 17:03 freeseek

Thank you!

teepean avatar Mar 07 '25 18:03 teepean