nano-snakemake

Problem in the sort_vcf rule

Open tdido opened this issue 5 years ago • 2 comments

Hi Wouter.

This may not necessarily be a problem with the pipeline, but maybe you can give me a hand with it.

Most of the vcf files can be sorted without problems, but I'm getting an error with one that comes from "pbsv_combined". You can find the file here: https://file.io/FfYklD

The error I'm getting is this:

[E::vcf_parse_format] Incorrect number of FORMAT fields at GL000208.1:1

It's directly reproducible by running

bcftools sort genotypes.vcf

Any tips on what's going on would be greatly appreciated.

tdido, Jan 29 '20 14:01

I think SURVIVOR is to blame here (tagging @fritzsedlazeck). In the FORMAT field of a BND/TRA position you get:

GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO
0/1:NA:47079723:0,0:++:.:TRA:pbsv.BND.GL000208.1:1-chr5:47079724:NA:NA:GL000208.1_1-chr5_47079724

Note that the ID value in the sample column also contains : (in the coordinates), which is the delimiter between FORMAT subfields. The record therefore parses as having more values than the FORMAT column declares keys.

You can inspect this in your own data with, for example:

cat genotypes.vcf | grep -v '^#' | grep GL000208.1 | head -n 1 | cut -f9,12 | tr '\t' '\n' | tr ':' '\t' | column -ts $'\t' | less -S
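
To scan the whole file rather than a single record, a minimal awk sketch along these lines might help. The function name check_format_fields is made up for illustration, and it assumes a tab-separated VCF with FORMAT in column 9 and samples from column 10 onward. It only flags sample columns with more values than FORMAT keys, since the VCF specification allows trailing subfields to be dropped.

```shell
# Hypothetical helper: list records where a sample column holds more
# ':'-separated values than the FORMAT column (column 9) declares keys.
check_format_fields() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        {
            nkeys = split($9, keys, ":")       # number of FORMAT keys
            for (i = 10; i <= NF; i++) {       # every sample column
                nvals = split($i, vals, ":")
                if (nvals > nkeys)
                    printf "%s:%s sample_col=%d keys=%d values=%d\n", $1, $2, i, nkeys, nvals
            }
        }
    ' "$@"
}

# Usage: check_format_fields genotypes.vcf
```

For the record quoted above, this counts 11 FORMAT keys against 13 values, because the ID value contributes two extra colons.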

wdecoster, Jan 30 '20 07:01

Right, I see it now. Let's see what Fritz thinks about it. Thanks!

tdido, Jan 30 '20 08:01