bcftools
bcftools copied to clipboard
Conversion from standard VCF to Local-alleles VCF
This is a request for improved documentation.
Doing a hunt through the documentation I find local alleles mentioned in one section only:
$ egrep -n -i "local.*allele" ../bcftools/doc/*.txt
2008:*-L, --local-alleles* 'INT'::
2012: and alternate alleles. The *-L, --local-alleles* option allows replacement of such tags
2013: with a localized tag (FORMAT/LPL) which only includes a subset of alternate alleles relevant
This is the -L option of bcftools merge
. The usage for this claims:
-L, --local-alleles INT If more than INT alt alleles are encountered, drop FMT/PL and output LAA+LPL instead; 0=unlimited [0]
It's unclear how we can use this as a conversion tool for an existing file. I can see two possible routes.
-
Take an existing many-sample VCF and split into potentially thousands of per-sample files, and then merge them back together again. This seems excessive and probably extremely time consuming. Plus unlike samtools, there's no
bcftools split
subcommand, so I assume this would be many passes through the data withbcftools view -S
which feels like it'll end up be O(N^2) complexity in the number of samples. Or am I missing a split command under a different name? [Edit: yes - see below. I found a plugin later on] -
Merging the existing VCF with an empty VCF. I'm unsure of how to go about this.
I tried method 2 initially as it seemed the most viable, but immediately hit an error.
$ echo '##fileformat=VCFv4.2' > _
$ bcftools merge -L 0 /nfs/users/nfs_j/jkb/scratch/data/gnomad-g1k.10k.vcf _
Error: "--local-alleles 0" makes no sense, expected value bigger or equal than 1
Yet the usage statement claims 0 means unlimited. (And confusingly implies it's the default via [0]
, although it has no default as it's a required argument.)
Not deterred, I soldiered on and took it at its word and went with 1.
[Incidentally - a feature request. Please permit tools to work on uncompressed data when we're reading the entire file. It's tedious to have to compress and index everything when I know everything is in the same order already.]
I also hit problems with not being sure what a valid empty VCF file looks like. Is there a command to take the header from an existing VCF and strip off all the samples, so we get an empty one? I did it with egrep and echoes in the end.
After compressing and indexing that, my merge now produces LAA in the header, but as far as I can see it's totally absent in the output file.
So am I correct in assuming now that it's not possible by my guess number 2 above, and the only route to producing LAA is a full sample by sample split and then a remerge back together again?
The back story to all of this is I just want to be able to produce a baseline from an existing merged file that's not using LAA to see how much better it can get, before looking into other formats and other tools. Help would be appreciated, and I'm sure it's not just beneficial to me (hence why putting the issue here rather than direct email).