bcftools
bcftools copied to clipboard
bcftools-1.19 csq: output is not compatible with previous --local-csq format
Hi Peter,
I wrote a simple parser for the consequences output by the --local-csq
code. It turned out the haplotype-aware code outputs way different format. I am guessing there are some pointers to previous lines (see e.g. @988
below) which make processing harder as I would need to cache previous consequence (e.g. missense
value). But it also seems two resulting consequences are on a single line. So in summary, the --local-csq
has output everything (actualy three) consenquences on a single line. The haplotype aware caller has ouput 7 consequences spanning 3 lines if I am counting correctly (yes, after reformatting through bcftools +split-vep
).
$ bcftools csq -c CSQ -f ref.fasta -g ref.gff3 -p m -n 150 -s - testcase.sorted.vcf' | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1' | bcftools +split-vep -f '%CHROM\t%POS\t%Consequence\t%amino_acid_change\t%dna_change\t%DP\t%ADF\t%ADR\t%AD\t%CSQ\t[ %VAF]\t[ %VAF1]\n' | grep
-v '^#' | head
Parsing ref.gff3 ...
Indexed 1 transcripts, 1 exons, 1 CDSs, 0 UTRs
Calling...
Warning: fewer CSQ fields than expected at ref:989, filling with dots. This warning is printed only once.
ref 988 missense 4T>4H 988A>C+989C>A+990A>C 641908 320881,23,14,8 0,0,0,0 320881,23,14,8 missense|SOMEGENE|12345678|protein_coding|+|4T>4H|988A>C+989C>A+990A>C 7.16676e-05,4.36238e-05,2.49279e-05 0.000140219
ref 989 @988 . . 641908 320884,25,12,7 0,0,0,0 320884,25,12,7 @988 7.78991e-05,3.73916e-05,2.18117e-05 0.000137102
ref 990 @988 . . 641466 319802,815,73,44 0,0,0,0 319802,815,73,44 @988 0.00254105,0.000227603,0.000137185 0.00290583
ref 991 missense 5A>5Y 991G>T+992C>A+993T>C 641720 320684,100,34,43 0,0,0,0 320684,100,34,43 missense|SOMEGENE|12345678|protein_coding|+|5A>5Y|991G>T+992C>A+993T>C 0.000311661,0.000105965,0.000134014 0.000551641
ref 992 @991 . . 641550 320685,82,6,4 0,0,0,0 320685,82,6,4 @991 0.000255629,1.87046e-05,1.24697e-05 0.000286804
ref 993 @991 . . 641138 320448,37,78,8 0,0,0,0 320448,37,78,8 @991 0.000115419,0.000243316,2.49555e-05 0.00038369
ref 994 missense 6S>6V 994A>G+995G>T+996C>A 641516 320604,48,92,16 0,0,0,0 320604,48,92,16 missense|SOMEGENE|12345678|protein_coding|+|6S>6V|994A>G+995G>T+996C>A 0.000149645,0.000286819,4.98815e-05 0.000486345
ref 995 @994 . . 641772 320742,133,10,3 0,0,0,0 320742,133,10,3 @994 0.000414475,3.11635e-05,9.34906e-06 0.000454987
ref 996 @994 . . 641018 320371,115,13,12 0,0,0,0 320371,115,13,12 @994 0.000358802,4.05602e-05,3.74402e-05 0.000436802
ref 997 missense 7H>7S 997C>T+998A>C+999T>G 641906 320448,26,38,6 0,0,0,0 320448,26,38,6 missense|SOMEGENE|12345678|protein_coding|+|7H>7S|997C>T+998A>C+999T>G 8.11187e-05,0.000118558,1.87197e-05 0.000218396
The former --local-csq
output was:
$ bcftools csq -c CSQ -f ref.fasta -g ref.gff3 -p m -n 150 -l testcase.sorted.vcf | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1' | bcftools +split-vep -f '%CHROM\t%POS\t%Consequence\t%amino_acid_change\t%dna_change\t%DP\t%ADF\t%ADR\t%AD\t%CSQ\t[ %VAF]\t[ %VAF1]\n' | grep -v '^#' | head
Parsing ref.gff3 ...
Indexed 1 transcripts, 1 exons, 1 CDSs, 0 UTRs
Calling...
ref 988 missense,missense,missense 4T>4P,4T>4A,4T>4S 988A>C,988A>G,988A>T 641908 320881,23,14,8 0,0,0,0 320881,23,14,8 missense|SOMEGENE|12345678|protein_coding|+|4T>4P|988A>C,missense|SOMEGENE|12345678|protein_coding|+|4T>4A|988A>G,missense|SOMEGENE|12345678|protein_coding|+|4T>4S|988A>T 7.16676e-05,4.36238e-05,2.49279e-05 0.000140219
ref 989 missense,missense,missense 4T>4K,4T>4I,4T>4R 989C>A,989C>T,989C>G 641908 320884,25,12,7 0,0,0,0 320884,25,12,7 missense|SOMEGENE|12345678|protein_coding|+|4T>4K|989C>A,missense|SOMEGENE|12345678|protein_coding|+|4T>4I|989C>T,missense|SOMEGENE|12345678|protein_coding|+|4T>4R|989C>G 7.78991e-05,3.73916e-05,2.18117e-05 0.000137102
ref 990 synonymous,synonymous,synonymous 4T,4T,4T 990A>C,990A>G,990A>T 641466 319802,815,73,44 0,0,0,0 319802,815,73,44 synonymous|SOMEGENE|12345678|protein_coding|+|4T|990A>C,synonymous|SOMEGENE|12345678|protein_coding|+|4T|990A>G,synonymous|SOMEGENE|12345678|protein_coding|+|4T|990A>T 0.00254105,0.000227603,0.000137185 0.00290583
ref 991 missense,missense,missense 5A>5S,5A>5T,5A>5P 991G>T,991G>A,991G>C 641720 320684,100,34,43 0,0,0,0 320684,100,34,43 missense|SOMEGENE|12345678|protein_coding|+|5A>5S|991G>T,missense|SOMEGENE|12345678|protein_coding|+|5A>5T|991G>A,missense|SOMEGENE|12345678|protein_coding|+|5A>5P|991G>C 0.000311661,0.000105965,0.000134014 0.000551641
ref 992 missense,missense,missense 5A>5D,5A>5G,5A>5V 992C>A,992C>G,992C>T 641550 320685,82,6,4 0,0,0,0 320685,82,6,4 missense|SOMEGENE|12345678|protein_coding|+|5A>5D|992C>A,missense|SOMEGENE|12345678|protein_coding|+|5A>5G|992C>G,missense|SOMEGENE|12345678|protein_coding|+|5A>5V|992C>T 0.000255629,1.87046e-05,1.24697e-05 0.000286804
ref 993 synonymous,synonymous,synonymous 5A,5A,5A 993T>C,993T>A,993T>G 641138 320448,37,78,8 0,0,0,0 320448,37,78,8 synonymous|SOMEGENE|12345678|protein_coding|+|5A|993T>C,synonymous|SOMEGENE|12345678|protein_coding|+|5A|993T>A,synonymous|SOMEGENE|12345678|protein_coding|+|5A|993T>G 0.000115419,0.000243316,2.49555e-05 0.00038369
ref 994 missense,missense,missense 6S>6G,6S>6R,6S>6C 994A>G,994A>C,994A>T 641516 320604,48,92,16 0,0,0,0 320604,48,92,16 missense|SOMEGENE|12345678|protein_coding|+|6S>6G|994A>G,missense|SOMEGENE|12345678|protein_coding|+|6S>6R|994A>C,missense|SOMEGENE|12345678|protein_coding|+|6S>6C|994A>T 0.000149645,0.000286819,4.98815e-05 0.000486345
ref 995 missense,missense,missense 6S>6I,6S>6N,6S>6T 995G>T,995G>A,995G>C 641772 320742,133,10,3 0,0,0,0 320742,133,10,3 missense|SOMEGENE|12345678|protein_coding|+|6S>6I|995G>T,missense|SOMEGENE|12345678|protein_coding|+|6S>6N|995G>A,missense|SOMEGENE|12345678|protein_coding|+|6S>6T|995G>C 0.000414475,3.11635e-05,9.34906e-06 0.000454987
ref 996 missense,synonymous,missense 6S>6R,6S,6S>6R 996C>A,996C>T,996C>G 641018 320371,115,13,12 0,0,0,0 320371,115,13,12 missense|SOMEGENE|12345678|protein_coding|+|6S>6R|996C>A,synonymous|SOMEGENE|12345678|protein_coding|+|6S|996C>T,missense|SOMEGENE|12345678|protein_coding|+|6S>6R|996C>G 0.000358802,4.05602e-05,3.74402e-05 0.000436802
ref 997 missense,missense,missense 7H>7Y,7H>7N,7H>7D 997C>T,997C>A,997C>G 641906 320448,26,38,6 0,0,0,0 320448,26,38,6 missense|SOMEGENE|12345678|protein_coding|+|7H>7Y|997C>T,missense|SOMEGENE|12345678|protein_coding|+|7H>7N|997C>A,missense|SOMEGENE|12345678|protein_coding|+|7H>7D|997C>G 8.11187e-05,0.000118558,1.87197e-05 0.000218396
Would it be possible to mimic the previous output format?
I don't understand the dots in the 4th and 5th column of the haplotype-aware results and how to work with the supposedly "next data" section in columns 10 to 12. I know this is a TSV output from bcftools +split-vep
but I hope you get my point. You have this particular testcase data in your email from yesterday. Am I just misunderstanding the output or have unreal expectations of the output format?