bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

bcftools-1.19 csq: output is not compatible with previous --local-csq format

Open mmokrejs opened this issue 5 months ago • 1 comments

Hi Peter, I wrote a simple parser for the consequences output by the --local-csq code. It turned out the haplotype-aware code outputs way different format. I am guessing there are some pointers to previous lines (see e.g. @988 below) which make processing harder as I would need to cache previous consequence (e.g. missense value). But it also seems two resulting consequences are on a single line. So in summary, the --local-csq has output everything (actualy three) consenquences on a single line. The haplotype aware caller has ouput 7 consequences spanning 3 lines if I am counting correctly (yes, after reformatting through bcftools +split-vep).

$ bcftools csq -c CSQ -f ref.fasta -g ref.gff3 -p m -n 150 -s - testcase.sorted.vcf' | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1' | bcftools +split-vep -f '%CHROM\t%POS\t%Consequence\t%amino_acid_change\t%dna_change\t%DP\t%ADF\t%ADR\t%AD\t%CSQ\t[ %VAF]\t[ %VAF1]\n' | grep
-v '^#' | head
Parsing ref.gff3 ...
Indexed 1 transcripts, 1 exons, 1 CDSs, 0 UTRs
Calling...
Warning: fewer CSQ fields than expected at ref:989, filling with dots. This warning is printed only once.
ref     988     missense        4T>4H   988A>C+989C>A+990A>C    641908  320881,23,14,8  0,0,0,0 320881,23,14,8  missense|SOMEGENE|12345678|protein_coding|+|4T>4H|988A>C+989C>A+990A>C  7.16676e-05,4.36238e-05,2.49279e-05     0.000140219
ref     989     @988    .       .       641908  320884,25,12,7  0,0,0,0 320884,25,12,7  @988     7.78991e-05,3.73916e-05,2.18117e-05     0.000137102
ref     990     @988    .       .       641466  319802,815,73,44        0,0,0,0 319802,815,73,44        @988     0.00254105,0.000227603,0.000137185      0.00290583
ref     991     missense        5A>5Y   991G>T+992C>A+993T>C    641720  320684,100,34,43        0,0,0,0 320684,100,34,43        missense|SOMEGENE|12345678|protein_coding|+|5A>5Y|991G>T+992C>A+993T>C  0.000311661,0.000105965,0.000134014     0.000551641
ref     992     @991    .       .       641550  320685,82,6,4   0,0,0,0 320685,82,6,4   @991     0.000255629,1.87046e-05,1.24697e-05     0.000286804
ref     993     @991    .       .       641138  320448,37,78,8  0,0,0,0 320448,37,78,8  @991     0.000115419,0.000243316,2.49555e-05     0.00038369
ref     994     missense        6S>6V   994A>G+995G>T+996C>A    641516  320604,48,92,16 0,0,0,0 320604,48,92,16 missense|SOMEGENE|12345678|protein_coding|+|6S>6V|994A>G+995G>T+996C>A  0.000149645,0.000286819,4.98815e-05     0.000486345
ref     995     @994    .       .       641772  320742,133,10,3 0,0,0,0 320742,133,10,3 @994     0.000414475,3.11635e-05,9.34906e-06     0.000454987
ref     996     @994    .       .       641018  320371,115,13,12        0,0,0,0 320371,115,13,12        @994     0.000358802,4.05602e-05,3.74402e-05     0.000436802
ref     997     missense        7H>7S   997C>T+998A>C+999T>G    641906  320448,26,38,6  0,0,0,0 320448,26,38,6  missense|SOMEGENE|12345678|protein_coding|+|7H>7S|997C>T+998A>C+999T>G  8.11187e-05,0.000118558,1.87197e-05     0.000218396

The former --local-csq output was:

$ bcftools csq -c CSQ -f ref.fasta -g ref.gff3 -p m -n 150 -l testcase.sorted.vcf | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1' | bcftools +split-vep -f '%CHROM\t%POS\t%Consequence\t%amino_acid_change\t%dna_change\t%DP\t%ADF\t%ADR\t%AD\t%CSQ\t[ %VAF]\t[ %VAF1]\n' | grep -v '^#' | head
Parsing ref.gff3 ...
Indexed 1 transcripts, 1 exons, 1 CDSs, 0 UTRs
Calling...
ref     988     missense,missense,missense      4T>4P,4T>4A,4T>4S       988A>C,988A>G,988A>T    641908  320881,23,14,8  0,0,0,0 320881,23,14,8  missense|SOMEGENE|12345678|protein_coding|+|4T>4P|988A>C,missense|SOMEGENE|12345678|protein_coding|+|4T>4A|988A>G,missense|SOMEGENE|12345678|protein_coding|+|4T>4S|988A>T    7.16676e-05,4.36238e-05,2.49279e-05     0.000140219
ref     989     missense,missense,missense      4T>4K,4T>4I,4T>4R       989C>A,989C>T,989C>G    641908  320884,25,12,7  0,0,0,0 320884,25,12,7  missense|SOMEGENE|12345678|protein_coding|+|4T>4K|989C>A,missense|SOMEGENE|12345678|protein_coding|+|4T>4I|989C>T,missense|SOMEGENE|12345678|protein_coding|+|4T>4R|989C>G    7.78991e-05,3.73916e-05,2.18117e-05     0.000137102
ref     990     synonymous,synonymous,synonymous        4T,4T,4T        990A>C,990A>G,990A>T    641466  319802,815,73,44        0,0,0,0 319802,815,73,44        synonymous|SOMEGENE|12345678|protein_coding|+|4T|990A>C,synonymous|SOMEGENE|12345678|protein_coding|+|4T|990A>G,synonymous|SOMEGENE|12345678|protein_coding|+|4T|990A>T       0.00254105,0.000227603,0.000137185      0.00290583
ref     991     missense,missense,missense      5A>5S,5A>5T,5A>5P       991G>T,991G>A,991G>C    641720  320684,100,34,43        0,0,0,0 320684,100,34,43        missense|SOMEGENE|12345678|protein_coding|+|5A>5S|991G>T,missense|SOMEGENE|12345678|protein_coding|+|5A>5T|991G>A,missense|SOMEGENE|12345678|protein_coding|+|5A>5P|991G>C    0.000311661,0.000105965,0.000134014     0.000551641
ref     992     missense,missense,missense      5A>5D,5A>5G,5A>5V       992C>A,992C>G,992C>T    641550  320685,82,6,4   0,0,0,0 320685,82,6,4   missense|SOMEGENE|12345678|protein_coding|+|5A>5D|992C>A,missense|SOMEGENE|12345678|protein_coding|+|5A>5G|992C>G,missense|SOMEGENE|12345678|protein_coding|+|5A>5V|992C>T    0.000255629,1.87046e-05,1.24697e-05     0.000286804
ref     993     synonymous,synonymous,synonymous        5A,5A,5A        993T>C,993T>A,993T>G    641138  320448,37,78,8  0,0,0,0 320448,37,78,8  synonymous|SOMEGENE|12345678|protein_coding|+|5A|993T>C,synonymous|SOMEGENE|12345678|protein_coding|+|5A|993T>A,synonymous|SOMEGENE|12345678|protein_coding|+|5A|993T>G       0.000115419,0.000243316,2.49555e-05     0.00038369
ref     994     missense,missense,missense      6S>6G,6S>6R,6S>6C       994A>G,994A>C,994A>T    641516  320604,48,92,16 0,0,0,0 320604,48,92,16 missense|SOMEGENE|12345678|protein_coding|+|6S>6G|994A>G,missense|SOMEGENE|12345678|protein_coding|+|6S>6R|994A>C,missense|SOMEGENE|12345678|protein_coding|+|6S>6C|994A>T    0.000149645,0.000286819,4.98815e-05     0.000486345
ref     995     missense,missense,missense      6S>6I,6S>6N,6S>6T       995G>T,995G>A,995G>C    641772  320742,133,10,3 0,0,0,0 320742,133,10,3 missense|SOMEGENE|12345678|protein_coding|+|6S>6I|995G>T,missense|SOMEGENE|12345678|protein_coding|+|6S>6N|995G>A,missense|SOMEGENE|12345678|protein_coding|+|6S>6T|995G>C    0.000414475,3.11635e-05,9.34906e-06     0.000454987
ref     996     missense,synonymous,missense    6S>6R,6S,6S>6R  996C>A,996C>T,996C>G    641018  320371,115,13,12        0,0,0,0 320371,115,13,12        missense|SOMEGENE|12345678|protein_coding|+|6S>6R|996C>A,synonymous|SOMEGENE|12345678|protein_coding|+|6S|996C>T,missense|SOMEGENE|12345678|protein_coding|+|6S>6R|996C>G     0.000358802,4.05602e-05,3.74402e-05     0.000436802
ref     997     missense,missense,missense      7H>7Y,7H>7N,7H>7D       997C>T,997C>A,997C>G    641906  320448,26,38,6  0,0,0,0 320448,26,38,6  missense|SOMEGENE|12345678|protein_coding|+|7H>7Y|997C>T,missense|SOMEGENE|12345678|protein_coding|+|7H>7N|997C>A,missense|SOMEGENE|12345678|protein_coding|+|7H>7D|997C>G    8.11187e-05,0.000118558,1.87197e-05     0.000218396

Would it be possible to mimic the previous output format?

I don't understand the dots in the 4th and 5th column of the haplotype-aware results and how to work with the supposedly "next data" section in columns 10 to 12. I know this is a TSV output from bcftools +split-vep but I hope you get my point. You have this particular testcase data in your email from yesterday. Am I just misunderstanding the output or have unreal expectations of the output format?

mmokrejs avatar Jan 09 '24 12:01 mmokrejs