bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

bcftools csq stop codon position does not fit

Open MWendorff opened this issue 2 years ago • 10 comments

Hi this is currently the last case of confusion:

In this example the protein lengths are 260 and 1020 amino acids for the two different transcripts. This mutation now effects the stop codon but for some reason it is for ENST00000440794 located at position 260* instead of 261*.

21 33817581 21_33817581_A_G A G 1474 PASS AF=2.5e-05;AQ=1474;ExcessHet=3.0103;QD=14.74;AC=1;AN=2;BCSQ=stop_lost|ITSN1|ENST00000399338|protein_coding|+|1021*>1021W|33817581A>G,stop_lost|ITSN1|ENST00000440794|protein_coding|+|260*>260W|33817581A>G GT:AB:AD:DP:GQ:PL:RNC:SBPV:BCSQ 0|1:0.49:49,51:100:99:1474,0,1356:..:0.2969:10

bcftools csq -p a --ncsq 128 -f Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -O b -o subset_csqProblem6_csq.bcf.gz -g Homo_sapiens.GRCh38.100.gff3.gz subset_csqProblem6.bcf.gz

MWendorff avatar Aug 10 '21 14:08 MWendorff

Are you sure about the position? A quick check at Ensembl suggests that the transcript has 259 aminoacids, so the stop codon would be 260th, and thus annotated correctly: https://www.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000205726;r=21:33642542-33735093;t=ENST00000440794

pd3 avatar Aug 11 '21 17:08 pd3

This is a screenshot from the link you send starting with the first and ending with the second transcript referred to in the csq output. and the second one is 260 aminoacids, therefore the stop codon should be 261.

So either one of us is too tired to read numbers or there is something strange in the reference that I am not aware of.

Bildschirmfoto 2021-08-12 um 10 45 38

MWendorff avatar Aug 12 '21 08:08 MWendorff

On the other hand, the actual sequence shows that the last aa L is 259th. Similar pattern is for the other transcripts as well, so I think their summary table is incorrect in stating these are the number of aa's; rather, they are the numbers of codons if that makes sense? image

pd3 avatar Aug 12 '21 09:08 pd3

On uniprot it is 260 amino acids long: https://www.uniprot.org/uniprot/H0Y523 As well as in the *.pep.all.fa from http://ftp.ensembl.org/pub/release-100/gff3/homo_sapiens/ used for consequence calling string length is 260.

It really looks like a problem in the reference. What source was your previous message from? A classical problem would be 0-indexing

MWendorff avatar Aug 12 '21 11:08 MWendorff

It's the link in my post above https://github.com/samtools/bcftools/issues/1553#issuecomment-896994364

pd3 avatar Aug 17 '21 11:08 pd3

You are right, no idea why I haven't seen it last week.

As far as I see the problem is a leading X in the first position of the sequence, obviously not considered for consequence calling. Therefore, it is not only the last position but the whole transcript which is indexed one position next to those indices given in the databases and the protein Fasta file downloaded with the reference used for calling.

Sources where the index is based on the leading X:

  1. If you click "Download Sequence" on the page you got your picture from.
  2. on the uniprot page linked on Ensembl: https://www.uniprot.org/uniprot/H0Y523
  3. In the *.pep.all.fa file

The same problem with a leading X occurs all in all in 8816 transcripts of the human genome. I have not checked mutations for all of them but in the first 3 cases the indexing is not as expected.

MWendorff avatar Aug 17 '21 14:08 MWendorff

OK, thanks for exploring this. Can this issue be considered resolved?

pd3 avatar Aug 18 '21 07:08 pd3

I would say it is a not very intuitive for the user to deal with the csq output in those cases. I guess it would be simpler if the sequence indices are fitting with the fasta sequence delivered additional with the reference file. On the other hand I have no idea how complex it would be to consider this directly in bcftools. Once someone is aware of a problem it is treatable but others might also fall into this trap.

Therefore, you have to decide if you want to draw consequences out of this finding or not.

Thanks anyway for your time.

MWendorff avatar Aug 18 '21 11:08 MWendorff

In case of incomplete CDS like this, the GFF file does not give sufficient information to determine the actual protein length. I suspect even the X shown in the resources above is just a placeholder and does not necessarily have to correspond to exactly one amino acid.

A potential improvement might be to mark transcripts with the incomplete CDSs somehow. Therefore I will leave this issue open and flag it as enhancement request

pd3 avatar Aug 19 '21 09:08 pd3