open-cravat
open-cravat copied to clipboard
open-cravat octopus vs strelka
Similar to #63 but not hg19 specific. I have a BAM on which I've used both
https://github.com/luntergroup/octopus and https://github.com/Illumina/strelka
They agree on what they found (deletion of TG & deletion of TGTG) but disagree on how to represent that in the output. In particular
strelka says
chr7 74053320 . CTGTG CTG,C 278 PASS CIGAR=1M2D2M,1M4D;RU=TG,TG;REFREP=19,19;IDREP=18,17;MQ=60 GT:GQ:GQX:DPI:AD:ADF:ADR:FT:PL 1/2:65:19:36:1,12,9:0,2,1:1,10,8:PASS:282,102,66,142,0,123
while octopus instead writes that as 2 rows
chr7 74053320 . CTG C,* 266.45 PASS AC=1,1;AN=2;DP=37;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 2|1:93:37:59:74053320:100:PASS
chr7 74053320 . CTGTG C,* 48.37 PASS AC=1,1;AN=2;DP=37;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|2:48:37:59:74053320:100:PASS
the octopus team seems to view their representation as a feature, not a bug https://github.com/luntergroup/octopus#output-format
however numerous parts of opencravat behave quite differently when processing these. Here are some of the differences I've noted.
{
"base__all_mappings": [
"{\"ELN\": [[\"P15502\", \"\", \"INT\", \"ENST00000252034.12\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000320399.10\", \"c.1096+47_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000320492.11\", \"c.988+47_988+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000357036.9\", \"c.1111+47_1111+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000358929.8\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380553.8\", \"c.745+47_745+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380562.8\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380575.8\", \"c.1066+47_1066+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380576.9\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380584.8\", \"c.1054+47_1054+50del\"], [\"\", \"\", \"INT\", \"ENST00000414324.5\", \"c.1081+47_1081+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000429192.5\", \"c.1111+47_1111+50del\"], [\"\", \"\", \"INT\", \"ENST00000445912.5\", \"c.1096+47_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000458204.5\", \"c.1066+47_1066+50del\"], [\"\", \"\", \"INT,PTR\", \"ENST00000466878.5\", \"\"], [\"\", \"\", \"INT\", \"ENST00000621115.4\", \"c.964+47_964+50del\"]]}",
"{\"ELN\": [[\"P15502\", \"\", \"INT\", \"ENST00000252034.12\", \"c.1096+49_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000320399.10\", \"c.1096+49_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000320492.11\", \"c.988+49_988+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000357036.9\", \"c.1111+49_1111+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000358929.8\", \"c.1096+49_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380553.8\", \"c.745+49_745+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380562.8\", \"c.1096+49_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380575.8\", \"c.1066+49_1066+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380576.9\", \"c.1096+49_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380584.8\", \"c.1054+49_1054+50del\"], [\"\", \"\", \"INT\", \"ENST00000414324.5\", \"c.1081+49_1081+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000429192.5\", \"c.1111+49_1111+50del\"], [\"\", \"\", \"INT\", \"ENST00000445912.5\", \"c.1096+49_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000458204.5\", \"c.1066+49_1066+50del\"], [\"\", \"\", \"INT,PTR\", \"ENST00000466878.5\", \"\"], [\"\", \"\", \"INT\", \"ENST00000621115.4\", \"c.964+49_964+50del\"]]}",
"{\"ELN\": [[\"P15502\", \"\", \"INT\", \"ENST00000252034.12\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000320399.10\", \"c.1096+47_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000320492.11\", \"c.988+47_988+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000357036.9\", \"c.1111+47_1111+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000358929.8\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380553.8\", \"c.745+47_745+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380562.8\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380575.8\", \"c.1066+47_1066+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380576.9\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380584.8\", \"c.1054+47_1054+50del\"], [\"\", \"\", \"INT\", \"ENST00000414324.5\", \"c.1081+47_1081+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000429192.5\", \"c.1111+47_1111+50del\"], [\"\", \"\", \"INT\", \"ENST00000445912.5\", \"c.1096+47_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000458204.5\", \"c.1066+47_1066+50del\"], [\"\", \"\", \"INT,PTR\", \"ENST00000466878.5\", \"\"], [\"\", \"\", \"INT\", \"ENST00000621115.4\", \"c.964+47_964+50del\"]]}"
],
"base__cchange": [
"c.1096+47_1096+50del",
"c.1096+49_1096+50del",
"c.1096+47_1096+50del"
],
"base__ref_base": [
"TGTG",
"TG",
"TGTG"
],
"base__uid": [
2292288,
2229240,
2229241
],
"clinvar__disease_names": [
"not specified",
"Supravalvar aortic stenosis|Cutis laxa, autosomal dominant",
"not specified"
],
"clinvar__disease_refs": [
"MedGen:CN169374",
"Human Phenotype Ontology:HP:0004381,MONDO:MONDO:0008504,MedGen:C0003499,OMIM:185500,Orphanet:ORPHA3193,SNOMED CT:268185002|MONDO:MONDO:0019571,MedGen:C0268350,Orphanet:ORPHA90348,SNOMED CT:111388003",
"MedGen:CN169374"
],
"clinvar__id": [
"402825",
"360646",
"402825"
],
"clinvar__sig": [
"Benign",
"Uncertain significance",
"Benign"
],
"extra_vcf_info__AC": [
null,
1,
1
],
"extra_vcf_info__AN": [
null,
2,
2
],
"extra_vcf_info__CIGAR": [
"1M4D",
null,
null
],
"extra_vcf_info__DP": [
null,
37,
37
],
"extra_vcf_info__IDREP": [
17,
null,
null
],
"extra_vcf_info__MQ": [
60,
59,
59
],
"extra_vcf_info__NS": [
null,
1,
1
],
"extra_vcf_info__REFREP": [
19,
null,
null
],
"extra_vcf_info__RU": [
"TG",
null,
null
],
"extra_vcf_info__ref": [
"CTGTG",
"CTG",
"CTGTG"
],
"gnomad3__af": [
0.543651,
0.0472857,
0.543651
],
"gnomad3__af_afr": [
0.48686,
0.0873048,
0.48686
],
"gnomad3__af_asj": [
0.604372,
0.0314233,
0.604372
],
"gnomad3__af_eas": [
0.569697,
0.054698,
0.569697
],
"gnomad3__af_fin": [
0.573476,
0.0422155,
0.573476
],
"gnomad3__af_lat": [
0.602031,
0.0405933,
0.602031
],
"gnomad3__af_nfe": [
0.558495,
0.025235,
0.558495
],
"gnomad3__af_oth": [
0.55107,
0.0501949,
0.55107
],
"gnomad3__af_sas": [
0.547005,
0.044838,
0.547005
],
"vcfinfo__af": [
"0.4090909090909091",
null,
null
],
"vcfinfo__alt_reads": [
"9",
null,
null
],
"vcfinfo__phred": [
"278",
"266.45",
"48.37"
],
"vcfinfo__tot_reads": [
"22",
"37",
"37"
]
}```
Does the opencravat team feel that octopus is "just plain wrong"? "not supported"? "cutting edge"? something else?
Do these different opencravat outputs reflect a limitation of opencravat?
If so, does it seem fixable? or is it an unavoidable limitation of opencravat architecture?
am I going crazy? there was useful comment here from @rkimoakbioinformatics about the 3' rule https://varnomen.hgvs.org/recommendations/checklist/#:~:text=The%203'%20rule%20%2D%20do%20you,amino%20acid
@cariaso Sorry, I deleted my comments because I was afraid of possibly giving confusing information and wanted to write a clearer one again. I'll write below.
OpenCRAVAT normalizes input variants with the 3' rule and then feed the normalized variants to annotators. Thus,
- The first input: strelka input
chr7:74053320:CTGTG>CTGbecomeschr7:74053323:TGdelafter normalization with the 3' rule. octopus inputchr7:74053320:CTG>Cbecomeschr7:74053321:TGdelafter normalization with the 3' rule. Although these two variants look different after normalization with the 3' rule, when the hg38 mapper processes them, the mapper looks at the bases around the variants to apply the 3' rule further. This step by the mapper ensures that correct HGVS-format consequences are found. As a result, OpenCRAVAT produces the same variant consequences for the both variants (e.g.ENST00000252034.12 intron_variant c.1096+49_1096+50del). - The second input: strelka input
chr7:74053320:CTGTG>Cbecomeschr7:74053321:TGTGdelafter normalization with the 3' rule. octopus inputchr7:74053320:CTGTG>Cbecomeschr7:74053321:TGTGdelafter normalization with the 3' rule. Both are the same and OpenCRAVAT produces the same variant consequences (e.g.ENST00000252034.12 intron_variant c.1096+47_1096+50del).
As I said, I am not sure if this is an answer to your question.
I am not sure if this is an answer to your question.
It's not a final answer, but it definitely helps. I'm finding it a bit difficult to keep these straight, so I'm assigning names S1, S2, O1, O2 and attempting to restate what you've written. I'm not agreeing or disagreeing.
strelka VCF:
chr7 74053320 . CTGTG CTG,C 278 PASS CIGAR=1M2D2M,1M4D;RU=TG,TG;REFREP=19,19;IDREP=18,17;MQ=60 GT:GQ:GQX:DPI:AD:ADF:ADR:FT:PL 1/2:65:19:36:1,12,9:0,2,1:1,10,8:PASS:282,102,66,142,0,123
opencravat 3' normalized strelka:
S1 chr7:74053320:CTGTG>CTG becomes chr7:74053323:TGdel
S2 chr7:74053320:CTGTG>C becomes chr7:74053321:TGTGdel
octopus VCF:
chr7 74053320 . CTG C,* 266.45 PASS AC=1,1;AN=2;DP=37;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 2|1:93:37:59:74053320:100:PASS
chr7 74053320 . CTGTG C,* 48.37 PASS AC=1,1;AN=2;DP=37;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|2:48:37:59:74053320:100:PASS
opencravat 3' normalized octopus:
O1 chr7:74053320:CTG>C becomes chr7:74053321:TGdel
O2 chr7:74053320:CTGTG>C becomes chr7:74053321:TGTGdel
Although S1 & O1 look different after normalization with the 3' rule, when the hg38 mapper processes them, the mapper looks at the bases around the variants to apply the 3' rule further. This step by the mapper ensures that correct HGVS-format consequences are found. As a result, OpenCRAVAT produces the same variant consequences for the both variants (e.g. ENST00000252034.12 intron_variant c.1096+49_1096+50del).
S2 & O2 are the same and OpenCRAVAT produces the same variant consequences (e.g. ENST00000252034.12 intron_variant c.1096+47_1096+50del).
I see clear evidence of octopus working as you've described, with O1 & O2 as expected as 2 rows in the variant table. However for Strelka I'm only seeing evidence of S2.
The only entry in the variant table near this location looks like this:
base__uid = 2292288
base__chrom = chr7
base__pos = 74053321
base__ref_base = TGTG
base__alt_base = -
base__hugo = ELN
base__transcript = ENST00000252034.12
base__so = INT
base__cchange = c.1096+47_1096+50del
base__all_mappings = {"ELN": [["P15502", "", "INT", "ENST00000252034.12", "c.1096+47_1096+50del"], ["P15502", "", "INT", "ENST00000320399.10", "c.1096+47_1096+50del"], ["", "", "INT", "ENST00000320492.11", "c.988+47_988+50del"], ["P15502", "", "INT", "ENST00000357036.9", "c.1111+47_1111+50del"], ["P15502", "", "INT", "ENST00000358929.8", "c.1096+47_1096+50del"], ["P15502", "", "INT", "ENST00000380553.8", "c.745+47_745+50del"], ["P15502", "", "INT", "ENST00000380562.8", "c.1096+47_1096+50del"], ["P15502", "", "INT", "ENST00000380575.8", "c.1066+47_1066+50del"], ["P15502", "", "INT", "ENST00000380576.9", "c.1096+47_1096+50del"], ["P15502", "", "INT", "ENST00000380584.8", "c.1054+47_1054+50del"], ["", "", "INT", "ENST00000414324.5", "c.1081+47_1081+50del"], ["P15502", "", "INT", "ENST00000429192.5", "c.1111+47_1111+50del"], ["", "", "INT", "ENST00000445912.5", "c.1096+47_1096+50del"], ["", "", "INT", "ENST00000458204.5", "c.1066+47_1066+50del"], ["", "", "INT,PTR", "ENST00000466878.5", ""], ["", "", "INT", "ENST00000621115.4", "c.964+47_964+50del"]]}
clinvar__sig = Benign
clinvar__disease_refs = MedGen:CN169374
clinvar__disease_names = not specified
clinvar__rev_stat = criteria provided, single submitter
clinvar__id = 402825
extra_vcf_info__pos = 74053320
extra_vcf_info__ref = CTGTG
extra_vcf_info__alt = C
extra_vcf_info__CIGAR = 1M4D
extra_vcf_info__RU = TG
extra_vcf_info__REFREP = 19
extra_vcf_info__IDREP = 17
extra_vcf_info__MQ = 60
gnomad3__af = 0.543651
gnomad3__af_afr = 0.48686
gnomad3__af_asj = 0.604372
gnomad3__af_eas = 0.569697
gnomad3__af_fin = 0.573476
gnomad3__af_lat = 0.602031
gnomad3__af_nfe = 0.558495
gnomad3__af_oth = 0.55107
gnomad3__af_sas = 0.547005
vcfinfo__phred = 278
vcfinfo__filter = PASS
vcfinfo__zygosity = het
vcfinfo__alt_reads = 9
vcfinfo__tot_reads = 22
vcfinfo__af = 0.4090909090909091
tagsampler__numsample = 1
tagsampler__samples = HG02219
tagsampler__tags =
Should there be 1 or 2 entries in the variant table? If only 1, how should it represent S1 & S2?
There should be 2 entries in the variant table. The following is the test with which I saw two entries. The input and the tsv report file are attached.
oc run strelka.vcf --skip annotator -t tsv
In the tsv file and also with oc gui strelka.vcf.sqlite, two entries are in the result. By the way, vcf-converter version was 2.0.2.
There should be 2 entries in the variant table.
And indeed there are. I'd failed to notice one of them, because there remains a crucial difference between the results of strelka and octopus.
For strelka the results are just as you say, with 1 variant at
base__pos = 74053323
and the other at
base__pos = 74053321
however when the same process is repeated with a VCF files based on the 2 octopus VCF lines I provided above, there are again 2 variants, however they both have the same
base_pos = 74053321
Can you please repeat your oc run command from the prior comment, with this octopus VCF:
chr7 74053320 . CTG C,* 266.45 PASS AC=1,1;AN=2;DP=37;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 2|1:93:37:59:74053320:100:PASS
chr7 74053320 . CTGTG C,* 48.37 PASS AC=1,1;AN=2;DP=37;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|2:48:37:59:74053320:100:PASS
Do you see what I see, 2 variants with different base__pos?
If so, equivalent inputs produce variants reported at different locations, with secondary effects on clinvar__disease_names and numerous other fields.
I see the following from an oc run with your two octopus VCF format variants:
1 chr7 74053321 TG -
2 chr7 74053321 TGTG -
There are a few things here:
-
strelka and octopus actually give different decision on the first variant. strelka says "chr7 74053320 CTGTG CTG" meaning that positions 74053323 to 74053324 were deleted, while octopus says "chr7 74053320 CTG C" meaning that positions 74053321 to 74053322 were deleted.
-
Before mapping to transcripts, OpenCRAVAT does its best to standardize input position, reference allele, and alternate allele within the limit of what is given as input. Thus, at this stage, OpenCRAVAT does not know that whether the 3' rule can be further applied to
chr7 74053321 TG -or not. To be fair, as far as I know, no other tool inquires that, either. -
Further complication is that, if we follow HGVS's 3' rule, strelka's
chr7 74053320 CTGTG CTGand octopus'chr7 74053320 CTG Care both wrong. There are 19 repeats ofTGwhich span from the position 74053321 to the position 74053358. Thus, according to HGVS's 3' rule, the correct representation would bechr7 74053357 TG -, not 74053321 nor 74053323 (OpenCRAVAT's mapper follows these repeats and outputs correct cDNA and protein changes). -
The issue here is that there is not yet any standard which all variant callers and annotation databases agree to use. If the HGVS 3' rule was considered as the standard, strelka and octopus would be both wrong. If ClinVar would adopt the 3' rule, both strelka and octopus variants wouldn't be found in ClinVar.
-
Maybe internally applying normalization on input variants and annotation databases is a solution and this can be the value that such a tool as OpenCRAVAT can contribute, although there can be the issue of the confusion coming from the difference between an annotation source's original data and its OpenCRAVAT-processed version. Also, it would be quite a large work.
What do you think?
What do you think?
https://www.youtube.com/watch?v=dsx2vdn7gpY
more seriously, or perhaps just more productively, I really don't know. I've been vaguely aware of these sorts of issues for a quite a while, but until you've got a concrete case it's hard to make progress. I think this is useful in that way. No tool, nor any database alone can solve this. And any industry wide fix is likely to be a very slow slog. It's quite disheartening, but I expect that I can now use this thread to raise related bug reports in the https://github.com/Illumina/strelka https://github.com/luntergroup/octopus repos, and perhaps raise it to clinvar.
I suppose it also speaks to the need for an independent tool to do the 3' shifting. Otherwise there will be a lot of independent, and slightly different solutions to the problem.
Even with that, there is a timing issue very akin to https://en.wikipedia.org/wiki/Dagen_H which would be impossible in practice
https://github.com/Illumina/strelka/issues/193 https://github.com/luntergroup/octopus/issues/172
Ok. Meanwhile, OpenCRAVAT team will discuss the possibility of applying a unified process on all resources for consistency. If this strelka, octopus, and ClinVar case is solved within OpenCRAVAT, I'll post an update.
I'll also mention that while HGVS seems to dictate 3', the VCF spec seems to necessitate a 5' 'upstream' base before indels (it seems left aligned might be a more correct term). For this reason, when I've adhoced partial solutions to this, In the past I've done 5' shifting. Perhaps what you're imagining is a post-VCF step and it doesn't matter. But if there was a generic tool for shifting VCFs, it would likely take a VCF as input and spit one out. If so, having both 5' and 3' requirements might end up quite unnecessarily large, and perhaps raise pathologic issues with the difference between the individual vs the chosen reference.
https://genome.sph.umich.edu/wiki/Variant_Normalization argues for 5' normalization, and indicates that this tool can do the job https://github.com/atks/vt https://genome.sph.umich.edu/wiki/Vt#Normalization
https://annovar.openbioinformatics.org/en/latest/articles/VCF/
There is no community consensus yet on how to describe an indel in an unique way. Many users prefer to do a left-normalization. Left-normalization means that the start position of a variant should be shifted to the left utill it is no longer possible to do so, so that the smaller the number, the better. However, HGVS clearly specifies that left-normalization would be performed on cDNA (mRNA) coordinate, which means that right-normalization is required for half the genes in human genome. We will just have to accept the fact that people do not agree with each other at this point.
Currently, the following databases in ANNOVAR are left-normalized so that users can directly use them to compare to your left-normalized variant file: ... clinvar_20150330 ....
more notes that seem relevant
https://samtools.github.io/bcftools/bcftools.html#norm
https://github.com/Janchorizo/ban-vcf
https://github.com/freebayes/freebayes suggests others that might be relevant
decompose the output with tools in vcflib, vt, vcf-tools, bcftools, GATK, or Picard.
I'm seeing a broad consensus that left-aligning is the most common for vcfs (confirmed for octopus but see comments on why this isn't ideal at https://github.com/luntergroup/octopus/issues/172#issuecomment-823138437 ) and I suspect that is what both strelka and clinvar do.
https://pypi.org/project/bioutils/ bioutils.normalize – allele normalization (left shuffle, right shuffle, expanded, vcf)
expanded is new to me, and used by https://github.com/ga4gh/vrs-python & https://github.com/ga4gh/vrs the idea being that any ambiguous positioning will be expanded to cover the entire range of ambiguous positions so that comparisons are easier.
ga4gh:VA.rUTGEnc_US6EQSAh64vrXMyDVxq_TWL2
NC_000007.14:g.74053331_74053332del
becomes
ga4gh:VA.cTEuSY_2j4R7UFLTj_Rh10JjVDayk_vr
NC_000007.14:g.74053321_74053359delinsTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
and
ga4gh:VA.GaUmNc2yxPnonnHIqiWOG1iiPJJKSzB0
NC_000007.14:g.74053321_74053322del
becomes
ga4gh:VA.cTEuSY_2j4R7UFLTj_Rh10JjVDayk_vr
NC_000007.14:g.74053321_74053359delinsTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
note the difference between
NC_000007.14:g.74053331_74053332del
NC_000007.14:g.74053321_74053322del
ending up with the same expanded form