cuteSV
cuteSV copied to clipboard
Malformed VCF due to `R` reference base
Hello @tjiangHIT, @Meltpinkg and cuteSV developers,
cuteSV v2.0.3 can produce malformed VCF output containing R
nucleotides in the REF
column. These are not allowed according to the VCF v4.2 specification: REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). The VCF v4.3 specification additionaly mentions: IUPAC ambiguity codes should be converted to a concrete base. Downstream tools such as HTSJDK throw an error correctly stating that the VCF is malformed.
For our use case this result in analysis that cannot complete.
Example:
chr10 131592457 cuteSV.DEL.1321 TCCCAGGTTCAAGTGATTCTCTTGCTTAACCCTCCCGAGTAGCTGGGATTACAGGCACCCACAAGAACACCCAGCTGATTTTTGTATTTTTAGCAGAGACAGGGTTTCACTGTGTTGGCCAGGCTGGTCTCGAACTCCTGACCTTGTGATCTGCCTGCCTTGGCCTCCCAAAGTACTGGGATTAATTATTTTTCCTTTTTAAGGTTAAATAATATTCCATTTTGTGGATATGCCACATTTTGTTTATCCATTCATCTGTCAACAGACACTTGGGTTGCTTCCATCTTTTGACTATTGTGAATAATGCTGTTGTGGACATGGGTGTAGAAACATCTCTTTGAGGCTCTGCTTTTAATTCTTTGAGGTATATACCCAGAGGTGTAATTGCTGGATCATGTGAAATCTGAGAAACCACCATATTGTTTCTATAGTTGTGTAGTATCTCACTGTGGTTTTGATTTGCATTTTCCTAATTATTCATGTTGTTGAGCATCTTTTCATGTACTTATTGGTCATTTGTATATCATTGGAGAAATATATATTCAAGTCCTTTGTCTATTTTTTAATTGTGTTGTTTTTTGGTTGTTGAATTGCAAGAGTTCTTTATATATGGATAGTAATCCGTTATCAGATATATAATTTACAAATATTTCCTGCCATTCAGTGTGTTGCCTTTTACTCTGTTGACAGTGTCATTTGATTCACAAAAATTTTTAATATTTACATGTTCCAATTATCTGATTTTTTTGTTGCCTATGCTTTCGGTGTCGTAGCCAAGAAATCCTTGCCAAATGCAATGCCATGAAGCTGTGCCCCTACATTTTCTTGTGAGTATTCTAACTCTCATATCTAAGTCTTTGACTATTTTTAATTTCTGCATATGGTGTAAGGTAAGGGTACAACTTCATTCTTTTGCATGTGGCTATCCAGTTTTCCCAGTAACATTTGTTGAAAAGACTGTCCTTTTCCCTATTGGATAGTCCTAGCAACTTTTTAAAAAATCACAAGGCCATATATACAAGAGTTTATTTCTGGGCTCTCTATTCTATCTCACTGATCTATGTGTCTGTCTATACGTCAATACCACTCTGTTTTTAATACTGTAGATTTTTAGAAATTTTGAAACTAAGAAGTGTGAGACCTCCAACTGTGTTCTTTTTCAAGATTGTTTTTGCTATTTAGGGTCCCTTGAGATTCTATATGAATGTTAGGATAGATTTTTCTAGTTTTGTAAAAAAAAATTGATGTTGGAATTTTAAGATAAATTGCATTTAATCTAGAGACCACATCTTTCAATTTTAGGTCTTCTCATCTATGAACAAAGGATGTCTATTTTTGTAGTGTCTTTAATTTCTTTGAGCAATATTTCATAGTTTTCAGTGTACACATCTTTCACCTCCTTGGTTCAGTTTGTTTCTATTTTTTATTTTGTTTGGTCCCACTTTAAATGAAATTGCTTTCTTAATTTCTTTTTCAGGTTGTTCATTGTTATTGTATAGAAACACAGCTAATTTCTGTATGCTGAGTATTCTGTAAGTTTGCTAATTTTGTTATTAGTTCTATCATGTTTCTTATGGAATCTTTGGGGTTTTCTACATATGAAATTACATCATCTATGAAAGGGATCGTTTTACTTTTTATTTCCCAATTTTAATGCTTTTTATTTCCTAATTTATCTGGTCAAGATTTCCATTACTATGCTGAATTTAAAAGTAGGCATTCTTCCCTTGTGTCTTAGCTTAGAAGAAAAGTTTTCAATCTTTCATCATTAAGTATGATGTTAGCAATGGGCTTTCCATATATGGCCTTAATTATGTTGAGGTAGTTTCCTTCTGTTCCTAGTTTGGTGRATGTTTTTTATCATGGAAAGGTGTTGGATTTTGTCAAATATTTTTCTCCATCAATTGAGATGATCACATGGGAACTGTTTCTTCATTCTGTTAATGTAGTTATTACATTAATTCATTTTCATATGTTGAACTATCCTTGAATTTCAGAAATAAATCCCACGAGGTCATGTGTATAATTTTTTTGATGTGTCACTTAATTCTGTTCACTAATATTTGGTTGAGGATTTTTACATCAGTATTTATCAGAGATATTGATCTGTAGCTTAATTTTATTGTAGTACCTTTGTCTTGCTTTGGTGAAAGAGTAATCTTGGCCTTGAAGAATAAGTTTGAAAGTGTCCCCTTACCTTAAACTTTTTTGGAAACTTTTGAGAAGGATTAGTGTTAACTCTTCTTTAAATGTTTGGTAGAATTCACGAATGAAGCCATCAGCTCCTGGGATTTTCTTTGTTGGCAGATTTTGGATCATTGATTCAATCTCTTTGCTAGTTATATGTCTGTTCGTATTTTCTATTTCTTTGTGGKTTAGTCTTGGTAGGTGGTATATGTCTAGGAATTTATCCATTTTGTCTAGGTTGTCCAATTTTTTGGCATACAAATATTCATACTATTGTCTTATTAATATAATCATTTTATTTCTGTTAAATCAGTGGTAATGTCTGCACTTACATTTCTGATTTTAGTTATTGAGACTTCCCTCTTTTATCTTACTCAGTCGAACTAATTGTTCATTAATTTTGGTGATTTTTTCAAAGAACTGAACTTGGTTTTGCTAACTTACTCTACCATGTTCCTATTCTTTATTTCAGTTGTCTGTACTCTAGTCTTTATTATTTCTTTCCTTCTACTGGATTTGGGTTTAGTGTGTTCTCCCTTTTTCTACTTCTTTAAGGTATAATGTTAGATTGTTAATTTAAGATCTTTCTTCTTGTTTATCATAAGCATTTACACTATAAACTACCCTCCTAGCACAGATTTTGATGCATCTGGTAAGTTTTGGTATGTTTACTGTAGCCCTGCAATATAGTTTGAAGTCAGGTAATGTGATGCCTCCAGCTGTGTTCTTTTTGCTTAGGGTTGCCTTGGCCATTCGGGCTCTTTTTTGGTTCCATATGAATTTTAAAATAGTTTTTTCTAGTTCTGTGAAGAATGTCATTGGTAGCTTAATAAAAATAGCATTGAATCTGTACACTGCTTTGGGCAGTATGGTCATTTTAATAAGATTGATTCTTCCTATCTGTGAGCATGAGATTTTTAAAAATTTGTTTTTGTCTTACCTGATTTCTTTCAGCAGTGCTTTGTAATTCTCACTGCAGAGATCTTTCACCTCCCTGGTTAGCTGTATTCCTAGATATTTTWTCATTTTTGCAGCAATTGTGAATGAGATTGCCTTCCTGATTTGTTTCTCGGCTTGGTTTCTTCTTGTTGTTTGTGTACAGGAATGCTGGTGATTTTTCTACATTGATTTTGTATCCTGAAACTTTGCTGAAGTTGTTTATCAGCTGAAGGAGCTTTTGGGTCRAGACTATGGGTTTTTCTAGATATAGAATCATGTCATCTGCAAATAGGGATAGTCTGATATCCTCTCTTCCTATTTGGATATGCTTTATTTCTTTATTTTGCCTGATTGCTCTGGCTAAGACTTCCAATAATACTTGAATAGGATTGGTGAAAGAAGGCATTCTTGTCACGTGTTGGTTTTCAAAAGGAATTCTTCCAGCTTTTGCCCATTTAGTATGATGTTGCCTGTTAGTTTGTCACATATGGCTCTTATTATTTTGAGTTGTGTTCCAAAACATCATGGTGCTGGTACAAAAACAGGCACATAGACCAATGSAACAGATAGAGAGCCTAGAAATAAGACTGCACACTTACAACCATCTGATCTTCAACAAAGCTGACAAAAACAAGCAATGGGGAAAAGACTCCCTATTCAATAAATGGTACTTGGATAAGTGGCTAGCCATATGCAGAAGATTGAAGGTAGACCCCTTCCTTGCACCATATACCAAAATCAACTCAAGATGGATTAAAGACTTACATATAAAACCCAAAACTATAAAAAACCCTGGGAGACAACCTAGGCAATATTATCCTGTACATAGGAATGGGCAAAGATTTCATGACAAAGCAATCACAAAAGCAATCACAACAAAAACAAAAATTGACAAATAAGATCTAATTAAACTTAAGAGCTTCTGCACAGCAAAAGAAACTATCAGCAGAGTAAACAGACAACCTACAGGATGGCAGAAAATATTTGCATATTATGCATCTGACAAAGGTCTAATATCCAGCATCTATAAGAAACTTAAACAAGTTTATAAGCAAAAAACAAACAACCCCATTAAAAAGGGGGCAAAGGACATGAACACTTCTCAAAAGAAGACATACGTGCAACCAACAAGCATATGAAGAAAAGCTCAATATCACTGATCATTAGAGAAATGCAAATAAAAACCACAACGAGATACTGTCTCACAACAATCAGAATAGCATTATTAAAAATTCAAAAAAATAACAGATACTGGTGAGGTTGTGGAGAAAAGGGACCACTTATACACTGTTGATGAAAGTGTAAGTTAGTTCAACCATTGTGGAAAGCAGTATGGCGATTCTTCAAAGAAAGAGCTAAAAACAGAATTACCATTCAACTCAGGAATCCCATTACTGGGTATATGCCCAGAGGAATATAAATCATTCCACCATAAAGACACATGCACANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAATCTCGGCTCACTGCAACCTCCGT T 0 q5 IMPRECISE;SVTYPE=DEL;SVLEN=-4698;END=131597155;CIPOS=-13,13;CILEN=-2,2;RE=10;RNAMES=NULL;AF=0.1471;STRAND=+-;AC=0;AN=2;CSQ=deletion|intergenic_variant|MODIFIER||||||||||||||||1||||1|||||||||||||||||||||||||||||||||||||||||||||||||| GT:DR:DV:PL:GQ ./.:.:.:.:. 0/0:58:10:0,78,458:78 ./.:.:.:.:.
Command:
cuteSV --sample HG002 --genotype --max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3 --threads 4 vip_AshkenazimTrio_HG002.cram.bam GCA_000001405.15_GRCh38_no_alt_analysis_set.fna cutesv_output.vcf .
We've replaced Sniffles2
with cuteSV
which suffers from what seems to be the same issue including the REF_MISMATCH
warnings mentioned in that issue. Actually this issue was the primary reason for the switch.
- Would it be possible to fix the 'R' reference bases issue such that cuteSV produces valid VCF?
- Would it be possible to update cuteSV such that the content of the REF column matches the reference genome that was used when calling cuteSV?
- Would it be possible to update cuteSV to produce VCF with records sorted in the order of contigs as defined in the VCF header? This would ease handling operations like merging VCFs downstream.
Greetings, @dennishendriksen
Hello, @dennishendriksen
In cuteSV, the content of REF column is extracted from the input reference genome, so that the R nucleotides comes from the reference genome. For the three questions mentioned above:
- The 'R' reference bases can be modified via using sed commands on the output VCF.
- The content of the REF column is extracted from the reference genome so that it is already matched.
- cuteSV sorts the VCF file in the lexicographical order now. To change the way of sorting records, you can use commands to modify the output VCF to sort it in other ways.
Hope it will help.
Best, Shuqi
Thank you for your quick reply.
The 'R' reference bases can be modified via using sed commands on the output VCF.
I can confirm that this is indeed what is stored in the reference genome. The sed
workaround will probably work, thank you for the suggestion.
In order to produce valid VCF you might consider the following for a proper fix:
From https://samtools.github.io/hts-specs/VCFv4.3.pdf page 8:
If the reference sequence contains IUPAC ambiguity codes not allowed by this specification (such as R = A/G), the ambiguous reference base must be reduced to a concrete base by using the one that is first alphabetically (thus R as a reference base is converted to A in VCF.)
The content of the REF column is extracted from the reference genome so that it is already matched. I keep seeing differences between the REF column content of cuteSV and the reference genome, for example:
chr10 132172946 cuteSV.BND.9 N ]chr4:182834343]N 0 q5 IMPRECISE;SVTYPE=BND;RE=5;RNAMES=NULL;AF=0.1282;AC=0;AN=2
Check:
$ samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz chr10:132172946-132172946
>chr10:132172946-132172946
C
Alternative check:
$ bcftools norm --check-ref w --fasta-ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz cutesv.vcf.gz
REF_MISMATCH chr10 132172946 N C
Additional examples:
REF_MISMATCH chr1 143203947 N A
REF_MISMATCH chr1 143203948 N A
REF_MISMATCH chr1 143211397 N G
REF_MISMATCH chr1 143211405 N T
REF_MISMATCH chr1 143242873 N G
REF_MISMATCH chr10 26893473 N G
REF_MISMATCH chr10 26894430 N C
REF_MISMATCH chr10 26894431 N C
REF_MISMATCH chr10 26894434 N T
REF_MISMATCH chr10 38918271 N A
REF_MISMATCH chr10 38918278 N T
REF_MISMATCH chr10 38918279 N G
REF_MISMATCH chr10 38918285 N G
REF_MISMATCH chr10 38969688 N C
REF_MISMATCH chr10 38969692 N C
REF_MISMATCH chr10 38969722 N G
REF_MISMATCH chr10 38969741 N C
REF_MISMATCH chr10 38969749 N C
REF_MISMATCH chr10 38969765 N C
REF_MISMATCH chr10 42066267 N C
REF_MISMATCH chr10 42066268 N G
REF_MISMATCH chr10 42119794 N A
REF_MISMATCH chr10 132172946 N C
REF_MISMATCH chr10 132172954 N T
REF_MISMATCH chr10 133004611 N A
REF_MISMATCH chr10 133740610 N C
REF_MISMATCH chr10 133740611 N G
REF_MISMATCH chr10 133740612 N C
REF_MISMATCH chr11 175283 N C
REF_MISMATCH chr11 176543 N T
REF_MISMATCH chr17 21290420 N T
This seems like an cuteSV issue? Working around this issue is tricky, because the N
REF is part of the ALT as well (e.g. N ]chr4:182834343]N
). I'm curious to hear about your thoughts.
cuteSV sorts the VCF file in the lexicographical order now. To change the way of sorting records, you can use commands to modify the output VCF to sort it in other ways.
That makes perfectly sense. I'll do some postprocessing afterwards.
This seems like an cuteSV issue? Working around this issue is tricky, because the N REF is part of the ALT as well (e.g. N ]chr4:182834343]N ). I'm curious to hear about your thoughts.
We are still struggling with working around this issue. Any thoughts?