viralrecon icon indicating copy to clipboard operation
viralrecon copied to clipboard

SARS-CoV-2 ORF1b mutation reporting differences between SnpEff and Nextclade

Open peterk87 opened this issue 2 years ago • 9 comments

We've encountered issues with the reporting of ORF1ab mutations falling in the ORF1b region where the -1 frameshift is not taken into account by SnpEff.

Nextclade uses a different GFF than viralrecon when using MN908947.3 genome profile:

.	.	gene	266	13468	.	+	.	 gene_name=ORF1a
.	.	gene	13468	21555	.	+	.	 gene_name=ORF1b

https://github.com/nextstrain/nextclade/blob/96e42468cae447afdbc9d0eda73f160e8049e6e4/data/sars-cov-2/genemap.gff#L8

MN908947.3	Genbank	gene	266	21555	.	+	.	ID=gene-orf1ab;Name=orf1ab;gbkey=Gene;gene=orf1ab;gene_biotype=protein_coding
MN908947.3	Genbank	CDS	266	13468	.	+	0	ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
MN908947.3	Genbank	CDS	13468	21555	.	+	0	ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1

https://github.com/nf-core/test-datasets/blob/viralrecon/genome/MN908947.3/GCA_009858895.3_ASM985889v3_genomic.200409.gff.gz

So for example a C14322T nucleotide substitution will be reported by SnpEff as a T4686I AA substitution, but Nextclade will only report the nucleotide substitution and no AA substitution.

The AA at position 4686 should be Y instead of the T reported by SnpEff:

image

screenshot from https://www.ncbi.nlm.nih.gov/protein/QHD43415.1?report=graph

Potential solution

Use a custom GFF with separate ORF1a and ORF1b "gene" entries like Nextclade does.

peterk87 avatar Jan 28 '22 17:01 peterk87

Thanks for reporting @peterk87! Not sure what the best solution is here because by default the pipeline will and has always used the same reference fasta and annotation for reproducibility. It is up to the users to overwrite this configuration if required. Either we go low maintenance and add in a warning to ask users to point to their annotation or we deal with it internally somehow. Thoughts?

Ping @saramonzon.

drpatelh avatar Jan 31 '22 09:01 drpatelh

yes..we've encountered this issue too, it's not only the orf1b gene, the case is that the NC_045512.2 fasta from refseq and the MN908947.3 fasta from genebank are exactly the same, but the gff from refseq is more updated and therefore correct. However the primer beds and stuff uses MN908947.3 genome idenfifier, that's why that is used in viralrecon. As @drpatelh says we overwrite the fasta, gff and primer bed files (in this case we just substitute MN908947.3 to NC_045512.2, as the coordinates are the same) so we get the correct annotation. I guess if nextclade is using the correct annotation but against MN908947.3, they should have done something similar but substituting the genome identifier in the gff.

It does not have a good solution in terms of reproducibility, it is true that if we want the best results for the defaults parameters we should be using the gff for NC_045512.2, but that means manually changing either the primer bed files or the gff...

saramonzon avatar Jan 31 '22 10:01 saramonzon

Most other resources like the ARTIC primer sets etc all use MN908947.3 by default which is why it made sense to pick that. It meant we could just point to their primer beds and not have to host our own.

We could introduce a breaking change in the next release where we use a more updated annotation for MN908947.3 by copying across the GFF from NC_045512.2 and changing the values in the first column? But maybe we should also provide a mechanism where the older version can be used?

Are you using this annotation @saramonzon ? https://github.com/nf-core/configs/blob/9792226f5cecebf400205d435461ab3d047bed50/conf/pipeline/viralrecon/genomes.config#L15

drpatelh avatar Jan 31 '22 10:01 drpatelh

I've just checked they've just updated de gff for both genebank and refseq!! Refseq: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/ Genbank: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/

saramonzon avatar Jan 31 '22 11:01 saramonzon

Nice! Thanks @saramonzon

@peterk87 are you able to download the latest annotation for MN908947.3 and confirm that it fixes the issue for you?

drpatelh avatar Jan 31 '22 12:01 drpatelh

I have downloaded the latest GFF file for MN908947.3 and I can't see a difference in the primary annotation which probably means it won't fix this issue. I have attached the files here to help troubleshoot. Any other suggestions?

MN908947.3

Download date: 04th April 2020 (current default)

GCA_009858895.3_ASM985889v3_genomic.200409.fna.gz GCA_009858895.3_ASM985889v3_genomic.200409.gff.gz GCA_009858895.3_ASM985889v3_genomic.200409.gtf.gz

Download date: 31st January 2022

GCA_009858895.3_ASM985889v3_genomic.220131.fna.gz GCA_009858895.3_ASM985889v3_genomic.220131.gff.gz GCA_009858895.3_ASM985889v3_genomic.220131.gtf.gz

NC_045512.2

Download date: 04th April 2020 (current default)

GCF_009858895.2_ASM985889v3_genomic.200409.fna.gz GCF_009858895.2_ASM985889v3_genomic.200409.gff.gz GCF_009858895.2_ASM985889v3_genomic.200409.gtf.gz

Download date: 31st January 2022

GCF_009858895.2_ASM985889v3_genomic.220131.fna.gz GCF_009858895.2_ASM985889v3_genomic.220131.gff.gz GCF_009858895.2_ASM985889v3_genomic.220131.gtf.gz

drpatelh avatar Jan 31 '22 15:01 drpatelh

Hi @drpatelh,

Thanks for looking into this so quickly! I was hoping there was a SnpEff option to get the correct AA coordinates for ORF1b mutations.

Unfortunately, the updated GFF from NCBI did not produce the results we wanted. For now we've resorted to using a custom GFF based on GCA_009858895.3_ASM985889v3_genomic.220131.gff and it produces results that are in line with what we get from Nextclade.

In the modified GFF, the ORF1ab gene entry is replaced with 2 gene entries for ORF1a and ORF1b:

$ diff GCA_009858895.3_ASM985889v3_genomic.220131.gff MN908947.3-orf1a-orf1b-gene-split.gff
10,12c10,13
< MN908947.3    Genbank gene    266     21555   .       +       .       ID=gene-orf1ab;Name=orf1ab;gbkey=Gene;gene=orf1ab;gene_biotype=protein_coding
< MN908947.3    Genbank CDS     266     13468   .       +       0       ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
< MN908947.3    Genbank CDS     13468   21555   .       +       0       ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
---
> MN908947.3    Genbank gene    266     13468   .       +       .       ID=gene-ORF1a;Name=ORF1a;gbkey=Gene;gene=ORF1a;gene_biotype=protein_coding
> MN908947.3    Genbank CDS     266     13468   .       +       0       ID=cds-QHD43415.1;Parent=gene-ORF1a;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1a;product=ORF1a polyprotein;protein_id=QHD43415.1
> MN908947.3    Genbank gene    13468   21555   .       +       .       ID=gene-ORF1b;Name=ORF1b;gbkey=Gene;gene=ORF1b;gene_biotype=protein_coding
> MN908947.3    Genbank CDS     13468   21555   .       +       0       ID=cds-QHD43415.1;Parent=gene-ORF1b;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1b;product=ORF1b polyprotein;protein_id=QHD43415.1

Like you said, it's not great from a reproducibility standpoint to use a modified unofficial GFF, but at least we get the expected results for the ORF1b region.

peterk87 avatar Feb 01 '22 17:02 peterk87

Thanks @peterk87 ! That is very helpful. Not sure what to do about this since most of the standard annotations don't have an explicit entry for ORF1b. Will leave this issue open here as reference.

drpatelh avatar Feb 03 '22 16:02 drpatelh

It's also interesting to note that GISAID/CoVsurver reports AA substitutions for the component NSPs of ORF1ab/ORF1a, e.g.

N_G215C
NSP3_A1711V
Spike_P809S
Spike_T95I
N_D63G
NS7a_E91D
...
Spike_D614G
NSP4_V167L
Spike_L452R

This info is also present in the GISAID metadata table (metadata_tsv_yyyy_MM_dd.tar.xz) you can download from GISAID under the "AA Substitutions" field. Unfortunately, I haven't been able to find much info about how or what they use for generating this info.

peterk87 avatar Feb 03 '22 18:02 peterk87