
plasmid classified as virus?

Open xinehc opened this issue 1 year ago • 7 comments

Hi,

When classifying some RefSeq sequences, I noticed that some plasmids are being classified as viruses. One example is the reference assembly of Klebsiella pneumoniae (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000240185.1/).

It seems sequence NC_016838.1 has more virus hallmarks than plasmid hallmarks, which makes genomad classify it as a virus. However, this sequence carries the AMR gene blaCTX-M, which is unlikely to show up in a virus.

I am particularly interested in classifying sequences as only chromosomes or plasmids, so I wonder whether it is possible to prevent genomad from outputting virus classifications. Thanks.

	seq_name	length	topology	coordinates	n_genes	genetic_code	virus_score	fdr	n_hallmarks	marker_enrichment	taxonomy
1	NC_016845.1|provirus_1288374_1340563	52190	Provirus	1288374-1340563	75	11	0.9788	NA	11	91.4045	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
2	NC_016845.1|provirus_2282085_2324920	42836	Provirus	2282085-2324920	62	11	0.9781	NA	10	79.4788	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
3	NC_016845.1|provirus_4049987_4084759	34773	Provirus	4049987-4084759	43	11	0.9748	NA	21	52.8533	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
4	NC_016845.1|provirus_1778390_1811349	32960	Provirus	1778390-1811349	39	11	0.9664	NA	20	40.6072	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
5	NC_016845.1|provirus_4818868_4834971	16104	Provirus	4818868-4834971	25	11	0.9648	NA	7	24.4392	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
6	NC_016838.1	122799	No terminal repeats	NA	136	11	0.9562	NA	18	97.9193	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;

xinehc avatar Jun 24 '24 09:06 xinehc

Thank you for bringing this to my attention. This is an interesting case because it seems that at least a portion of the genes of this replicon are indeed of viral origin (lots of genes encoding tail proteins side-by-side), suggesting that this might be a hybrid element.

Can you check the chromosome, plasmid, and virus scores of these sequences in the <prefix>_aggregated_classification/<prefix>_aggregated_classification.tsv file? If the plasmid score is substantially higher than the chromosome score, I can evaluate adding a parameter to ignore proviruses identified within sequences classified as plasmid (maybe this could be enabled by default, if cases like this are common).

apcamargo avatar Jun 24 '24 09:06 apcamargo

NC_016838.1 has a higher plasmid_score than chromosome_score. In this case, should this sequence be classified as a plasmid if I don't care about viruses (given that viruses rarely carry AMR genes)?

seq_name        chromosome_score        plasmid_score   virus_score
NC_016845.1     0.6210  0.3022  0.0768
NC_016838.1     0.0098  0.0340  0.9562
NC_016846.1     0.0011  0.9952  0.0037
NC_016839.1     0.0016  0.9936  0.0048
NC_016840.1     0.0014  0.9941  0.0046
NC_016847.1     0.0057  0.9869  0.0075
NC_016841.1     0.0022  0.9894  0.0085
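
The comparison asked about above (ignore virus_score and compare only plasmid_score against chromosome_score) could be sketched roughly like this. It is a minimal illustration, not part of genomad: the function name classify_without_virus is hypothetical, and it assumes the aggregated classification file is tab-separated with the column names shown in the table.

```python
import csv
import io


def classify_without_virus(tsv_text):
    """Label each sequence as chromosome or plasmid, ignoring virus_score.

    Hypothetical helper: assumes tab-separated input with the columns
    seq_name, chromosome_score, plasmid_score and virus_score.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    results = []
    for row in reader:
        chrom = float(row["chromosome_score"])
        plasmid = float(row["plasmid_score"])
        # Pick whichever of the two non-virus scores is higher.
        label = "plasmid" if plasmid > chrom else "chromosome"
        # Plasmid/chromosome score ratio; guard against division by zero.
        ratio = plasmid / chrom if chrom > 0 else float("inf")
        results.append((row["seq_name"], label, round(ratio, 4)))
    return results


# Two rows copied from the table above.
example = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "NC_016845.1\t0.6210\t0.3022\t0.0768\n"
    "NC_016838.1\t0.0098\t0.0340\t0.9562\n"
)
results = classify_without_virus(example)
for name, label, ratio in results:
    print(name, label, ratio)
```

With real output, the text of <prefix>_aggregated_classification.tsv would be passed in instead of the two example rows. Whether a simple "higher score wins" rule is appropriate is a judgment call; a minimum ratio threshold may be safer.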

This situation is not very common: I classified 47306 AMR gene-carrying plasmid sequences (retrieved from PLSDB, RefSeq complete genomes, and IMG/PR), and only 152 were classified as viruses. The minimal plasmid/chromosome score ratio is 1.9637. Here are the classification summaries, if needed.

Archive.zip

xinehc avatar Jun 24 '24 10:06 xinehc

If you're not interested in viruses at all, you can just delete the <prefix>_find_proviruses directory and then run the end-to-end command with the --disable-find-proviruses parameter. No provirus will be detected in that sequence and it will be classified as a plasmid.
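
As a rough sketch of those two steps, the helper below removes the stale directory and builds the command line. It is only an illustration: the function name build_rerun_command and the example paths are hypothetical, and it assumes the usual argument order of input FASTA, output directory, database directory.

```python
import shutil
from pathlib import Path


def build_rerun_command(input_fasta, output_dir, database_dir, prefix):
    """Delete <prefix>_find_proviruses and return the end-to-end command.

    Hypothetical helper: it only prepares the command and does not run it.
    """
    stale = Path(output_dir) / f"{prefix}_find_proviruses"
    if stale.is_dir():
        # Remove the earlier provirus results before re-running.
        shutil.rmtree(stale)
    return [
        "genomad",
        "end-to-end",
        "--disable-find-proviruses",
        str(input_fasta),
        str(output_dir),
        str(database_dir),
    ]


# Example with placeholder paths (assumptions, not from the thread).
cmd = build_rerun_command("genome.fna", "genomad_output", "genomad_db", "genome")
print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the paths point at real files.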

If you have more cases like this, please share. I think it might make sense to disable provirus detection by default in cases where sequences have strong evidence of being plasmids.

apcamargo avatar Jun 24 '24 11:06 apcamargo

From experience, PLSDB (or at least the previous version of it) had a couple of actual phages in it (I couldn't find evidence of them being hybrid elements). This is not a problem with PLSDB itself, but is related to the fact that some submitters tag all secondary replicons as plasmids, and the error gets propagated to RefSeq.

apcamargo avatar Jun 24 '24 11:06 apcamargo

I just noticed that NC_016838 got a virus score higher than the plasmid score, so disabling provirus discovery won't help. I need to take a look at this manually. You can experiment with the ratio of marker enrichments, as you mentioned in the previous comment.

apcamargo avatar Jun 24 '24 11:06 apcamargo

Thanks for the suggestions and comments.

Yes, there are many mislabelled/misassembled plasmid sequences in RefSeq, and my goal was to remove these fake plasmids. NC_016838.1 is possibly misassembled somehow, despite being labelled by RefSeq as a reference genome. I will play around with the parameter to see whether this sequence should be kept.

Here are all the sequences classified as virus/provirus among the 47306 AMR gene-carrying (putative) plasmid sequences. I noticed that some IMG/PR sequences are also being classified as viruses, maybe due to a version upgrade.

plasmid_virus.fna.zip

xinehc avatar Jun 24 '24 12:06 xinehc

Thanks a lot! This will be very helpful.

I'll work on this soon and make a new release.

apcamargo avatar Jun 24 '24 20:06 apcamargo