Discrepant results between similar Salamae genomes
Hello,
We've isolated some subsp. salamae strains for one of our projects. I have a few questions about the SISTR output for these isolates:
| sample | cgmlst_found_loci | cgmlst_matching_alleles | cgmlst_subspecies | o_antigen | serogroup | serovar | serovar_antigen | serovar_cgmlst | O antigen prediction | H1 antigen prediction(fliC) | H2 antigen prediction(fljB) | Predicted identification | Predicted antigenic profile | Predicted serotype | average_depth | snp_count | indel_count | N_count | reads_cov | Reference | Organism from Esmie | Salmonella genus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CQJ13L | 330 | 129 | salamae | - | B | II 1,4,12,[27]:a:z39\|II 4:a:z39 | II 1,4,12,[27]:a:z39\|II 4:a:z39 | II 1,4,[5],12,[27]:b:[e,n,x] | 4 | a | z39 | Salmonella enterica subspecies salamae (subspecies II) | 4:a:z39 | II [1],4,12,[27]:a:z39 | 70.3072 | 40928 | 0 | 0 | 89.86 | GCF_019339485.1 | Salmonella species | |
| CQJ127 | 330 | 134 | salamae | - | B | II B:-:e,n,x | II B:-:e,n,x | II 1,4,[5],12,[27]:b:[e,n,x] | 4 | z | e,n,x | Salmonella enterica subspecies salamae (subspecies II) | 4:z:e,n,x | II [1],4,12,27:z:e,n,x | 62.4964 | 40202 | 0 | 0 | 89.21 | GCF_019339485.1 | Salmonella Typhimurium | Salmonella typhimurium |
- Neither of these samples has a prediction in the o_antigen column, but when I blast them against your database, they have quite a good match (97% similarity, >99% coverage) to "304|584|1,4,12,27|B" from the wzx database. Is this match not good enough to call the O antigen? Or is there uncertainty about record 304?
- The two samples give quite similar results across most fields (and in my blast results against the wzx and wzy databases), but the output differs between them. How come?
Here are the fasta files, in case you want to dig in.
https://www.dropbox.com/scl/fi/ppkiflqyvtwrn28nah9s6/CQJ127_S25_L001.fna?rlkey=t19wktth1uguutnmqkuej3jq7&dl=0 https://www.dropbox.com/scl/fi/k4yg4iwo1k0mjdinyahx0/CQJ13L_S23_L001.fna?rlkey=woawgupve273z3b7y7tx2xipi&dl=0
Thanks,
Phil
Hello,
These are complex isolates to type, as their serovars are not summarized by a single name but rather by an antigenic profile. SISTR uses antigens, cgMLST and MASH (if selected) to provide a final serovar call, with the antigen results taking precedence over all other evidence.
The O antigen value reported in the o_antigen field is deduced from the serovar by a reverse lookup in the WHO table of known serovars, sistr/data/Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv. The most informative output format is JSON, selected with the -f json option, which provides all intermediate and reliability values. For both samples I would use the cgMLST serovar as the final serovar.
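To make the reverse lookup concrete, here is a minimal Python sketch of how a lookup keyed on the serovar string can come back empty. It is not SISTR's code: the column names in the header are assumptions, the helper names are hypothetical, and the single data row is the `II 4:a:z39` entry quoted further down in this reply.

```python
import csv
import io

# One-row excerpt in the layout of the WHO table. The header names are
# assumptions made for this sketch; the data row is the entry quoted in
# this thread.
TABLE_EXCERPT = '''Serovar,O_antigen,H1,H2,H2_other,Serogroup,can_h2_be_missing,subspecies
II 4:a:z39,"1,4,12,[27]",a,z39,,B,FALSE,salamae
'''


def load_table(text):
    """Map each serovar name to its O antigen formula."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Serovar"]: row["O_antigen"] for row in reader}


def lookup_o_antigen(serovar_call, table):
    """Look up every alternative of a call such as 'A|B'.

    Alternatives without an exact entry in the table map to None.
    """
    return {name: table.get(name) for name in serovar_call.split("|")}


table = load_table(TABLE_EXCERPT)

# Looking up the whole mixed call as one key finds nothing.
print(table.get("II 1,4,12,[27]:a:z39|II 4:a:z39"))

# Looking up each alternative separately finds the second one in this excerpt.
print(lookup_o_antigen("II 1,4,12,[27]:a:z39|II 4:a:z39", table))

# A call that is not a named serovar has no entry either.
print(lookup_o_antigen("II B:-:e,n,x", table))
```

With this excerpt, only `II 4:a:z39` resolves to an O antigen; the mixed call as a whole and `II B:-:e,n,x` do not, which is one plausible reading of why o_antigen shows `-` for both samples.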
Serovar prediction logic
- `CQJ127`: the final serovar was assigned from the O and H antigen allele database, as there was no good match between the cgMLST, MASH and O and H antigen BLAST results. Looking at the H2 antigen hits for the `fljB` gene, there is an almost perfect match to the `e,n,x` antigen. For the H1 antigen and the `fliC` gene there was also an almost perfect `e,n,x` hit, but after filtering against the antigen-to-serovar table `Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv` there are no serovars that have both H1=`e,n,x` and H2=`e,n,x`, so the H1 antigen was not assigned. Similarly, there is an almost perfect hit for the O antigen `1,4,12,27`, but again it is not reported because there is no match in the antigen-to-serovar table `Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv`. There seems to be an issue with the correct assignment of the H1 or H2 antigen. The expected H1 value could be `b,a,l,v` based on the antigen-to-serovar metadata. Please note that only 134/330 cgMLST alleles match at 100%, hinting that extra work might be needed to polish the assembly or that these isolates belong to a new serovar. The most probable WKLM serovar is `II 1,4,[5],12,[27]:b:[e,n,x]`; the only caveat is that the H1 antigen was not detected as `b`. Here are the predictions from the 3 sources:
  - antigens: `II B:-:e,n,x`
  - cgMLST: `II 1,4,[5],12,[27]:b:[e,n,x]`
  - MASH: `II 4,12:e,n,x:1,2,7` (based on the closest reference genome)
- `CQJ13L`: the final serovar was assigned from the O and H antigen allele database, and it is a mixed call, as indicated by the `|` symbol. This sample is of higher quality than `CQJ127`. The H1 antigen is clearly `a` and the H2 antigen is `z39`, which predicts the O antigen as `1,4,12,[27]`. The O antigen is reported as none due to the antigen-to-serovar table mismatch, but the top hit is the expected `1,4,12,27`. The cgMLST call in this case is 129/330, which is a weak predictor. Thus the antigen prediction is the most reliable. The final serovar is most probably `II 1,4,12,[27]:a:z39`. I checked `Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv` and there is a `II 4:a:z39,"1,4,12,[27]",a,z39,,B,FALSE,salamae` entry that is redundant in my opinion, giving this mixed antigenic call. Here are the predictions from the 3 sources:
  - antigens: `II 1,4,12,[27]:a:z39|II 4:a:z39` (the `|` means OR, just pick one)
  - cgMLST: `II 1,4,[5],12,[27]:b:[e,n,x]` (a complete miss in my opinion, due to less than 50% of cgMLST alleles matching)
  - MASH: `II 4,12:e,n,x:1,2,7` at a Mash distance of 0.00594242
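The filtering step described for both samples can be sketched as follows. This is a simplified illustration, not SISTR's implementation: the function names are hypothetical, the table is reduced to (serovar, H1, H2) tuples, and the second row is an illustrative entry derived from the cgMLST call rather than a row confirmed to be in the WHO table.

```python
# Hypothetical excerpt of the antigen-to-serovar table as (serovar, H1, H2).
# The first row is quoted in this thread; the second is illustrative only.
ROWS = [
    ("II 4:a:z39", "a", "z39"),
    ("II 1,4,[5],12,[27]:b:[e,n,x]", "b", "[e,n,x]"),
]


def strip_optional(antigen):
    """Drop the square brackets that mark optional factors."""
    return antigen.replace("[", "").replace("]", "")


def serovars_matching(h1, h2, rows):
    """Return the serovars whose H1 and H2 both equal the given hits."""
    return [
        name
        for name, row_h1, row_h2 in rows
        if strip_optional(row_h1) == h1 and strip_optional(row_h2) == h2
    ]


def assign_h1(h1_hit, h2_hit, rows):
    """Keep the H1 hit only if some serovar pairs it with the H2 hit."""
    return h1_hit if serovars_matching(h1_hit, h2_hit, rows) else "-"


# CQJ127: both genes hit e,n,x, no serovar has that pair, so H1 is dropped.
print(assign_h1("e,n,x", "e,n,x", ROWS))  # -

# CQJ13L: a with z39 is a known pair, so H1 is kept.
print(assign_h1("a", "z39", ROWS))  # a
```

Under these assumptions a near-perfect BLAST hit can still be left unreported when the resulting antigen combination has no counterpart in the table, which matches the `II B:-:e,n,x` call for `CQJ127`.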
Both samples belong to subspecies salamae, but their serovars are different. We provide the information from all sources of evidence so that end users can finalize the serovar prediction. We are currently working on release 1.1.3, which will be out soon and will provide more transparent serovar prediction logic messages in the log.
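When weighing the cgMLST call against the antigen call, it can help to turn the two cgMLST columns from the table above into a matching fraction. The sketch below is a user-side helper, not part of SISTR; the 50% cutoff is the rule of thumb mentioned in this reply rather than a documented SISTR threshold, and the function names are hypothetical.

```python
def cgmlst_match_fraction(matching_alleles, found_loci):
    """Fraction of found cgMLST loci whose allele matches exactly."""
    if found_loci <= 0:
        raise ValueError("found_loci must be positive")
    return matching_alleles / found_loci


def is_weak_cgmlst_call(matching_alleles, found_loci, cutoff=0.5):
    """Flag calls supported by less than `cutoff` of the loci (assumed cutoff)."""
    return cgmlst_match_fraction(matching_alleles, found_loci) < cutoff


# (cgmlst_matching_alleles, cgmlst_found_loci) taken from the table above.
samples = {"CQJ13L": (129, 330), "CQJ127": (134, 330)}

for name, (matching, found) in samples.items():
    fraction = cgmlst_match_fraction(matching, found)
    label = "weak" if is_weak_cgmlst_call(matching, found) else "ok"
    print(f"{name}: {matching}/{found} = {fraction:.1%} ({label})")
```

Both samples land at roughly 40%, below the cutoff, which is why the cgMLST serovar is treated with caution here.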
SISTR v1.1.2 results
SeqSero2 results for comparison
Input files: CQJ127_S25_L001.fna
O antigen prediction: 4
H1 antigen prediction(fliC): 1,2,7
H2 antigen prediction(fljB): e,n,x
Predicted identification: Salmonella enterica subspecies salamae (subspecies II)
Predicted antigenic profile: 4:1,2,7:e,n,x
Predicted serotype: II 4:1,2,7:e,n,x
Note: This predicted serotype is not in the Kauffman-White scheme.

Input files: CQJ13L_S23_L001.fna
O antigen prediction: 4
H1 antigen prediction(fliC): a
H2 antigen prediction(fljB): z39
Predicted identification: Salmonella enterica subspecies salamae (subspecies II)
Predicted antigenic profile: 4:a:z39
Predicted serotype: II [1],4,12,[27]:a:z39
Note:
WKLM scheme
Thanks very much, apologies for the slow response!
The issue is partially addressed in the new SISTR release, v1.1.3.
Great, thanks.
On Tue, 26 Nov 2024 at 20:33, Kirill Bessonov @.***> wrote:
Closed #57 https://github.com/phac-nml/sistr_cmd/issues/57 as completed.