Extended SMILES and mismatching identifiers in MSBNK-EPA-ENTACT_AGILENT001211 to MSBNK-EPA-ENTACT_AGILENT001216
Hi @alexchao32
Thanks for your new records in https://github.com/MassBank/MassBank-data/pull/244 :-)
We've just crunched the data for PubChem and one SMILES failed the deposition:
CC[N+](CC1=CC(=CC=C1)S(O)(=O)=O)=C1C=CC(C=C1)=C(C1C=CC=CC=1)C1C=CC(=CC=1)N(CC1C=C(C=CC=1)S(O)(=O)=O)CC |c:14,t:21|
Is there any reason for using the extended SMILES format (the |c:14,t:21| at the end)?
I found entries matching, or closely matching, the InChIKey and SMILES in both PubChem and CompTox, but see no evidence of an extended SMILES anywhere (I also asked @ChemConnector if he knew more). The DTXSID (DTXSID3020671) and the PubChem CID in the record (20803) actually point to a salt species.
The SMILES (without the extension at the end) and InChIKey in the record would point to CID 20804. We fixed the deposition to use the InChIKey and ignore the SMILES.
Ideally, we'd need to clean up the identifiers in this record - do we need the extended SMILES or can the |c:14,t:21| be trimmed? Should we update DTXSID and PubChem CID to match the InChIKey? The molecular formula and mass match the InChIKey.
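If trimming turns out to be acceptable, a minimal sketch of how it could be done is below. This is only an illustration, not what we ran for the deposition: the function name `strip_cx_extension` is made up, and it assumes the extension is always a single trailing `|...|` block separated from the core SMILES by whitespace, as it is in this record.

```python
import re

# Assumption: the extension is one trailing |...| block preceded by whitespace.
_CX_SUFFIX = re.compile(r"\s+\|[^|]*\|\s*$")


def strip_cx_extension(smiles: str) -> str:
    """Return the SMILES without a trailing |...| extension block, if any."""
    return _CX_SUFFIX.sub("", smiles.strip())


# SMILES as it appears in the affected records
record_smiles = (
    "CC[N+](CC1=CC(=CC=C1)S(O)(=O)=O)=C1C=CC(C=C1)=C(C1C=CC=CC=1)"
    "C1C=CC(=CC=1)N(CC1C=C(C=CC=1)S(O)(=O)=O)CC |c:14,t:21|"
)

print(strip_cx_extension(record_smiles))
```

Note that this simply drops whatever information the extension carries, so it only makes sense if we agree the plain SMILES is what the record should contain.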
The parent DTXSID5048001, however, has a different charge, and they do not seem to have an entry matching the InChIKey exactly.
(@meier-rene do we need to add new checks to the validation?)
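To make the question about validation a bit more concrete, here is a rough sketch of what such a check might look like. It is not taken from the existing validator: the function name `find_extended_smiles` is hypothetical, and the only thing it relies on is that the SMILES sits on a `CH$SMILES:` line of the record.

```python
import re

# Assumption: an extension shows up as a trailing |...| block after whitespace.
_CX_BLOCK = re.compile(r"\s\|[^|]*\|\s*$")
_FIELD = "CH$SMILES:"


def find_extended_smiles(record_text: str) -> list[str]:
    """Return the CH$SMILES values in a record that end with a |...| block."""
    flagged = []
    for line in record_text.splitlines():
        if line.startswith(_FIELD):
            value = line[len(_FIELD):].strip()
            if _CX_BLOCK.search(value):
                flagged.append(value)
    return flagged


# Tiny made-up record fragment, for illustration only
example = "CH$NAME: example\nCH$SMILES: CCO |c:1|\n"
print(find_extended_smiles(example))
```

A check like this would only flag the extended SMILES; catching the mismatch between SMILES, InChIKey, DTXSID and CID would need a separate lookup.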
These are the corresponding records:
MSBNK-EPA-ENTACT_AGILENT001211
MSBNK-EPA-ENTACT_AGILENT001212
MSBNK-EPA-ENTACT_AGILENT001213
MSBNK-EPA-ENTACT_AGILENT001214
MSBNK-EPA-ENTACT_AGILENT001215
MSBNK-EPA-ENTACT_AGILENT001216
Please let us know what you think and whether we should fix these, or if you'd prefer to submit new versions of these records.
Thanks, Emma