Duplicate entries of identical MS2 spectra
Hi all,
I am working with the MS2 spectra in MassBank and noticed that some spectra occur multiple times in the database. I wrote a small script looking for entries with duplicate splash, and found for example the entries:
which not only show the exact same MS2 spectrum (intensity and m/z values), as they should based on the splash, but also the same metadata. I suppose this spectrum was submitted twice, as the chances of it being measured again with identical intensities and m/z values seem very slim to me.
To add to the confusion, there are also instances where the metadata is not identical. For example, the entries:
have the same splash, but are assigned to different isomeric precursor compounds. Here, the isomers differ in the position of a hydroxy group and in stereochemistry. Perhaps the contributors were unsure about the structural assignment? Yet this is not communicated anywhere on the display page. Moreover, this specific splash actually occurs 6 times in the database.
When looking through more examples I realized that there are cases where identical MS2’s may make sense, such as when there’s only 1 peak in the spectrum, e.g., the chlorine anion:
However, even for a 1-peak spectrum I would be suspicious of both m/z and intensity being exactly identical when the submissions originate from the same batch.
This is all to say, should these entries be cleaned up a bit?
Based on the most recent release (MassBank_RIKENformat.msp) I found a total of 2010 MS2 spectra that occur 2 or more times (up to 6), based on their splash. If the database were dereplicated, about 2151 entries could be removed.
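For anyone who wants to reproduce this kind of count, here is a minimal sketch of the bookkeeping, assuming the splash (or any other key) has already been extracted for each entry. The function name `duplicate_summary` and the toy keys are made up for illustration; this is not the script that produced the numbers above.

```python
from collections import Counter


def duplicate_summary(keys):
    """Return (number of duplicate sets, number of removable entries).

    `keys` holds one identifier per database entry (e.g. a splash string).
    A duplicate set is a key that occurs two or more times; dereplication
    keeps one entry per set, so a set of size n contributes n - 1 removals.
    """
    counts = Counter(keys)
    duplicate_sets = {key: n for key, n in counts.items() if n >= 2}
    removable = sum(n - 1 for n in duplicate_sets.values())
    return len(duplicate_sets), removable


# Toy input: "a" occurs twice, "c" three times, "b" once.
print(duplicate_summary(["a", "a", "b", "c", "c", "c"]))
```

The second number is larger than the first whenever some keys occur more than twice, which is why the count of removable entries exceeds the count of duplicated spectra.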
I would be happy to supply a list of the duplicates and look into some more examples if that helps the discussion. Let me know!
Cheers, Kas
Dear @kashout, Thanks so much for tracking this down. Indeed, a list of suspected spectra is welcome to improve the quality of MassBank. With this list, we can curate the spectra. Usually, we ask the contributors to fix / clean their spectra.
Case 1 makes sense, we can deprecate such records.
Case 2 is a wrong annotation. 4-aminopyridine is a totally different compound with a different structure than o-desmethyltramadol.
Case 3 makes sense, but an m/z 34 fragment is obviously not very diagnostic.
@schymane and @sneumann, I suggest splitting Kas' list by contributor and opening one issue per contributor. I can take care of this.
Best Tobi
Hi @tsufz,
Here's the list: duplicate_spectra_info.csv and the code to reproduce it: gist.
Some additional remarks. Earlier I mentioned that the duplicates were assigned based on their splash. This is not actually the approach I ended up using. Since the splash is based on the spectra with normalized intensity, you will get a lot more false-positive duplicates originating from 1-peak spectra, as they only need to have identical m/z. For example:
Based on splash alone, you will end up with 3053 sets of duplicates, whereas the safer approach based on hashes of the 'raw' spectra yields the 2010 sets of duplicates contained in the .csv file above.
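To illustrate the idea of hashing the 'raw' spectrum, here is a minimal sketch. It is not the code from the gist: the function name `raw_spectrum_hash`, the choice of SHA-256, and the fixed number of decimals are all assumptions made for this example.

```python
import hashlib


def raw_spectrum_hash(peaks, decimals=6):
    """Hash a peak list using the m/z and intensity values as submitted.

    `peaks` is an iterable of (mz, intensity) pairs. Peaks are sorted so the
    result does not depend on peak order, and no intensity scaling is applied,
    so two spectra only collide if every value matches at `decimals` places.
    """
    canonical = ";".join(
        f"{mz:.{decimals}f}:{intensity:.{decimals}f}"
        for mz, intensity in sorted(peaks)
    )
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()


# Two 1-peak spectra with the same m/z but different absolute intensity
# (made-up values) get different hashes here.
weak = raw_spectrum_hash([(34.969, 100.0)])
strong = raw_spectrum_hash([(34.969, 999.0)])
print(weak == strong)
```

With a key like this, 1-peak spectra are only reported as duplicates when the absolute intensity matches as well, which is the behaviour that separates the two counts above.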
And a heads up, a substantial number of spectra from the RIKEN_IMS submissions appear not only to be duplicated across entries, but also to be profile spectra rather than centroided. I am planning to open a separate issue regarding this either today or early next week.
Please let me know if I can help further with the list or if anything needs clarifying.
Cheers,
Kas
@meier-rene @sneumann is it true that the SPLASH in MassBank is calculated on the normalized intensity? This should not be the case; we should be using the absolute numbers, otherwise we will indeed have a much higher potential for clashes, and the SPLASH is not serving its purpose. @meier-rene is this checked / verified in the validation?
@kashout thanks for submitting the list. I will take care of it. But let's wait for the response of @meier-rene on https://github.com/MassBank/MassBank-data/issues/346#issuecomment-3501608049
Some code to do so.
Best Tobias
In some cases, similar splashes cannot be avoided.
For example: MSBNK-Athens_Univ-AU590404 MSBNK-Athens_Univ-AU590904
These are quite similar compounds resulting in exact same fragmentation. The records are correct. They only can be distinguished by retention time, not by their MS.
Hi all, I'm slowly crawling through all the issues and requests we got in the last week. I have one short comment/hint here for the splash question. No, we don't use normalized intensities in the splash. But the splash seems to normalize internally, because all variations of a spectrum in which the intensities are multiples of each other lead to the same splash.
I just checked the code on SPLASH website. The peaks are scaled to the maximal peak intensity during calculation of the third block.
Hence, I will go ahead to prepare the issues for record review.
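The effect of that scaling step can be sketched as follows. This is only an illustration of base-peak scaling, not the SPLASH reference implementation; the function name `scale_to_base_peak`, the target value of 100, and the peak values are assumptions.

```python
def scale_to_base_peak(peaks, top=100.0):
    """Scale intensities so that the most intense peak equals `top`.

    `peaks` is a list of (mz, intensity) pairs. Any identifier computed
    after this step cannot tell apart spectra whose intensities differ
    only by a constant factor.
    """
    base = max(intensity for _, intensity in peaks)
    return [(mz, round(intensity / base * top, 6)) for mz, intensity in peaks]


spectrum = [(50.0, 10.0), (75.0, 40.0)]
tripled = [(mz, 3 * intensity) for mz, intensity in spectrum]

# Both variants collapse to the same scaled peak list.
print(scale_to_base_peak(spectrum) == scale_to_base_peak(tripled))

# For a 1-peak spectrum the single peak always becomes `top`,
# so only the m/z value is left to distinguish two records.
print(scale_to_base_peak([(34.969, 5.0)]) == scale_to_base_peak([(34.969, 999.0)]))
```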
@tsufz I think this example (MSBNK-Athens_Univ-AU590404 and MSBNK-Athens_Univ-AU590904) is still quite odd. I agree that these compounds likely have very similar MS2 spectra, but for them to be fully identical? Even on back-to-back injections I don't think that happens. Moreover, their retention times also appear to be identical to 3 decimal places. Are you sure then that the records are correct?
I found about 100 more of these examples, with identical MS2 and RT but different InChIKeys. I doubt that it could be this prevalent.
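A check along these lines can be sketched as below, assuming each record has already been parsed into a dict. The field names `peaks`, `rt` and `inchikey`, the function name `conflicting_groups`, and the placeholder keys are hypothetical and not taken from MassBank or from the gist.

```python
from collections import defaultdict


def conflicting_groups(records, rt_decimals=3):
    """Find InChIKeys that share an identical peak list and retention time.

    `records` is an iterable of dicts with hypothetical fields:
    "peaks" (list of (mz, intensity) pairs), "rt" (float) and "inchikey".
    Returns one sorted list of InChIKeys per group that contains more than
    one distinct key.
    """
    groups = defaultdict(set)
    for record in records:
        key = (tuple(sorted(record["peaks"])), round(record["rt"], rt_decimals))
        groups[key].add(record["inchikey"])
    return [sorted(keys) for keys in groups.values() if len(keys) > 1]


# Made-up records: the first two share spectrum and RT but not the compound.
records = [
    {"peaks": [(50.0, 10.0), (75.0, 40.0)], "rt": 4.321, "inchikey": "KEY-A"},
    {"peaks": [(50.0, 10.0), (75.0, 40.0)], "rt": 4.321, "inchikey": "KEY-B"},
    {"peaks": [(50.0, 10.0), (75.0, 40.0)], "rt": 5.000, "inchikey": "KEY-C"},
]
print(conflicting_groups(records))
```

Groups flagged this way still need manual review, since the flag alone does not say which of the annotations is the wrong one.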
Hi, thanks for raising this. My first guess was that this is about single-peak spectra from nominal mass instruments, where such behavior would be absolutely correct. But the first example I checked was a pair of rich spectra of quite different records: https://massbank.eu/MassBank/search?splash=splash10-00e9-0900000000-043f11c9eaa4a5da3398
I can see several reasons for this to appear:
- The MS/MS extraction code of contributors chose the same MS/MS for two different compounds.
- There were renaming issues, where e.g. XX0001 was originally submitted, and later corrected to be a different compound and re-submitted as XX0002, without removing/deprecating XX0001. Not the most likely case.
- The spectra are really identical up to the 3rd decimal. I think we can rule this one out :-)
I would guess that the reasons are similar within submissions by one lab, so we can start hunting those down.
Yours, Steffen
Hi, I agree with your assessment Steffen that it's probably best for the individual contributors to dereplicate and use their knowledge of how those spectra were submitted! Tobias, I see you've started opening individual issues. Thanks a lot for undertaking this unpleasant task!
@meier-rene @sneumann @schymane To circle back one last time to the splash discussion. Here's an excerpt from the original paper:
Although the mapping from object to hash should ideally be unique, hash collisions (where two totally different objects have the same hash, or fourth block of the SPLASH) may occur, depending on the hash algorithm and length of the hash string. Testing the fourth block for hash collisions on the full data set of 53,250,921 spectra (563,902 from the validation set and 52,687,019 from BinBase [14]) revealed that identical SPLASHes arose only from mass spectra containing a single ion of the same mass, where the SPLASH is identical by definition due to intensity normalization. The theoretical probability for a collision [15] with any given hash is approximately 10^-31 for a database containing 10^9 spectra and is further reduced by the presence of two preceding spectral summary blocks. Thus, the SPLASH fulfills its role as a unique identifier while offering simple summary and searching functionality.
My interpretation is that the normalization choice was deliberate, and I can see the practical benefits. But given that, relying only on the splash for future curation of duplicates is not an option. Perhaps we need a new standard for MS2 hashing...
Cheers, Kas
Hi @kashout, Another issue worth looking into is the decomposition of mass spectra and the optimal collision energy range, as shown in our analysis. This can differ considerably from machine to machine, even with similar settings.
An example is MSBNK-BAFG-CSL23111017229 and MSBNK-BAFG-CSL23111017224 of 2,4-D with CE 120 V and 140 V, respectively.
The compound is definitely decomposed, and there is no chance of getting more fragments because it's at the lowest edge anyway.
Best Tobias
Hi @sneumann, In the case shown, there are two possible interpretations: 1. isobaric compounds or 2. a wrong annotation. I think it's the latter. I looked into other submissions. In both cases, I found UFZ spectra that show quite different retention times for those compounds, and thus I would rule out the isobaric explanation.
@kashout, another line of evidence to hunt down. We may collect a list of known isobaric substances from some contributors to support interpretation. At least Martin at UFZ maintains this.
Best Tobias
@tsufz, I've been going through some more of the duplicates and I found too many edge cases to cover everything with one generic script to detect 'real' duplicates. I think a better approach would be to do this on the contributor level. I hope to have some time next week to check if the lists that went to the individual contributors are complete or can be pruned a bit.