psi-ms-CV
Add a `scan number` term
Describe the new term or terms you would like to add.
For a long time, several search engines, including MSFragger and Comet (correct me if I am wrong), have used a scan number to identify/index a scan/spectrum. Using a single integer is faster and more memory-efficient than using a string.
As far as I know, there are two major approaches to get a `scan number` from a spectrum's metadata:
- `index + 1`
- Extract the `scan number` from the `id`. For example, given a Thermo id `controllerType=0 controllerNumber=1 scan=5`, the `scan number` is 5.
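The two approaches above can be sketched roughly as follows. This is only an illustration, not the actual MSFragger or Comet code: the function names `scan_number_from_index` and `scan_number_from_id` are hypothetical, and the `id` is assumed to be a space-separated list of `key=value` fields as in the Thermo example.

```python
import re


def scan_number_from_index(index: int) -> int:
    """Approach 1: the 0-based spectrum index plus one."""
    return index + 1


def scan_number_from_id(spectrum_id: str) -> int:
    """Approach 2: read the integer that follows `scan=` in the spectrum id."""
    match = re.search(r"(?:^|\s)scan=(\d+)(?:\s|$)", spectrum_id)
    if match is None:
        raise ValueError(f"no scan= field in id: {spectrum_id!r}")
    return int(match.group(1))


print(scan_number_from_index(4))  # 5
print(scan_number_from_id("controllerType=0 controllerNumber=1 scan=5"))  # 5
```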
The first approach has issues: if the mzML file is a subset of the original mzML, the `index` is re-assigned starting from 0, which makes the `scan number` different from the one in the original mzML. A typical example: in FragPipe, MSFragger generates `_(un)calibrated.mzML` files containing only MS2 scans for downstream tools to use. Using `index+1` as the `scan number` causes problems there.
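A small made-up example of the mismatch, assuming a five-spectrum Thermo run where only the MS2 spectra are written to the subset file (the ids and MS levels below are invented for illustration):

```python
# Hypothetical original run: (spectrum id, MS level) in acquisition order.
original_ids_and_levels = [
    ("controllerType=0 controllerNumber=1 scan=1", 1),
    ("controllerType=0 controllerNumber=1 scan=2", 2),
    ("controllerType=0 controllerNumber=1 scan=3", 2),
    ("controllerType=0 controllerNumber=1 scan=4", 1),
    ("controllerType=0 controllerNumber=1 scan=5", 2),
]

# Keep only MS2 spectra; in the subset file the index restarts at 0.
ms2_subset = [sid for sid, level in original_ids_and_levels if level == 2]

# Scan numbers derived from the re-assigned index vs. from the id.
from_index = [index + 1 for index, _ in enumerate(ms2_subset)]
from_id = [int(sid.rsplit("scan=", 1)[1]) for sid in ms2_subset]

print(from_index)  # [1, 2, 3]
print(from_id)     # [2, 3, 5]
```

Only the values taken from the `id` still match the original file.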
The second approach works well for most Thermo data because the spectrum `id` is "1-D": only the `scan` changes in the `controllerType=0 controllerNumber=1 scan=N` format. We extract the `N` as the `scan number`. But for data from some other vendors, this approach doesn't work because multiple fields change in the spectrum `id`. For example, in `function=2 process=0 scan=1`, both `function` and `scan` change from scan to scan. For this reason, we discontinued support for Waters and SCIEX data.
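To show why extracting only `scan` breaks down here, a minimal sketch with invented ids in the `function=... process=... scan=...` format (the helper `parse_id` is hypothetical):

```python
# Made-up ids in the multi-field format, two functions with two scans each.
multi_field_ids = [
    "function=1 process=0 scan=1",
    "function=2 process=0 scan=1",
    "function=1 process=0 scan=2",
    "function=2 process=0 scan=2",
]


def parse_id(spectrum_id):
    """Split a space-separated key=value id into a dict of integers."""
    return {key: int(value) for key, value in
            (field.split("=") for field in spectrum_id.split())}


# Taking only `scan` gives duplicates, so it cannot identify a spectrum.
scan_only = [parse_id(sid)["scan"] for sid in multi_field_ids]
print(scan_only)  # [1, 1, 2, 2]

# All fields together are needed to tell the spectra apart.
full_keys = [tuple(parse_id(sid).values()) for sid in multi_field_ids]
print(len(set(full_keys)) == len(multi_field_ids))  # True
```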
Recently, several users (e.g., https://github.com/Nesvilab/MSFragger/issues/324 and https://x.com/michaellazear/status/1782905716896100437) have asked us to bring the support back. It would make life much easier if those data indexed the scans using a `1-D` schema. Thus, I propose adding a `scan number` term to be used by mzML, pepXML, and other XML-based files.
I hope my explanations are clear. Let me know if you have any questions.
Best,
Fengchao