IM peak picking
Description
This PR introduces a new representation type for ion mobility (IM) data in mass spectra within OpenMS. The new representation is tailored to peak data that has been centroided in the ion mobility dimension, which requires handling distinct from other types of IM data.
For example, if we perform centroiding (peak picking) in the IM dimension (of mobilograms), we collapse the IM frame (as discussed with Mohammed) into a single spectrum (e.g., sorted by m/z) plus an IM float data array.
The PR now also contains Mohammed's algorithm.
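To make the intended layout concrete, here is a minimal sketch (illustrative only, not the PR's actual code; all peak values are made up) of how such an IM-centroided frame could be assembled as a single `MSSpectrum` with a per-peak IM float data array:

```cpp
// Illustrative sketch of the IM-centroided representation described above
// (not the PR's implementation); m/z, intensity and IM values are made up.
#include <OpenMS/KERNEL/MSSpectrum.h>

using namespace OpenMS;

MSSpectrum makeCentroidedIMFrame()
{
  MSSpectrum frame;
  MSSpectrum::FloatDataArray im; // one IM value per centroided peak
  im.setName("Ion Mobility");

  auto add = [&](double mz, float intensity, float mobility)
  {
    Peak1D p;
    p.setMZ(mz);
    p.setIntensity(intensity);
    frame.push_back(p);
    im.push_back(mobility);
  };

  // peaks sorted by m/z; each peak carries its own IM centroid
  add(400.12, 1200.0f, 0.87f);
  add(512.30,  800.0f, 1.02f);
  add(621.45,  450.0f, 0.95f);

  frame.getFloatDataArrays().push_back(im);
  frame.sortByPosition(); // keeps the float data array in sync with the peaks
  return frame;
}
```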
Checklist
- [ ] Make sure that you are listed in the AUTHORS file
- [ ] Add relevant changes and new features to the CHANGELOG file
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] New and existing unit tests pass locally with my changes
- [ ] Updated or added python bindings for changed or new classes (Tick if no updates were necessary.)
How can I get additional information on failed tests during CI?
If your PR is failing, you can check out:
- The details of the action statuses at the end of the PR or the "Checks" tab.
- http://cdash.seqan.de/index.php?project=OpenMS and look for your PR. Use the "Show filters" capability on the top right to search for your PR number. If you click on the column that lists the failed tests, you will get detailed error messages.
Advanced commands (admins / reviewers only)
- `/reformat` (experimental) applies the clang-format style changes as an additional commit. Note: your branch must have a different name (e.g., yourrepo:feature/XYZ) than the receiving branch (e.g., OpenMS:develop). Otherwise, reformat fails to push.
- Setting the label "NoJenkins" will skip tests for this PR on Jenkins (saves resources, e.g., on edits that do not affect tests).
- Commenting with `rebuild jenkins` will retrigger Jenkins-based CI builds.

:warning: Note: Once you have opened a PR, try to minimize the number of pushes to it, as every push triggers CI (automated builds and tests) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).
Some questions + comments.
- Is a specific peak picker for ion mobility required, or can we just use an existing peak picker with different settings? @alhigaylan might know best?
- As an extension of this (slightly off topic): would it make sense to have a single peak picker under MATH, or maybe even a new folder like PEAK_PICKING? This would prevent confusing use cases like the earlier hack of applying PeakPickerHiRes to RT peak picking.
- I am having some difficulty understanding how the centroided data differs from the concatenated data.
- Should the centroiding of IM somehow be tied to m/z centroiding? E.g., would we ever have an instance where IM is centroided and m/z is not?
> Some questions + comments.
>
> Is a specific peak picker for ion mobility required, or can we just use an existing peak picker with different settings? @alhigaylan might know best?

@alhigaylan uses the same algorithm as PeakPickerHiRes, IIRC. I am currently looking into ion mobility trace detection (using the algorithm from MassTraceExtractor). So technically, we can reuse existing low-level algorithms, but depending on the data (e.g., MS1 vs. MS2), different algorithms might fit better. For other developers, it might be beneficial to provide distinct classes (think: similar but not identical interfaces) for m/z, chromatogram, or IM peak picking, mainly because different data types require different checks for valid data (e.g., profile data for m/z picking) and the number of implementations might differ between data modalities.

> As an extension of this (slightly off topic): would it make sense to have a single peak picker under MATH, or maybe even a new folder like PEAK_PICKING? This would prevent confusing use cases like the earlier hack of applying PeakPickerHiRes to RT peak picking.

This could be one way. E.g., we could think about having PeakPickerProfileSpectrum, PeakPickerChromatogram, and PeakPickerIMFrame classes in such a folder and move the generic code (e.g., the template) from PeakPickerHiRes to a PeakPicker class. Or, as in the screenshot, in the CENTROIDING folder. This might be cleaner but would change some Python bindings etc.

> I am having some difficulty understanding how the centroided data differs from the concatenated data.

I am also not 100% sure (ping @cbielow), but my impression was that CONCATENATED is always the full frame stored in a single spectrum via concatenation of individual scans. The concatenated data is much larger than the IM-centroided one because IM values in the float data array are repeated for all peaks of a scan. After IM centroiding, each resulting m/z + IM peak can have a different IM value, and the spectrum is usually sorted by m/z (I think this is not true for CONCATENATED, as the scans are basically just appended one after another, starting at low m/z values again).

> Should the centroiding of IM somehow be tied to m/z centroiding? E.g., would we ever have an instance where IM is centroided and m/z is not?

Excellent point, and I was also wondering about this today. I think we could make a decision here and assume that we only pick in the IM dimension if the m/z dimension is already picked.
Or we just add a parameter to PeakPickerHiRes to also pick the IM dimension. Less composable, but maybe easier for users?
Thank you @timosachsenberg for starting this up.
Here is my take on some of the points discussed here.
> Some questions + comments.
>
> Is a specific peak picker for ion mobility required, or can we just use an existing peak picker with different settings? @alhigaylan might know best?
>
> @alhigaylan uses the same algorithm as PeakPickerHiRes, IIRC. I am currently looking into ion mobility trace detection (using the algorithm from MassTraceExtractor). So technically, we can reuse existing low-level algorithms, but depending on the data (e.g., MS1 vs. MS2), different algorithms might fit better. For other developers, it might be beneficial to provide distinct classes (think: similar but not identical interfaces) for m/z, chromatogram, or IM peak picking, mainly because different data types require different checks for valid data (e.g., profile data for m/z picking) and the number of implementations might differ between data modalities.
I agree that PeakPickerHiRes is general enough to be used for ion mobility data. Still, we need to do some pre-processing (for example, resampling and smoothing) to make the data applicable for HiRes.
I also want to flag that people may have different objectives for peak picking. My understanding of centroiding is to retain as much data as possible, but in a reduced, useful format. For instance, what do you do when a mobilogram contains only 2 raw peaks? I believe that in the context of centroiding those peaks should be retained, because the overall objective is further downstream analysis (mass trace detection across RT, peptide feature detection, etc.). But if a peak picker is to be used for quantification or detection, as in the OpenSWATH context, do we want that hypothetical mobilogram with only 2 raw peaks to be picked?
Since the difference boils down to parameter settings, we can explain this in the documentation for PeakPickerIM. Maybe the PeakPickerIM default parameters should be strict and pick 'good'-looking peaks. I think we should have a separate centroiding algorithm that uses PeakPickerIM internally with settings that allow it to pick nearly all peaks.
My current mobilogram pre-processing setup is linear resampling followed by Gaussian smoothing. This ensures PeakPickerHiRes retains patchy peaks; a sketch of this chain follows below.
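For illustration, here is a sketch of that pre-processing chain under assumed parameter values (not the PR's actual defaults), treating a mobilogram as an `MSSpectrum` whose position axis holds ion mobility instead of m/z:

```cpp
// Sketch of the pre-processing chain described above (assumed parameter
// values, not the PR's actual defaults). The mobilogram is treated as an
// MSSpectrum whose position axis holds ion mobility instead of m/z.
// Note: include paths may differ between OpenMS versions.
#include <OpenMS/KERNEL/MSSpectrum.h>
#include <OpenMS/FILTERING/SMOOTHING/GaussFilter.h>
#include <OpenMS/FILTERING/TRANSFORMERS/LinearResampler.h>
#include <OpenMS/TRANSFORMATIONS/RAW2PEAK/PeakPickerHiRes.h>

using namespace OpenMS;

MSSpectrum pickMobilogram(MSSpectrum mobilogram)
{
  // 1) put the patchy raw mobilogram onto an equidistant grid
  LinearResampler resampler;
  Param rp = resampler.getParameters();
  rp.setValue("spacing", 0.001); // assumed spacing in 1/K0 units
  resampler.setParameters(rp);
  resampler.raster(mobilogram);

  // 2) smooth so that the picker sees a continuous profile-like shape
  GaussFilter gauss;
  Param gp = gauss.getParameters();
  gp.setValue("gaussian_width", 0.01); // assumed width in IM units
  gauss.setParameters(gp);
  gauss.filter(mobilogram);

  // 3) centroid with the generic high-resolution picker
  PeakPickerHiRes picker;
  Param pp = picker.getParameters();
  pp.setValue("signal_to_noise", 0.0); // pick nearly all peaks (centroiding use case)
  picker.setParameters(pp);

  MSSpectrum picked;
  picker.pick(mobilogram, picked);
  return picked;
}
```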
> I am having some difficulty understanding how the centroided data differs from the concatenated data.
>
> I am also not 100% sure (ping @cbielow), but my impression was that CONCATENATED is always the full frame stored in a single spectrum via concatenation of individual scans. The concatenated data is much larger than the IM-centroided one because IM values in the float data array are repeated for all peaks of a scan. After IM centroiding, each resulting m/z + IM peak can have a different IM value, and the spectrum is usually sorted by m/z (I think this is not true for CONCATENATED, as the scans are basically just appended one after another, starting at low m/z values again).
It seems like "concatenated" is how diapysef / msconvert writes PASEF data in mzML format: a single MSSpectrum object is one frame, with all of its ion mobility scans written as one array sorted by m/z (and ion mobility stored in FloatDataArrays).
> Should the centroiding of IM somehow be tied to m/z centroiding? E.g., would we ever have an instance where IM is centroided and m/z is not?
>
> Excellent point, and I was also wondering about this today. I think we could make a decision here and assume that we only pick in the IM dimension if the m/z dimension is already picked.
In my approach, projecting all peaks in one PASEF frame onto the m/z dimension addresses this problem. In this step, we merge all ion mobility scans into one array to generate a spectrum that exhibits a raw profile peak shape. Unlike the 'concatenated' PASEF frame, we sum up the intensities of repeated m/z peaks (repeated because they are recorded at different ion mobility values). PeakPickerHiRes identifies the m/z centroids from this step. Subsequently, each centroided m/z peak has its mobilogram extracted and peak picked. Based on the FWHM of the picked ion mobility peak, we can revisit the raw data and compute an intensity-weighted m/z centroid within its ion mobility boundaries. This is relevant in cases where one m/z centroid ends up with two or even three mobilogram peaks.
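A toy helper (hypothetical; the names and structure are assumptions, not the PR's code) illustrating that last step, the intensity-weighted m/z centroid computed from the raw points inside the picked IM boundaries:

```cpp
// Toy helper (hypothetical, not the PR's code) for the last step described
// above: intensity-weighted m/z centroid over the raw points that fall inside
// the IM boundaries of one picked mobilogram peak.
#include <vector>

struct RawIMPoint
{
  double mz;
  double im;
  double intensity;
};

double weightedMzCentroid(const std::vector<RawIMPoint>& raw_points,
                          double im_left, double im_right)
{
  double weighted_sum = 0.0;
  double intensity_sum = 0.0;
  for (const auto& p : raw_points)
  {
    if (p.im < im_left || p.im > im_right) continue; // outside the IM FWHM boundaries
    weighted_sum += p.mz * p.intensity;
    intensity_sum += p.intensity;
  }
  return (intensity_sum > 0.0) ? weighted_sum / intensity_sum : 0.0;
}
```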
> As an extension of this (slightly off topic): would it make sense to have a single peak picker under MATH, or maybe even a new folder like PEAK_PICKING? This would prevent confusing use cases like the earlier hack of applying PeakPickerHiRes to RT peak picking.
>
> This could be one way. E.g., we could think about having PeakPickerProfileSpectrum, PeakPickerChromatogram, and PeakPickerIMFrame classes in such a folder and move the generic code (e.g., the template) from PeakPickerHiRes to a PeakPicker class. Or, as in the screenshot, in the CENTROIDING folder. This might be cleaner but would change some Python bindings etc.
I like this idea
> I am having some difficulty understanding how the centroided data differs from the concatenated data.
>
> I am also not 100% sure (ping @cbielow), but my impression was that CONCATENATED is always the full frame stored in a single spectrum via concatenation of individual scans. The concatenated data is much larger than the IM-centroided one because IM values in the float data array are repeated for all peaks of a scan. After IM centroiding, each resulting m/z + IM peak can have a different IM value, and the spectrum is usually sorted by m/z (I think this is not true for CONCATENATED, as the scans are basically just appended one after another, starting at low m/z values again).
>
> It seems like "concatenated" is how diapysef / msconvert writes PASEF data in mzML format: a single MSSpectrum object is one frame, with all of its ion mobility scans written as one array sorted by m/z (and ion mobility stored in FloatDataArrays).
OK, makes sense. It seems that the centroided data is, in a way, processed concatenated data.
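For contrast with the centroided representation sketched in the PR description above, here is an illustrative sketch (made-up values, not the PR's code) of the concatenated layout as described in this thread: all scans of one frame in a single m/z-sorted array, where each peak carries the IM value of the scan it came from, so IM values repeat many times.

```cpp
// Illustrative sketch (made-up values, not the PR's code) of a CONCATENATED
// frame as discussed above: all scans of one frame in a single m/z-sorted
// array, where each peak carries the IM value of the scan it came from, so the
// same IM value repeats for every peak of that scan.
#include <OpenMS/KERNEL/MSSpectrum.h>

using namespace OpenMS;

MSSpectrum makeConcatenatedFrame()
{
  MSSpectrum frame;
  MSSpectrum::FloatDataArray im;
  im.setName("Ion Mobility");

  auto add = [&](double mz, float intensity, float mobility)
  {
    Peak1D p;
    p.setMZ(mz);
    p.setIntensity(intensity);
    frame.push_back(p);
    im.push_back(mobility);
  };

  // peaks from different IM scans interleaved along m/z; the IM value of a
  // scan repeats for each of its peaks (this is what inflates the data size)
  add(400.10, 100.0f, 0.87f); // scan at IM 0.87
  add(400.11, 120.0f, 0.88f); // scan at IM 0.88
  add(512.30,  80.0f, 0.87f); // scan at IM 0.87 again
  add(512.31,  95.0f, 0.88f);
  add(621.40,  40.0f, 0.87f);

  frame.getFloatDataArrays().push_back(im);
  return frame;
}
```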
- I will add some class/file stubs for PeakPickerIM and tests to this PR today.
- We still need to find a reliable way to determine whether data is in IM-centroided format. The easiest would be to assume that we always do the IM picking ourselves, so we could just annotate it at the spectrum level. But if we load mzMLs, we don't have this information available, and I did not find a CV term in the PSI-MS ontology to annotate it. In that case we would need to determine it from the data itself (e.g., if a third-party tool did the IM centroiding) or from a custom meta value that we store in the mzML. Then we could do something similar to the PeakTypeEstimator by looking for either the meta value or, if it is missing, at the peak patterns in the data arrays (a possible heuristic is sketched below).
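One possible heuristic along those lines, purely an assumption for discussion and not an existing OpenMS function: in a concatenated frame, only a small number of distinct IM values occur (one per scan), whereas after IM centroiding almost every peak carries its own IM value, so the fraction of distinct IM values per peak could be used as a crude indicator.

```cpp
// Hypothetical heuristic for discussion (an assumption, not an existing
// OpenMS function): a low ratio of distinct IM values to the number of peaks
// hints at concatenated (i.e. not yet IM-centroided) data.
#include <OpenMS/KERNEL/MSSpectrum.h>
#include <set>

using namespace OpenMS;

bool looksIMCentroided(const MSSpectrum& frame, double min_distinct_fraction = 0.5)
{
  if (frame.getFloatDataArrays().empty() || frame.empty()) return false;
  const auto& im = frame.getFloatDataArrays().front(); // assume the first array holds IM
  if (im.size() != frame.size()) return false;

  std::set<float> distinct(im.begin(), im.end());
  const double fraction = static_cast<double>(distinct.size())
                          / static_cast<double>(im.size());
  return fraction >= min_distinct_fraction; // threshold is a guess, for illustration only
}
```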
@alhigaylan, I added a stub for discussion; there you could, e.g., also add your code as a separate static member function.
Draft is ready from my side and open for comments / suggestions.
parallelized
OK, wow. Previously I stopped it after 30 h; now it finishes in 13 min (80 threads + new optimizations).
I did a quick experiment to see what happens if we run it on a larger dataset and then feed the output (with picked IM) into OpenSWATH. We now get:
```
Progress of 'Retention time normalization':
# mz regression parameters: Y = -8.64506 + 0.00952783 X + -3.77567e-06 X^2
sum residual sq ppm before 186216 / after 116750
# im regression parameters: Y = -0.0183206 + 1.01348 X + 0 X^2
-- done [took 20:08 m (CPU), 31.04 s (Wall)] --
Will analyze 1186380 transitions in total.
Progress of 'Extracting and scoring transitions':
Setting up nested loop with 1 threads out of 40
...
...
Thread 0_21 will analyze 99678 compounds and 598068 transitions from SWATH 2 (batch 33 out of 398)
This should never happen, pr_it has advanced past the master container: 0.93191 / 0.93169
---------------------------------------------------
FATAL: uncaught exception!
---------------------------------------------------
last entry in the exception handler:
exception of type OutOfRange occurred in line 100, function static void OpenMS::IonMobilityScoring::alignToGrid_(const OpenMS::Mobilogram&, const std::vector<double>&, OpenMS::Mobilogram&, double, OpenMS::Size&) of /beegfs/HPCscratch/sachsenb/OpenMS/src/openms/source/ANALYSIS/OPENSWATH/IonMobilityScoring.cpp
error message: the argument was not in range
---------------------------------------------------
..//./scripts/run_osw.sh: line 183: 341508 Aborted
```
Any ideas?
- Maybe sorting?
- Maybe the extra float data arrays that are not expected (checked, not the case)
Meta data array:
IM FWHM: 4 spectra
Ion Mobility: 4 spectra
MZ FWHM: 4 spectra
The issue seems to be how IonMobilityScoring::alignToGrid_ expects the data. If I disable ion mobility scores, it runs through!
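If sorting is the suspect, a quick sanity check like the following (illustrative only; not part of the PR) could be run over the picked output before feeding it into OpenSWATH: verify that every frame is globally m/z-sorted and that each float data array still has one entry per peak.

```cpp
// Small sanity check related to the "maybe sorting?" hypothesis above
// (illustrative only, not part of the PR): report frames that are not
// globally m/z-sorted or whose float data arrays are out of sync.
#include <OpenMS/KERNEL/MSExperiment.h>
#include <iostream>

using namespace OpenMS;

void checkFrames(const MSExperiment& exp)
{
  for (Size i = 0; i < exp.size(); ++i)
  {
    const MSSpectrum& s = exp[i];
    if (!s.isSorted())
    {
      std::cout << "Spectrum " << i << " is not sorted by m/z\n";
    }
    for (const auto& fda : s.getFloatDataArrays())
    {
      if (fda.size() != s.size())
      {
        std::cout << "Spectrum " << i << ": data array '" << fda.getName()
                  << "' has " << fda.size() << " entries for "
                  << s.size() << " peaks\n";
      }
    }
  }
}
```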
Seems to be quite fast now. With the more conservative peak picking by Mohammed: IM peak picking ~12 min; OSW workflow with library-free ~21 min (was 2 h 20 min).
But it is less sensitive (as it is mainly optimized for MS1 right now): total peakgroups (q-value < 0.01): 2151 (vs. 3946); total peakgroups (q-value < 0.05): 2851 (vs. 4961).
XGBoost failed with Error: The estimated pi0 <= 0. Check that you have valid p-values or use a different range of lambda. Current lambda range: [0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45]
With simple peak clustering: IM peak picking ~12 min; OSW workflow with library-free ~41 min (was 2 h 20 min).
LDA: total peakgroups (q-value < 0.01): 3159 (vs. 3946); total peakgroups (q-value < 0.05): 4057 (vs. 4961).
XGBoost: total peakgroups per run (q-value < 0.01): 4337 (vs. 4321); total peakgroups (q-value < 0.05): 5666 (vs. 5293).
With trace detection: IM peak picking ~5 min. LDA: total peakgroups (q-value < 0.01): 2199; total peakgroups (q-value < 0.05): 3000.
XGBoost: total peakgroups (q-value < 0.01): 5766; total peakgroups (q-value < 0.05): 7814.
The XGBoost numbers are a bit too good, so take this with a grain of salt.
Tested EPD; those were sensitive parameters.
Will implement this in PeakPickerIM soon.
RT axis = IM
Superseded by new PR #8177.