Enhancements in MSExperiment and MSSpectrum classes
User description
Description
Adds faster, simpler and customizable extract and aggregate functions to MSExperiment.
TODOs:
- Decide on final API. Probably templatize in/outputs of aggregation functions, such that you can extract a Chromatogram object instead of just std::vectors or doubles
- More docs
- Add support for grouped extractions with a third aggregation function on how to combine extractions in that group (basically enabling full OpenSwath "feature finding" pipeline). Benefits from extra scheduling of group aggregation threads to start aggregating once a group was fully extracted
Checklist
- [ ] Make sure that you are listed in the AUTHORS file
- [ ] Add relevant changes and new features to the CHANGELOG file
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] New and existing unit tests pass locally with my changes
- [ ] Updated or added python bindings for changed or new classes (Tick if no updates were necessary.)
How can I get additional information on failed tests during CI
Click to expand
If your PR is failing you can check out- The details of the action statuses at the end of the PR or the "Checks" tab.
- http://cdash.openms.de/index.php?project=OpenMS and look for your PR. Use the "Show filters" capability on the top right to search for your PR number. If you click in the column that lists the failed tests you will get detailed error messages.
Advanced commands (admins / reviewer only)
Click to expand
/reformat(experimental) applies the clang-format style changes as additional commit. Note: your branch must have a different name (e.g., yourrepo:feature/XYZ) than the receiving branch (e.g., OpenMS:develop). Otherwise, reformat fails to push.- setting the label "NoJenkins" will skip tests for this PR on jenkins (saves resources e.g., on edits that do not affect tests)
- commenting with
rebuild jenkinswill retrigger Jenkins-based CI builds
:warning: Note: Once you opened a PR try to minimize the number of pushes to it as every push will trigger CI (automated builds and test) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).
Type
Enhancement
Description
- Added new methods for aggregating data in the MSExperiment class. These methods include
aggregatefunctions with various parameters and overloads,getSpectraIdxRangeByRetentionTime,getSpectraIdcsByRetentionTime,getFirstProductSpectrum, andgetRangesIdcs_. These methods provide more flexible and efficient ways to aggregate and retrieve data from MSExperiment objects. - Introduced OpenMP parallelization in some of the new methods in the MSExperiment class to improve performance.
- Added a new test case for the
aggregatemethod in the MSExperiment class. - Added the
maybeGetIMDatamethod in the MSSpectrum class, which returns ion mobility data if available.
Changes walkthrough
| Relevant files | |||||
|---|---|---|---|---|---|
| Enhancement |
| ||||
| Tests |
|
✨ Usage guide:
Overview:
The describe tool scans the PR code changes, and generates a description for the PR - title, type, summary, walkthrough and labels. The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on a PR.
When commenting, to edit configurations related to the describe tool (pr_description section), use the following template:
/describe --pr_description.some_config1=... --pr_description.some_config2=...
With a configuration file, use the following template:
[pr_description]
some_config1=...
some_config2=...
Enabling\disabling automation
meaning the
the tool will replace every marker of the form
Note that when markers are enabled, if the original PR description does not contain any markers, the tool will not alter the description at all. |
Custom labelsThe default labels of the If you specify custom labels in the repo's labels page or via configuration file, you can get tailored labels for your use cases. Examples for custom labels:
The list above is eclectic, and aims to give an idea of different possibilities. Define custom labels that are relevant for your repo and use cases. Note that Labels are not mutually exclusive, so you can add multiple label categories. Make sure to provide proper title, and a detailed and well-phrased description for each label, so the tool will know when to suggest it. |
Utilizing extra instructionsThe Be specific, clear, and concise in the instructions. With extra instructions, you are the prompter. Notice that the general structure of the description is fixed, and cannot be changed. Extra instructions can change the content or style of each sub-section of the PR description. Examples for extra instructions:
Use triple quotes to write multi-line instructions. Use bullet points to make the instructions more readable. |
More PR-Agent commands
|
See the describe usage page for a comprehensive guide on using this tool.
nice draft! needs some cleanup though and resolving the discussion regarding use of strings (the standard way in popular python libraries) like scikit-learn. Could this string version only be exposed in python (and ideally tested if it still covers the enum set?)
This is a draft PR that is mainly for discussing the interface and general idea. And also to potentially find some to help with, since I did this over Xmas and won't have so much time in the next weeks. I agree with all of the comments so far. I think I am even removing the branchless binary search since you cannot force clang yet to make it branchless.
nice draft! needs some cleanup though and resolving the discussion regarding use of strings (the standard way in popular python libraries) like scikit-learn. Could this string version only be exposed in python (and ideally tested if it still covers the enum set?)
We can use an enum in Python. Only exposed in Python? No. Only with conditional compilation and preprocessor macros.
Edit: Well you could maybe use an additional header or dependent library/compile object that is only linked into pyopenms during pyopenms compilation but this gets quite tricky.
@CodiumAI-Agent /review
PR Analysis
(review updated until commit https://github.com/OpenMS/OpenMS/commit/6d9a5a3eaab93772ab8e803c466650ffb3459dff)
- 🎯 Main theme: Adding faster, simpler and customizable extract and aggregate functions to MSExperiment.
- 📝 PR summary: This PR introduces new extract and aggregate functions to the MSExperiment class. These functions are designed to be faster, simpler, and more customizable than their predecessors. They also include support for parallel processing through OpenMP. The PR also includes updates to relevant tests and Python bindings.
- 📌 Type of PR: Enhancement
- 🧪 Relevant tests added: Yes
- ⏱️ Estimated effort to review [1-5]: 4, because the PR introduces a significant amount of new code, including complex functions and parallel processing. It also modifies several files, including tests and Python bindings.
- 🔒 Security concerns: No security concerns found
PR Feedback
💡 General suggestions: The PR is well-structured and the new features are a valuable addition to the project. However, there are several areas where the code could be improved. Firstly, there are several commented-out sections of code that should be removed or uncommented with an explanation. Secondly, there are several TODO comments that should be addressed or removed. Finally, the code could benefit from more comments explaining the logic and purpose of the new functions, especially those involving parallel processing and complex calculations.
✨ Usage guide:
Overview:
The review tool scans the PR code changes, and generates a PR review. The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on any PR.
When commenting, to edit configurations related to the review tool (pr_reviewer section), use the following template:
/review --pr_reviewer.some_config1=... --pr_reviewer.some_config2=...
With a configuration file, use the following template:
[pr_reviewer]
some_config1=...
some_config2=...
Utilizing extra instructionsThe Be specific, clear, and concise in the instructions. With extra instructions, you are the prompter. Specify the relevant sub-tool, and the relevant aspects of the PR that you want to emphasize. Examples for extra instructions:
Use triple quotes to write multi-line instructions. Use bullet points to make the instructions more readable. |
How to enable\disable automation
meaning the |
About the 'Code feedback' sectionThe |
Auto-labelsThe
|
Extra sub-toolsThe |
More PR-Agent commands
|
See the review usage page for a comprehensive guide on using this tool.
Persistent review updated to latest commit https://github.com/OpenMS/OpenMS/commit/6d9a5a3eaab93772ab8e803c466650ffb3459dff
/describe
PR Description updated to latest commit (https://github.com/OpenMS/OpenMS/commit/6d9a5a3eaab93772ab8e803c466650ffb3459dff)
/add_docs
quite busy but can give it a quick look to push this forward
Right now it seems we combine RTBegin with areaiterators that can restrict to a ms_level. Would it make sense to also have a RTBegin version that considers the ms level?
Another question: does anyone (e.g. @cbielow , @jcharkow etc.) know if these interfaces still work if we, e.g., want to retrieve e.g., IM frames etc.? Not a showstopper, but maybe important to consider here?
I think we could add an an IM range but it will be quite slow since it will go through a spectrum linearly and check all IM values in the floatarray (for IM data that is annotated like this at least; PASEF?). It will be very hard to keep this general enough if IM info can also be attached somewhere else (e.g. at the spectrum level; for FAIMS?). Not an expert.
I still think that subclasses like TIMSExperiment, FAIMSExperiment, MSExperiment would be worthwhile for efficient handling. But I remember that there were concerns from other sides. They guarantee that things are sorted in a certain way. Maybe even have additional members that are not metavalues anymore, etc.
hmm I think subclassing could introduce a few complications here (vector inheritance, currently final, ... ) so I would love to prevent this if possible.
Two primary data types are exposed through TimsRust:
e.g. timsrust exposes only:
Spectra: A traditional representation that expresses intensitites in function of mz values for a given precursor.
Frames: All recorded data from a single TIMS elution (i.e. at one specific retention_time).
So in principle, a frame could just be stored in a spectrum with the float data arrays (and annotated as frame - maybe it is alrady done that way?). As you point out, this might make some operations a bit slow. Looking at https://github.com/OpenMS/OpenMS/blob/develop/src/openms/include/OpenMS/IONMOBILITY/IMDataConverter.h#L59 it seems that we would probably consume those frames/spectra and do some processing like binning and create one MSExperiment from those. Creating one MSExperiment might impose some overhead, though. Not sure if this is an issue, but my guess the overhead is okish @cbielow .
Yes, looks like they are using a sparse offset representation though. Not a dense float array representing every peak.