
Enhancements in MSExperiment and MSSpectrum classes

Open jpfeuffer opened this issue 1 year ago • 15 comments

User description

Description

Adds faster, simpler and customizable extract and aggregate functions to MSExperiment.

TODOs:

  • Decide on the final API. Probably templatize the inputs/outputs of the aggregation functions, such that you can extract a Chromatogram object instead of just std::vectors or doubles
  • More docs
  • Add support for grouped extractions with a third aggregation function on how to combine extractions in that group (basically enabling full OpenSwath "feature finding" pipeline). Benefits from extra scheduling of group aggregation threads to start aggregating once a group was fully extracted

Checklist

  • [ ] Make sure that you are listed in the AUTHORS file
  • [ ] Add relevant changes and new features to the CHANGELOG file
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] New and existing unit tests pass locally with my changes
  • [ ] Updated or added python bindings for changed or new classes (Tick if no updates were necessary.)

How can I get additional information on failed tests during CI?

Click to expand. If your PR is failing, you can check out:
  • The details of the action statuses at the end of the PR or the "Checks" tab.
  • http://cdash.openms.de/index.php?project=OpenMS and look for your PR. Use the "Show filters" capability on the top right to search for your PR number. If you click in the column that lists the failed tests you will get detailed error messages.

Advanced commands (admins / reviewer only)

Click to expand
  • /reformat (experimental) applies the clang-format style changes as additional commit. Note: your branch must have a different name (e.g., yourrepo:feature/XYZ) than the receiving branch (e.g., OpenMS:develop). Otherwise, reformat fails to push.
  • setting the label "NoJenkins" will skip tests for this PR on jenkins (saves resources e.g., on edits that do not affect tests)
  • commenting with rebuild jenkins will retrigger Jenkins-based CI builds

:warning: Note: Once you have opened a PR, try to minimize the number of pushes to it, as every push triggers CI (automated builds and tests) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).


Type

Enhancement


Description

  • Added new methods for aggregating data in the MSExperiment class. These methods include aggregate functions with various parameters and overloads, getSpectraIdxRangeByRetentionTime, getSpectraIdcsByRetentionTime, getFirstProductSpectrum, and getRangesIdcs_. These methods provide more flexible and efficient ways to aggregate and retrieve data from MSExperiment objects.
  • Introduced OpenMP parallelization in some of the new methods in the MSExperiment class to improve performance.
  • Added a new test case for the aggregate method in the MSExperiment class.
  • Added the maybeGetIMData method in the MSSpectrum class, which returns ion mobility data if available.

Changes walkthrough

Relevant files                                                                                                                                 
Enhancement
MSExperiment.cpp                                                                                       
    src/openms/source/KERNEL/MSExperiment.cpp

    Added new methods for aggregating data in the MSExperiment
    class. These methods include aggregate functions with
    various parameters and overloads,
    getSpectraIdxRangeByRetentionTime,
    getSpectraIdcsByRetentionTime, getFirstProductSpectrum,
    and getRangesIdcs_. Also, OpenMP parallelization was
    introduced in some of the new methods.

+377/-9
MSSpectrum.cpp                                                                                           
    src/openms/source/KERNEL/MSSpectrum.cpp

    Added the maybeGetIMData method in the MSSpectrum class,
    which returns ion mobility data if available.

+13/-0
Tests
MSExperiment_test.cpp                                                                             
    src/tests/class_tests/openms/source/MSExperiment_test.cpp

    Added a new test case for the aggregate method in the
    MSExperiment class.

+78/-0

✨ Usage guide:

Overview: The describe tool scans the PR code changes, and generates a description for the PR - title, type, summary, walkthrough and labels. The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on a PR.

When commenting, to edit configurations related to the describe tool (pr_description section), use the following template:

/describe --pr_description.some_config1=... --pr_description.some_config2=...

With a configuration file, use the following template:

[pr_description]
some_config1=...
some_config2=...
Enabling/disabling automation
  • When you first install the app, the default mode for the describe tool is:
pr_commands = ["/describe --pr_description.add_original_user_description=true" 
                         "--pr_description.keep_original_user_title=true", ...]

meaning the describe tool will run automatically on every PR, will keep the original title, and will add the original user description above the generated description.

  • Markers are an alternative way to control the generated description, to give maximal control to the user. If you set:
pr_commands = ["/describe --pr_description.use_description_markers=true", ...]

the tool will replace every marker of the form pr_agent:marker_name in the PR description with the relevant content, where marker_name is one of the following:

  • type: the PR type.
  • summary: the PR summary.
  • walkthrough: the PR walkthrough.

Note that when markers are enabled, if the original PR description does not contain any markers, the tool will not alter the description at all.

Custom labels

The default labels of the describe tool are quite generic: [Bug fix, Tests, Enhancement, Documentation, Other].

If you specify custom labels in the repo's labels page or via configuration file, you can get tailored labels for your use cases. Examples for custom labels:

  • Main topic:performance - pr_agent:The main topic of this PR is performance
  • New endpoint - pr_agent:A new endpoint was added in this PR
  • SQL query - pr_agent:A new SQL query was added in this PR
  • Dockerfile changes - pr_agent:The PR contains changes in the Dockerfile
  • ...

The list above is eclectic and aims to give an idea of different possibilities. Define custom labels that are relevant to your repo and use cases. Note that labels are not mutually exclusive, so you can add multiple label categories. Make sure to provide a proper title and a detailed, well-phrased description for each label, so the tool knows when to suggest it.

Utilizing extra instructions

The describe tool can be configured with extra instructions, to guide the model to a feedback tailored to the needs of your project.

Be specific, clear, and concise in the instructions. With extra instructions, you are the prompter. Notice that the general structure of the description is fixed, and cannot be changed. Extra instructions can change the content or style of each sub-section of the PR description.

Examples for extra instructions:

[pr_description] 
extra_instructions="""
- The PR title should be in the format: '<PR type>: <title>'
- The title should be short and concise (up to 10 words)
- ...
"""

Use triple quotes to write multi-line instructions. Use bullet points to make the instructions more readable.

More PR-Agent commands

To invoke the PR-Agent, add a comment using one of the following commands:

  • /review: Request a review of your Pull Request.
  • /describe: Update the PR title and description based on the contents of the PR.
  • /improve [--extended]: Suggest code improvements. Extended mode provides a higher quality feedback.
  • /ask <QUESTION>: Ask a question about the PR.
  • /update_changelog: Update the changelog based on the PR's contents.
  • /add_docs 💎: Generate docstring for new components introduced in the PR.
  • /generate_labels 💎: Generate labels for the PR based on the PR's contents.
  • /analyze 💎: Automatically analyzes the PR, and presents changes walkthrough for each component.

See the tools guide for more details. To list the possible configuration parameters, add a /config comment.

See the describe usage page for a comprehensive guide on using this tool.

jpfeuffer avatar Jan 14 '24 18:01 jpfeuffer

nice draft! needs some cleanup though and resolving the discussion regarding use of strings (the standard way in popular python libraries) like scikit-learn. Could this string version only be exposed in python (and ideally tested if it still covers the enum set?)

timosachsenberg avatar Jan 15 '24 13:01 timosachsenberg

This is a draft PR that is mainly for discussing the interface and general idea, and also to potentially find someone to help with it, since I did this over Xmas and won't have much time in the next weeks. I agree with all of the comments so far. I think I am even removing the branchless binary search, since you cannot yet force clang to make it branchless.

jpfeuffer avatar Jan 15 '24 13:01 jpfeuffer

nice draft! needs some cleanup though and resolving the discussion regarding use of strings (the standard way in popular python libraries) like scikit-learn. Could this string version only be exposed in python (and ideally tested if it still covers the enum set?)

We can use an enum in Python. Only exposed in Python? No. Only with conditional compilation and preprocessor macros.

Edit: Well you could maybe use an additional header or dependent library/compile object that is only linked into pyopenms during pyopenms compilation but this gets quite tricky.

jpfeuffer avatar Jan 15 '24 13:01 jpfeuffer

@CodiumAI-Agent /review

timosachsenberg avatar Jan 17 '24 14:01 timosachsenberg

PR Analysis

(review updated until commit https://github.com/OpenMS/OpenMS/commit/6d9a5a3eaab93772ab8e803c466650ffb3459dff)

  • 🎯 Main theme: Adding faster, simpler and customizable extract and aggregate functions to MSExperiment.
  • 📝 PR summary: This PR introduces new extract and aggregate functions to the MSExperiment class. These functions are designed to be faster, simpler, and more customizable than their predecessors. They also include support for parallel processing through OpenMP. The PR also includes updates to relevant tests and Python bindings.
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: Yes
  • ⏱️ Estimated effort to review [1-5]: 4, because the PR introduces a significant amount of new code, including complex functions and parallel processing. It also modifies several files, including tests and Python bindings.
  • 🔒 Security concerns: No security concerns found

PR Feedback

💡 General suggestions: The PR is well-structured and the new features are a valuable addition to the project. However, there are several areas where the code could be improved. Firstly, there are several commented-out sections of code that should be removed or uncommented with an explanation. Secondly, there are several TODO comments that should be addressed or removed. Finally, the code could benefit from more comments explaining the logic and purpose of the new functions, especially those involving parallel processing and complex calculations.


✨ Usage guide:

Overview: The review tool scans the PR code changes, and generates a PR review. The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on any PR. When commenting, to edit configurations related to the review tool (pr_reviewer section), use the following template:

/review --pr_reviewer.some_config1=... --pr_reviewer.some_config2=...

With a configuration file, use the following template:

[pr_reviewer]
some_config1=...
some_config2=...
Utilizing extra instructions

The review tool can be configured with extra instructions, which can be used to guide the model to a feedback tailored to the needs of your project.

Be specific, clear, and concise in the instructions. With extra instructions, you are the prompter. Specify the relevant sub-tool, and the relevant aspects of the PR that you want to emphasize.

Examples for extra instructions:

[pr_reviewer] # /review #
extra_instructions="""
In the code feedback section, emphasize the following:
- Does the code logic cover relevant edge cases?
- Is the code logic clear and easy to understand?
- Is the code logic efficient?
...
"""

Use triple quotes to write multi-line instructions. Use bullet points to make the instructions more readable.

How to enable/disable automation
  • When you first install PR-Agent app, the default mode for the review tool is:
pr_commands = ["/review", ...]

meaning the review tool will run automatically on every PR, with the default configuration. Edit this field to enable/disable the tool, or to change the used configurations.

About the 'Code feedback' section

The review tool provides several types of feedback, one of which is code suggestions. If you are interested only in code suggestions, it is recommended to use the improve feature instead, since it is dedicated to code suggestions and usually gives better results. Use the review tool if you want more comprehensive feedback, which includes code suggestions as well.

Auto-labels

The review tool can auto-generate two specific types of labels for a PR:

  • a possible security issue label, that detects possible security issues (enable_review_labels_security flag)
  • a Review effort [1-5]: x label, where x is the estimated effort to review the PR (enable_review_labels_effort flag)
Extra sub-tools

The review tool provides a collection of possible feedbacks about a PR. It is recommended to review the possible options and choose the ones relevant for your use case. Some of the features that are disabled by default are quite useful and should be considered for enabling. For example: require_score_review, require_soc2_review, enable_review_labels_effort, and more.

More PR-Agent commands

To invoke the PR-Agent, add a comment using one of the following commands:

  • /review: Request a review of your Pull Request.
  • /describe: Update the PR title and description based on the contents of the PR.
  • /improve [--extended]: Suggest code improvements. Extended mode provides a higher quality feedback.
  • /ask <QUESTION>: Ask a question about the PR.
  • /update_changelog: Update the changelog based on the PR's contents.
  • /add_docs 💎: Generate docstring for new components introduced in the PR.
  • /generate_labels 💎: Generate labels for the PR based on the PR's contents.
  • /analyze 💎: Automatically analyzes the PR, and presents changes walkthrough for each component.

See the tools guide for more details. To list the possible configuration parameters, add a /config comment.

See the review usage page for a comprehensive guide on using this tool.

QodoAI-Agent avatar Jan 17 '24 14:01 QodoAI-Agent

Persistent review updated to latest commit https://github.com/OpenMS/OpenMS/commit/6d9a5a3eaab93772ab8e803c466650ffb3459dff

QodoAI-Agent avatar Jan 17 '24 14:01 QodoAI-Agent

/describe

timosachsenberg avatar Jan 18 '24 12:01 timosachsenberg

PR Description updated to latest commit (https://github.com/OpenMS/OpenMS/commit/6d9a5a3eaab93772ab8e803c466650ffb3459dff)

github-actions[bot] avatar Jan 18 '24 12:01 github-actions[bot]

/add_docs

timosachsenberg avatar Jan 18 '24 12:01 timosachsenberg

quite busy but can give it a quick look to push this forward

timosachsenberg avatar Jan 31 '24 07:01 timosachsenberg

Right now it seems we combine RTBegin with area iterators that can restrict to an ms_level. Would it make sense to also have an RTBegin version that considers the MS level?

Another question: does anyone (e.g. @cbielow, @jcharkow, etc.) know if these interfaces still work if we want to retrieve, e.g., IM frames? Not a showstopper, but maybe important to consider here?

timosachsenberg avatar Jan 31 '24 09:01 timosachsenberg

I think we could add an IM range, but it will be quite slow, since it will go through a spectrum linearly and check all IM values in the float array (for IM data that is annotated like this, at least; PASEF?). It will be very hard to keep this general enough if IM info can also be attached somewhere else (e.g. at the spectrum level; FAIMS?). Not an expert.

jpfeuffer avatar Jan 31 '24 09:01 jpfeuffer

I still think that subclasses like TIMSExperiment, FAIMSExperiment, MSExperiment would be worthwhile for efficient handling. But I remember that there were concerns from other sides. They guarantee that things are sorted in a certain way. Maybe even have additional members that are not metavalues anymore, etc.

jpfeuffer avatar Jan 31 '24 09:01 jpfeuffer

hmm I think subclassing could introduce a few complications here (vector inheritance, currently final, ...), so I would love to avoid this if possible.

e.g. timsrust exposes only two primary data types:

Spectra: A traditional representation that expresses intensities as a function of mz values for a given precursor.
Frames: All recorded data from a single TIMS elution (i.e. at one specific retention_time).

So in principle, a frame could just be stored in a spectrum with the float data arrays (and annotated as a frame; maybe it is already done that way?). As you point out, this might make some operations a bit slow. Looking at https://github.com/OpenMS/OpenMS/blob/develop/src/openms/include/OpenMS/IONMOBILITY/IMDataConverter.h#L59 it seems that we would probably consume those frames/spectra and do some processing like binning, and create one MSExperiment from those. Creating one MSExperiment might impose some overhead, though. Not sure if this is an issue, but my guess is the overhead is ok-ish @cbielow .

timosachsenberg avatar Jan 31 '24 09:01 timosachsenberg

Yes, looks like they are using a sparse offset representation though. Not a dense float array representing every peak.

jpfeuffer avatar Jan 31 '24 11:01 jpfeuffer