OpenMS icon indicating copy to clipboard operation
OpenMS copied to clipboard

do not merge: Fix iBAQ normalization silently skipped for protein groups (issue #8466)

Open timosachsenberg opened this issue 3 weeks ago • 5 comments

The root cause could be an accession mismatch between prot_quant_ keys and ProteinHit accessions:

  • transferPeptideDataToProteins_() populates prot_quant_ using LEADER accessions from getAccession_() + mapAccessionToLeader()
  • performIbaqNormalization_() iterated over protein hits directly and looked up hit.getAccession() in prot_quant_
  • When protein groups exist, non-leader hits would fail the lookup, silently skipping iBAQ normalization

The fix changes performIbaqNormalization_() to:

  1. Build the same accession_to_leader mapping used elsewhere
  2. Create a reverse leader_to_members mapping
  3. Iterate over prot_quant_ entries (keyed by leaders) instead of hits
  4. For each leader, find a sequence from the leader hit or any member

This ensures iBAQ normalization is applied correctly even when:

  • Leader accession differs from protein hit accessions
  • Only group member hits have sequences
  • Complex protein inference produces group structures

Added two test cases that expose the bug:

  • iBAQ with indistinguishable protein groups (leader present)
  • iBAQ with missing leader protein hit (only members present)

Description

Checklist

  • [ ] Make sure that you are listed in the AUTHORS file
  • [ ] Add relevant changes and new features to the CHANGELOG file
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] New and existing unit tests pass locally with my changes
  • [ ] Updated or added python bindings for changed or new classes (Tick if no updates were necessary.)

How can I get additional information on failed tests during CI

Click to expand If your PR is failing you can check out
  • The details of the action statuses at the end of the PR or the "Checks" tab.
  • http://cdash.seqan.de/index.php?project=OpenMS and look for your PR. Use the "Show filters" capability on the top right to search for your PR number. If you click in the column that lists the failed tests you will get detailed error messages.

Advanced commands (admins / reviewer only)

Click to expand
  • /reformat (experimental) applies the clang-format style changes as additional commit. Note: your branch must have a different name (e.g., yourrepo:feature/XYZ) than the receiving branch (e.g., OpenMS:develop). Otherwise, reformat fails to push.
  • setting the label "NoJenkins" will skip tests for this PR on jenkins (saves resources e.g., on edits that do not affect tests)
  • commenting with rebuild jenkins will retrigger Jenkins-based CI builds

:warning: Note: Once you opened a PR try to minimize the number of pushes to it as every push will trigger CI (automated builds and test) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).

timosachsenberg avatar Dec 11 '25 12:12 timosachsenberg

[!IMPORTANT]

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment
  • [ ] Commit unit tests in branch claude/fix-ibaq-accession-mismatch-01KPXTWvaU3A5GDhcNdL7JER

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Dec 11 '25 12:12 coderabbitai[bot]

How the Mismatch Happens in ProteomicsLFQ

In a typical ProteomicsLFQ workflow, the accession mismatch occurs during the protein inference step.


1. Initial Peptide-to-Protein Mapping (PeptideIndexer)

Peptide AAAK → {sp|P12345|PROTA, sp|P67890|PROTB} Peptide CCCK → {sp|P12345|PROTA, sp|P67890|PROTB}

Both proteins share the same peptides → they are indistinguishable.


2. Protein Inference (Epifany / BayesianProteinInference)

Creates indistinguishable protein groups:

ProteinGroup {
  accessions: ["sp|P12345|PROTA", "sp|P67890|PROTB"],  // Leader is first
  probability: 0.95
}
The leader (sp|P12345|PROTA) is chosen based on scoring or alphabetical order.

3. What Gets Stored in ConsensusXML
ProteinHits (may or may not contain all group members):

<ProteinHit accession="sp|P67890|PROTB" sequence="AAAKCCCK..." />
<!-- Leader sp|P12345|PROTA might be missing or have no sequence! -->
IndistinguishableProteins (group structure):

<UserParam name="indistinguishable_proteins_0"
           value="0.95,sp|P12345|PROTA,sp|P67890|PROTB"/>
4. The Bug in ProteinQuantifier / iBAQ
Step A — transferPeptideDataToProteins_()
// Peptide references "sp|P67890|PROTB"
// mapAccessionToLeader() returns: "sp|P67890|PROTB" → "sp|P12345|PROTA"
// getAccession_() returns the LEADER

prot_quant_["sp|P12345|PROTA"].total_abundances[0] = 1234.0; 
// Stored under LEADER
Step B — performIbaqNormalization_() (OLD BUGGY CODE)

for (auto& hit : proteins.getHits())  // Only sees "sp|P67890|PROTB"
{
  // hit.getAccession() = "sp|P67890|PROTB"
  if (prot_quant_.find("sp|P67890|PROTB") != end())  // FALSE! Key is leader
  {
    // NEVER EXECUTED → normalization skipped
  }
}

timosachsenberg avatar Dec 11 '25 12:12 timosachsenberg

@jpfeuffer that was the AI explanation.

timosachsenberg avatar Dec 11 '25 12:12 timosachsenberg

I agree it is safer but do you agree that during normal usage (especially in proteomicslfq which was mentioned in the reported issue) the hits should always be a strict be a superset of the leader accessions and therefore the lookup should never fail? That is how inference is performed.

I also am not sure why the bot would need a reverse map? Why would one need to quantify or normalize the group members?? There should be no quant values available for them.

jpfeuffer avatar Dec 11 '25 12:12 jpfeuffer

I agree and don't know why it could be no superset. So we should do something like:

iBAQ = Σ Intensity(unique peptides assigned to protein group leader)
       ---------------------------------------------------------
            N_theoretical(peptides for protein group leader)

timosachsenberg avatar Dec 11 '25 12:12 timosachsenberg