do not merge: Fix iBAQ normalization silently skipped for protein groups (issue #8466)
The root cause could be an accession mismatch between prot_quant_ keys and ProteinHit accessions:
- transferPeptideDataToProteins_() populates prot_quant_ using LEADER accessions from getAccession_() + mapAccessionToLeader()
- performIbaqNormalization_() iterated over protein hits directly and looked up hit.getAccession() in prot_quant_
- When protein groups exist, non-leader hits would fail the lookup, silently skipping iBAQ normalization
The fix changes performIbaqNormalization_() to:
- Build the same accession_to_leader mapping used elsewhere
- Create a reverse leader_to_members mapping
- Iterate over prot_quant_ entries (keyed by leaders) instead of hits
- For each leader, find a sequence from the leader hit or any member
This ensures iBAQ normalization is applied correctly even when:
- Leader accession differs from protein hit accessions
- Only group member hits have sequences
- Complex protein inference produces group structures
Added two test cases that expose the bug:
- iBAQ with indistinguishable protein groups (leader present)
- iBAQ with missing leader protein hit (only members present)
Description
Checklist
- [ ] Make sure that you are listed in the AUTHORS file
- [ ] Add relevant changes and new features to the CHANGELOG file
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] New and existing unit tests pass locally with my changes
- [ ] Updated or added python bindings for changed or new classes (Tick if no updates were necessary.)
How can I get additional information on failed tests during CI
Click to expand
If your PR is failing you can check out- The details of the action statuses at the end of the PR or the "Checks" tab.
- http://cdash.seqan.de/index.php?project=OpenMS and look for your PR. Use the "Show filters" capability on the top right to search for your PR number. If you click in the column that lists the failed tests you will get detailed error messages.
Advanced commands (admins / reviewer only)
Click to expand
/reformat(experimental) applies the clang-format style changes as additional commit. Note: your branch must have a different name (e.g., yourrepo:feature/XYZ) than the receiving branch (e.g., OpenMS:develop). Otherwise, reformat fails to push.- setting the label "NoJenkins" will skip tests for this PR on jenkins (saves resources e.g., on edits that do not affect tests)
- commenting with
rebuild jenkinswill retrigger Jenkins-based CI builds
:warning: Note: Once you opened a PR try to minimize the number of pushes to it as every push will trigger CI (automated builds and test) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).
[!IMPORTANT]
Review skipped
Auto reviews are disabled on base/target branches other than the default branch.
Please check the settings in the CodeRabbit UI or the
.coderabbit.yamlfile in this repository. To trigger a single review, invoke the@coderabbitai reviewcommand.You can disable this status message by setting the
reviews.review_statustofalsein the CodeRabbit configuration file.
✨ Finishing touches
🧪 Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
- [ ] Commit unit tests in branch
claude/fix-ibaq-accession-mismatch-01KPXTWvaU3A5GDhcNdL7JER
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Comment @coderabbitai help to get the list of available commands and usage tips.
How the Mismatch Happens in ProteomicsLFQ
In a typical ProteomicsLFQ workflow, the accession mismatch occurs during the protein inference step.
1. Initial Peptide-to-Protein Mapping (PeptideIndexer)
Peptide AAAK → {sp|P12345|PROTA, sp|P67890|PROTB} Peptide CCCK → {sp|P12345|PROTA, sp|P67890|PROTB}
Both proteins share the same peptides → they are indistinguishable.
2. Protein Inference (Epifany / BayesianProteinInference)
Creates indistinguishable protein groups:
ProteinGroup {
accessions: ["sp|P12345|PROTA", "sp|P67890|PROTB"], // Leader is first
probability: 0.95
}
The leader (sp|P12345|PROTA) is chosen based on scoring or alphabetical order.
3. What Gets Stored in ConsensusXML
ProteinHits (may or may not contain all group members):
<ProteinHit accession="sp|P67890|PROTB" sequence="AAAKCCCK..." />
<!-- Leader sp|P12345|PROTA might be missing or have no sequence! -->
IndistinguishableProteins (group structure):
<UserParam name="indistinguishable_proteins_0"
value="0.95,sp|P12345|PROTA,sp|P67890|PROTB"/>
4. The Bug in ProteinQuantifier / iBAQ
Step A — transferPeptideDataToProteins_()
// Peptide references "sp|P67890|PROTB"
// mapAccessionToLeader() returns: "sp|P67890|PROTB" → "sp|P12345|PROTA"
// getAccession_() returns the LEADER
prot_quant_["sp|P12345|PROTA"].total_abundances[0] = 1234.0;
// Stored under LEADER
Step B — performIbaqNormalization_() (OLD BUGGY CODE)
for (auto& hit : proteins.getHits()) // Only sees "sp|P67890|PROTB"
{
// hit.getAccession() = "sp|P67890|PROTB"
if (prot_quant_.find("sp|P67890|PROTB") != end()) // FALSE! Key is leader
{
// NEVER EXECUTED → normalization skipped
}
}
@jpfeuffer that was the AI explanation.
I agree it is safer but do you agree that during normal usage (especially in proteomicslfq which was mentioned in the reported issue) the hits should always be a strict be a superset of the leader accessions and therefore the lookup should never fail? That is how inference is performed.
I also am not sure why the bot would need a reverse map? Why would one need to quantify or normalize the group members?? There should be no quant values available for them.
I agree and don't know why it could be no superset. So we should do something like:
iBAQ = Σ Intensity(unique peptides assigned to protein group leader)
---------------------------------------------------------
N_theoretical(peptides for protein group leader)