LFQ MBR FDR algorithm needed.
Description of the Feature
During the benchmark of quantms using LFQ and MBR (issues #300, #301, #287), we developed a new probabilistic algorithm based on an SVM that controls the number of false positives better than the previous ProteomicsLFQ algorithm (which is based on the number of samples where the feature is found).
Although the current algorithm produces more reliable results (issues #301, #287), we should aim for a better FDR control algorithm in ProteomicsLFQ that uses only one parameter. In addition, it would be great to improve the algorithm and the feature detection. From my point of view, these are the priorities for that algorithm:
- [ ] Implement an FDR-based approach for MBR that reduces the number of parameters.
- [ ] Improve the feature detection, including the possibility to transfer features across any msrun in the experiment. I think OpenMS only transfers features across samples in the same condition, whereas MaxQuant (MQ) uses all msruns in the experiment, which may be the source of the differences between the tools.
- [ ] Implement MBR for TMT datasets, similar to the following manuscript: https://pubs.acs.org/doi/10.1021/acs.jproteome.0c00209
We can discuss the details @timosachsenberg @jpfeuffer @daichengxin.
I think it should already transfer between all files with the same fraction number. I think our settings are a bit conservative so as not to inflate the transfer FDR too much. A more data-driven approach would be great here.
E.g., I could imagine that we could
- determine the most similar runs (e.g., as in MapAlignerTreeGuided)
- train a classifier on identified target and decoy (mass offset) features to model correct transfer and wrong transfer (to an offset feature).
- use the classifier in FeatureLinkerUnlabeledQT to annotate linking p-values
- figure out how to filter those to attain a global transfer FDR
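The last step of the list above could look roughly like the following. This is only a minimal sketch of generic target-decoy counting, not an OpenMS implementation: the `Transfer` record, the `score` field (assumed to be a classifier output where higher means more likely correct) and the function names are hypothetical, and it assumes one decoy (mass offset) candidate is tested per target.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    score: float    # hypothetical classifier score, higher = more likely correct
    is_decoy: bool  # True if the transfer hit a mass-offset decoy feature


def transfer_q_values(transfers):
    """Return q-values aligned with `transfers`, using target-decoy counting."""
    order = sorted(range(len(transfers)),
                   key=lambda i: transfers[i].score, reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if transfers[i].is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR if we accepted everything down to this score.
        fdrs.append(decoys / max(targets, 1))
    # q-value: smallest FDR at this score or at any more permissive threshold.
    q = [0.0] * len(transfers)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        q[order[rank]] = running_min
    return q


def accept_transfers(transfers, max_fdr=0.05):
    """Keep target transfers whose q-value is within the single FDR parameter."""
    q = transfer_q_values(transfers)
    return [t for t, qv in zip(transfers, q)
            if not t.is_decoy and qv <= max_fdr]
```

With this shape, `max_fdr` would be the only user-facing parameter, which matches the "one parameter" goal from the issue description.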
@jpfeuffer and @cbielow what do you think?
More scalable alternatives would be approaches like IonQuant or Sage.
> I think it should already transfer between all files with the same fraction number. I think our settings are a bit conservative so as not to inflate the transfer FDR too much. A more data-driven approach would be great here.
I'm probably wrong, but MQ does not care much about fraction identifiers; it also transfers across fractions. My guess is based on the assumption that MQ does not know which raw file belongs to which fraction.
> I'm probably wrong, but MQ does not care much about fraction identifiers; it also transfers across fractions. My guess is based on the assumption that MQ does not know which raw file belongs to which fraction.
Actually, MQ only transfers IDs across fractions that are at most one fraction apart. Hence you also have to tell MQ the fraction number in the experimental design. Of course, if you simply "forget" to annotate fractions in MQ, it will transfer whatever it can across all runs (and incur a massive false positive rate...)
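To make the rule concrete, here is a small sketch of which run pairs would be eligible for transfer under an "at most one fraction apart" constraint. It only illustrates the rule as described in this comment and is not MaxQuant code; the function name, the `max_fraction_distance` parameter and the run names are made up for the example.

```python
from itertools import combinations


def allowed_transfer_pairs(run_fractions, max_fraction_distance=1):
    """run_fractions maps a run name to its fraction number."""
    pairs = []
    for a, b in combinations(sorted(run_fractions), 2):
        if abs(run_fractions[a] - run_fractions[b]) <= max_fraction_distance:
            pairs.append((a, b))
    return pairs


# Two samples with three fractions each (hypothetical design).
runs = {"s1_f1": 1, "s1_f2": 2, "s1_f3": 3,
        "s2_f1": 1, "s2_f2": 2, "s2_f3": 3}

annotated = allowed_transfer_pairs(runs)
# "Forgetting" the annotation is equivalent to giving every run the same fraction.
unannotated = allowed_transfer_pairs({name: 1 for name in runs})
```

In this toy design the annotated case excludes the fraction 1 to fraction 3 pairs, while the unannotated case allows every pair of runs, so the number of candidate transfers grows with each missing annotation.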
This is interesting, @cbielow. Nice discussion. I have seen a lot of experiments that do not provide fraction information. @cbielow, does the MQ FDR algorithm correct for that, or will the FDR be inflated? If it is inflated, do you have a paper reference or some data that show it?
I only have very old data (and I would need to dig a lot to find it) and anecdotal evidence.
There is https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7346880/ which does not discuss fractions but shows that the MQ FDR is not kept in check unless you enable the MQ LFQ algorithm.
There is also a discussion on the MQ mailing list on this: https://groups.google.com/g/maxquant-list/c/a9bZMUeSE7Y/m/J6Rw174oCAAJ
Even in newer MQ versions, the XML config still has `<matchBetweenRunsFdr>False</matchBetweenRunsFdr>` by default, with no way of enabling it in the GUI, and it's hard to find any documentation on the topic. So it seems MQ is not very confident about this and disables it.
There is also https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8131922/ which describes an FDR method that could be augmented with more data to improve it, IMHO. The paper also uses MQ 1.6, which is rather old.
Good ideas. The problem with the last approach is that it is very costly with our current data structures. I think we would need a binned and indexed representation of an experiment to make this viable (see FlashLFQ or Sage).
And I think we might need to dissect the FFID API to be able to extract single features on demand. Currently it is very focussed on processing a full set of predefined IDs.
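As a rough illustration of what a binned and indexed representation with on-demand lookup could mean, here is a minimal sketch. It is not how FlashLFQ, Sage or OpenMS store data; the class name, the bin width and the tolerance defaults are assumptions chosen for the example.

```python
import math
from collections import defaultdict


class BinnedPeakIndex:
    """Toy m/z-binned peak index (hypothetical, for illustration only)."""

    def __init__(self, bin_width=0.01):
        self.bin_width = bin_width
        self.bins = defaultdict(list)  # bin id -> list of (mz, rt, intensity)

    def _bin(self, mz):
        return math.floor(mz / self.bin_width)

    def add(self, mz, rt, intensity):
        self.bins[self._bin(mz)].append((mz, rt, intensity))

    def query(self, mz, rt, ppm=10.0, rt_tol=30.0):
        """Return peaks within the m/z (ppm) and RT (seconds) tolerances."""
        tol = mz * ppm * 1e-6
        lo, hi = mz - tol, mz + tol
        hits = []
        # Only the few bins overlapping the m/z window are visited.
        for b in range(self._bin(lo), self._bin(hi) + 1):
            for peak in self.bins.get(b, ()):
                if lo <= peak[0] <= hi and abs(peak[1] - rt) <= rt_tol:
                    hits.append(peak)
        return hits


index = BinnedPeakIndex()
index.add(500.2500, 100.0, 1e5)
index.add(500.2530, 102.0, 2e5)
index.add(500.2600, 101.0, 5e4)   # outside the 10 ppm window
index.add(500.2501, 400.0, 3e4)   # outside the RT window
hits = index.query(500.25, 100.0)
```

The point of the layout is that a single candidate feature can be looked up without scanning the whole run, which is what extracting features on demand would require.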
Not saying it can't be done. @timosachsenberg and I were just thinking about potentially faster or easier-to-implement ways.
Btw, it is interesting that the LFQ algorithm (I did not look into the details, but I think it is MaxLFQ) seems to correct for some wrong linking. It can probably be seen as a robust summarization method.
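A toy example of why a median-based summarization can absorb a wrongly linked feature is shown below. It is not the MaxLFQ algorithm itself, only the robustness argument in miniature, and the log-ratio values are invented for the illustration.

```python
from statistics import mean, median

# Hypothetical per-peptide log2 ratios (run B / run A) for one protein.
# The last value stands in for a peptide whose feature was linked wrongly.
log_ratios = [1.0, 1.1, 0.9, 1.05, 6.0]

robust_estimate = median(log_ratios)  # barely affected by the outlier
naive_estimate = mean(log_ratios)     # pulled strongly towards the outlier
```

Here the median stays close to the consistent peptides, while the mean is shifted by the single wrong link.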