Uncaught exception in DeepLCFeatureGenerator if not enough peptides for calibration set
I'm getting an uncaught exception when trying to use ms2rescore.feature_generators.deeplc.DeepLCFeatureGenerator. The error happens when there are not enough peptides in psm_list for the calibration set.
Here's how I create the environment:
C:\python\python309\python.exe -m venv venv_309_ms2rescore
venv_309_ms2rescore\Scripts\pip3 install ms2rescore==3.0.2
I'm calling the feature generator as instructed in the MS2Rescore docs:
from ms2rescore.feature_generators.deeplc import DeepLCFeatureGenerator

fgen = DeepLCFeatureGenerator(
    lower_score_is_better=True,  # because we use expect value as 'score'
    spectrum_path=None,  # not relevant
    processes=processes,
    deeplc_retrain=False,
    calibration_set_size=0.15,
)
fgen.add_features(psm_list)
When there are only a few items in psm_list, there's an uncaught exception:
2024-03-22 11:17:35,204 INFO Running DeepLC for PSMs from run (1/1): `F981141_1.tsv9ig132dw.mgf`...
Traceback (most recent call last):
  File "C:\Users\villek\githead\mascot-proj\mascot\www\bin\ML_adapters\MS2RescoreAdapter.py", line 243, in <module>
    main()
  File "C:\Users\villek\githead\mascot-proj\mascot\www\bin\ML_adapters\MS2RescoreAdapter.py", line 218, in main
    _add_DeepLC_features(
  File "C:\Users\villek\githead\mascot-proj\mascot\www\bin\ML_adapters\MS2RescoreAdapter.py", line 126, in _add_DeepLC_features
    fgen.add_features(psm_list)
  File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\ms2rescore\feature_generators\deeplc.py", line 163, in add_features
    seq_df=self._psm_list_to_deeplc_peprec(psm_list_calibration)
  File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\ms2rescore\feature_generators\deeplc.py", line 211, in _psm_list_to_deeplc_peprec
    peprec = peprec.rename(
  File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\pandas\core\frame.py", line 3813, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\pandas\core\indexes\base.py", line 6070, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\pandas\core\indexes\base.py", line 6130, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['tr', 'seq', 'modifications'], dtype='object')] are in the [columns]"
The workaround in my script is to pass calibration_set_size=1.0 when round(calibration_set_size * len(psm_list[~psm_list['is_decoy']])) == 0. Then _psm_list_to_deeplc_peprec() gets a non-empty array and all is fine. Quite likely I shouldn't even use DeepLC if there aren't enough peptide matches!
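For anyone stuck on 3.0.2, the check described above can be written as a small helper. This is only a sketch: `choose_calibration_set_size` is a hypothetical name that is not part of ms2rescore, and returning `None` when there are no target PSMs at all (as a signal to skip DeepLC) is an assumption on top of the workaround, not something the library does.

```python
from typing import Optional


def choose_calibration_set_size(n_targets: int, requested: float = 0.15) -> Optional[float]:
    """Pick a calibration_set_size that does not yield an empty calibration set.

    n_targets is the number of non-decoy PSMs; requested is the fraction
    that would normally be passed as calibration_set_size.
    """
    # Assumption: with no target PSMs, calibration is impossible, so signal "skip DeepLC".
    if n_targets <= 0:
        return None
    # Same test as in the workaround: the requested fraction rounds down to zero PSMs,
    # so fall back to using all target PSMs for calibration.
    if round(requested * n_targets) == 0:
        return 1.0
    return requested
```

Here `n_targets` would be `len(psm_list[~psm_list['is_decoy']])`, and the returned value would be passed as `calibration_set_size` to `DeepLCFeatureGenerator`; a return value of `None` means the caller should probably not add DeepLC features at all.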
Hi, @vrkosk,
Thanks for reporting! We will look into this.
Best, Ralf
For internal reference:
_psm_list_to_deeplc_peprec() has already been removed in the timsRescore branch in favor of sending the PSMList directly to DeepLC. However, we should still look into how this behaves when there are not enough PSMs (or none) for calibration.
The uncaught exception was fixed with the DeepLCFeatureGenerator refactoring in https://github.com/compomics/ms2rescore/commit/8749ddfa8a4fac45ef5a59b02a821b9fd3ca7695. Better logging, and a clearer exception when not enough PSMs are present, were implemented in https://github.com/compomics/ms2rescore/commit/4a5ff63ab0eacea5f4ebd8c533f5a74bb80d5abe. Both have been merged with #122.