Training new models

Open prvst opened this issue 1 year ago • 3 comments

Hello, is there any documentation detailing how to train new models?

prvst, Sep 09 '24 18:09

Hi @prvst,

Unfortunately, not yet. However, if you have some machine learning experience, it should not be too hard to try.

The ms2pip get-training-data command can be used to generate features and targets for training. Then the XGBoost and hyperopt Python packages can be used to optimize hyperparameters and to train new models. Note that the same features are used for each target ion type.

From the most recent MS²PIP paper supplementary:

All models were trained with the XGBoost machine learning algorithm (20) and hyperparameter optimization was performed with the Hyperopt (21) Python package using a four-fold cross-validation evaluation scheme. The maximal number of boosting rounds was fixed at 400 and early stopping was set to 10 boosting rounds. The selected hyperparameters are listed on supplemental Table S2.
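The search described in that paragraph could be sketched roughly as follows. This is only an illustration under assumptions, not the code used for the paper: the search ranges, the `reg:squarederror` objective, the `hist` tree method, and the `features`/`targets` arrays (stand-ins for whatever you load from the `ms2pip get-training-data` output) are all hypothetical. Only the four folds, the 400 boosting rounds, and the early stopping of 10 rounds come from the text above.

```python
NUM_BOOST_ROUND = 400        # maximal number of boosting rounds (from the paper)
EARLY_STOPPING_ROUNDS = 10   # early stopping patience (from the paper)
N_FOLDS = 4                  # four-fold cross-validation (from the paper)

# Assumptions, not stated in the thread: objective, metric and tree method.
FIXED_PARAMS = {
    "tree_method": "hist",
    "grow_policy": "lossguide",
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
}


def to_xgb_params(sampled):
    """Merge sampled values with the fixed settings and cast integer parameters."""
    params = {**sampled, **FIXED_PARAMS}
    # Hyperopt's quniform yields floats; XGBoost expects integers for these.
    for key in ("max_depth", "max_leaves"):
        params[key] = int(params[key])
    return params


def search(features, targets, max_evals=50):
    """Run a Hyperopt TPE search for one ion type, scoring by 4-fold CV RMSE."""
    # Imported here so the helper above works without the heavy dependencies.
    import xgboost as xgb
    from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

    dtrain = xgb.DMatrix(features, label=targets)

    # Hypothetical search ranges, chosen only for illustration.
    space = {
        "eta": hp.uniform("eta", 0.01, 0.1),
        "max_depth": hp.quniform("max_depth", 4, 18, 1),
        "max_leaves": hp.quniform("max_leaves", 10, 500, 1),
        "min_child_weight": hp.quniform("min_child_weight", 1, 500, 1),
        "gamma": hp.uniform("gamma", 0.0, 1.0),
        "lambda": hp.uniform("lambda", 0.0, 1.0),
        "alpha": hp.uniform("alpha", 0.0, 5.0),
        "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
        "subsample": hp.uniform("subsample", 0.5, 1.0),
    }

    def objective(sampled):
        cv = xgb.cv(
            to_xgb_params(sampled),
            dtrain,
            num_boost_round=NUM_BOOST_ROUND,
            nfold=N_FOLDS,
            early_stopping_rounds=EARLY_STOPPING_ROUNDS,
            seed=0,
        )
        # Lowest mean validation RMSE over the boosting rounds that were run.
        return {"loss": float(min(cv["test-rmse-mean"])), "status": STATUS_OK}

    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=max_evals, trials=Trials())
    return to_xgb_params(best)
```

Since the same features are used for each target ion type, `search(features, targets)` would be called once per ion type, changing only the `targets` array.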

Table S2. - The optimal hyperparameters for each new b- and y-ion MS²PIP model, as determined during hyperparameter optimization.

| Model | Eta | Max depth | Grow policy | Max leaves | Min child weight | Gamma | Lambda | Alpha | Colsample by tree | Sub-sample |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HCD 2021 (b-ions) | 0.08060612330262913 | 18 | Lossguide | 117 | 500 | 0.031142279181653326 | 0.2724553826622634 | 3.4 | 0.891381182690278 | 0.7 |
| HCD 2021 (y-ions) | 0.047107785048838 | 18 | Lossguide | 490 | 4 | 0.37528441949267444 | 0.35150807248415 | 3.3 | 0.6122042447952851 | 0.6 |
| Immunopeptide HCD (b-ions) | 0.09263630381479264 | 17 | Lossguide | 131 | 16 | 0.6048882172751935 | 0.9332236183206803 | 4.6 | 0.9898165069470042 | 0.7 |
| Immunopeptide HCD (y-ions) | 0.0594145790364741 | 17 | Lossguide | 302 | 3 | 0.03338151150211477 | 0.4430375595950531 | 4.5 | 0.9389820388602939 | 0.7 |
| CID-TMT (b-ions) | 0.09788304115318931 | 16 | Lossguide | 100 | 175 | 0.36436201158266845 | 0 | 3.1 | 0.9307205074180112 | 0.8 |
| CID-TMT (y-ions) | 0.07323226418651792 | 15 | Lossguide | 15 | 84 | 0.06487830003469364 | 0 | 0.7 | 0.7980941914509116 | 0.7 |
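To reuse one of these rows as a starting point, the columns can be mapped onto XGBoost parameter names. The sketch below is illustrative only: the column-to-parameter mapping is my reading of the table headers, and `train_model` with its `x_train`/`y_train`/`x_val`/`y_val` arguments, the `hist` tree method, and the `reg:squarederror` objective are assumptions rather than anything confirmed in this thread.

```python
# Table S2 columns in order, the XGBoost parameter each is assumed to map to,
# and the type to cast the value to.
S2_TO_XGB = (
    ("Eta", "eta", float),
    ("Max depth", "max_depth", int),
    ("Grow policy", "grow_policy", str.lower),
    ("Max leaves", "max_leaves", int),
    ("Min child weight", "min_child_weight", float),
    ("Gamma", "gamma", float),
    ("Lambda", "lambda", float),
    ("Alpha", "alpha", float),
    ("Colsample by tree", "colsample_bytree", float),
    ("Sub-sample", "subsample", float),
)


def row_to_params(row):
    """Convert the ten whitespace-separated values of a Table S2 row to a dict."""
    values = row.split()
    if len(values) != len(S2_TO_XGB):
        raise ValueError(f"expected {len(S2_TO_XGB)} values, got {len(values)}")
    return {name: cast(value) for (_, name, cast), value in zip(S2_TO_XGB, values)}


# Values copied from the "HCD 2021 (y-ions)" row, without the model name.
HCD2021_Y = row_to_params(
    "0.047107785048838 18 Lossguide 490 4 0.37528441949267444 "
    "0.35150807248415 3.3 0.6122042447952851 0.6"
)


def train_model(params, x_train, y_train, x_val, y_val):
    """Train one booster with at most 400 rounds and early stopping after 10."""
    import xgboost as xgb  # imported here so row_to_params needs no dependencies

    dtrain = xgb.DMatrix(x_train, label=y_train)
    dval = xgb.DMatrix(x_val, label=y_val)
    # Assumption: lossguide growth is paired with the "hist" tree method and a
    # squared-error regression objective; neither is stated in the thread.
    full_params = {**params, "tree_method": "hist", "objective": "reg:squarederror"}
    return xgb.train(
        full_params,
        dtrain,
        num_boost_round=400,
        evals=[(dval, "validation")],
        early_stopping_rounds=10,
    )
```

The returned booster has a `save_model` method that can write the trained model to a file.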

Once you have new XGBoost models for each ion type, they can be saved to a file in your ~/.ms2pip directory and added to the ms2pip.constants.MODELS dictionary. Then they should be available for usage.

Do let us know if the models you have in mind would be of interest to the wider community. In that case, we could definitely consider shipping the models with MS²PIP.

Best, Ralf

RalfG, Oct 01 '24 07:10

Thanks! Can this be used for the training? train_xgboost_c.py

prvst, Oct 01 '24 16:10

That script is largely out of date and should be removed or updated. Nevertheless, it could still serve as a template. Almost all parts that refer to C code can be ignored.

RalfG, Nov 03 '24 21:11