Implementation of PLS-DA and OPLS-DA for high-dimensional data, such as mass spectrometry (MS) data in metabolomics.


This package implements PLS-DA and OPLS-DA for the analysis of high-dimensional data derived from, for example, mass spectrometry in metabolomics. It also provides visualization of score plots, the S-plot, jack-knife confidence intervals for the loading profile, and the number of mis-classifications in cross validation.


This package was developed with Python 3.7 and requires the following packages:

numpy 1.17.2
scipy 1.3.1
matplotlib 3.1.3
tqdm 4.64.0

These packages, or newer versions of them, can be installed with pip.


This package only supports binary classification, so it cannot directly handle data with three or more classes. An alternative is pair-wise classification. As Prof. Richard G. Brereton pointed out in his paper [1], binary classification is recommended for PLS-related methods, and multi-class classification problems are not well suited to PLS.
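The pair-wise alternative can be sketched as follows. This is a minimal illustration, assuming the data are held as a list of sample rows with one label per sample; the helper name `pairwise_subsets` is hypothetical and not part of this package. Each subset it yields contains exactly two classes and could then be analyzed with a binary model.

```python
from itertools import combinations


def pairwise_subsets(X, labels):
    """Yield (class_a, class_b, X_pair, labels_pair) for every pair of classes.

    Hypothetical helper for illustration, not part of this package.
    """
    classes = sorted(set(labels))
    for a, b in combinations(classes, 2):
        # Keep only the samples that belong to one of the two classes.
        keep = [i for i, g in enumerate(labels) if g in (a, b)]
        yield a, b, [X[i] for i in keep], [labels[i] for i in keep]


# Toy data: 6 samples, 2 variables, 3 classes.
X = [[1.0, 2.0], [1.1, 2.1], [3.0, 0.5], [3.2, 0.4], [5.0, 5.0], [5.1, 4.9]]
labels = ["A", "A", "B", "B", "C", "C"]

pairs = list(pairwise_subsets(X, labels))
for a, b, X_pair, labels_pair in pairs:
    print(a, "vs", b, "->", len(X_pair), "samples")
```

With k classes this produces k(k-1)/2 binary problems, so the number of models grows quickly as classes are added.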


The latest release can be downloaded here. Uncompress the package and set the Python working directory to that folder. Since the current version is not packaged, all modules must be run from that working directory.

Running the code

# import cross validation module
import cross_validation
# import plotting functions
import plotting
  • Initialize cross validation object for 10-fold cross validation using OPLS-DA.
    cv = cross_validation.CrossValidation(kfold=10, estimator="opls")

    kfold: Number of folds in cross validation. For leave-one-out cross validation, set it to n, the number of samples.
    estimator: The classifier. Valid values are opls and pls. Defaults to opls.
    scaler: Scaling of the variable matrix. Valid values are:

    • uv: zero mean and unit variance scaling.
    • pareto: Pareto scaling. This is the default.
    • minmax: min-max scaling so that the range for each variable is between 0 and 1.
    • mean: zero mean scaling.
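The four scaling options can be illustrated with a short NumPy sketch. This is not the package's own code: the function name `scale` is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption. It follows the usual definitions, in which unit variance scaling divides the centered data by the standard deviation and Pareto scaling divides by its square root.

```python
import numpy as np


def scale(X, method="pareto"):
    """Column-wise scaling of an n-by-p matrix (illustrative, not the package code)."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    if method == "mean":
        return X - center
    if method == "minmax":
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / (hi - lo)
    std = X.std(axis=0, ddof=1)  # sample standard deviation (an assumption)
    if method == "uv":
        return (X - center) / std
    if method == "pareto":
        return (X - center) / np.sqrt(std)
    raise ValueError(f"unknown scaling method: {method}")


X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]])
X_uv = scale(X, "uv")
X_pareto = scale(X, "pareto")
X_minmax = scale(X, "minmax")
X_mean = scale(X, "mean")
```

Constant columns would cause a division by zero in all but the mean option, so such variables are best removed before scaling.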
  • Fit the model to the variable matrix X and the labels.

    X is the variable matrix of size n (rows) by p (columns), where n is the number of samples and p is the number of variables. labels can be numeric values or strings, and the number of elements must equal n.
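As an illustration of the expected input layout, the sketch below builds a synthetic X and labels and checks them before fitting. The data, the group names, and the checks are assumptions made for this example only; they are not part of the package.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 20, 500  # 20 samples (rows), 500 variables (columns)
X = rng.normal(size=(n, p))
labels = ["control"] * 10 + ["treated"] * 10

# Basic checks before fitting: one label per row, exactly two classes.
assert X.shape == (n, p)
assert len(labels) == X.shape[0]
groups = sorted(set(labels))
assert len(groups) == 2, "only binary classification is supported"
```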

  • Permutation test [5, 6]

    To identify whether the constructed model is overfitted, a permutation test is generally applied: the labels are randomized repeatedly, and model construction and prediction are performed on each set of randomized labels. This package adopts the same strategy, with the following parameter:

    num_perms: Number of permutations. Defaults to 10000.

    The p value, i.e., the significance of the constructed model, can be obtained using one of the following metrics:


    "q2": Q2.
    "error": Mis-classification error rate.

    The p value is calculated as [7]
             p = ((No. of permutations with error rate <= error rate of the original model) + 1) / (n + 1)
    if the mis-classification rate (i.e., parameter error) is used as the metric, or
             p = ((No. of permutations with Q2 >= Q2 of the original model) + 1) / (n + 1)
    if Q2 (i.e., parameter q2) is used, where n is the number of permutations.
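The two formulas above can be written as a small function. This is a sketch of the calculation only, not the package's implementation; the function name `permutation_p_value` and all numeric values are hypothetical.

```python
def permutation_p_value(observed, permuted, metric="q2"):
    """Empirical p value: (count + 1) / (n + 1), with n = number of permutations."""
    if metric == "q2":
        # Permutations reaching a Q2 at least as high as the original model.
        count = sum(1 for v in permuted if v >= observed)
    elif metric == "error":
        # Permutations reaching an error rate at least as low as the original model.
        count = sum(1 for v in permuted if v <= observed)
    else:
        raise ValueError("metric must be 'q2' or 'error'")
    return (count + 1) / (len(permuted) + 1)


# Hypothetical values for illustration only.
perm_q2 = [0.05, -0.10, 0.20, 0.62, 0.01, -0.30, 0.15, 0.08, 0.70]
p_q2 = permutation_p_value(0.60, perm_q2, metric="q2")

perm_err = [0.45, 0.50, 0.40, 0.10, 0.55, 0.48, 0.52, 0.47, 0.42]
p_err = permutation_p_value(0.10, perm_err, metric="error")
```

Adding 1 to both numerator and denominator means the p value can never be exactly zero; its smallest possible value is 1 / (n + 1), which is why a large number of permutations is needed to resolve small p values.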

  • Visualization of results.
    # construct the plotting object
    plots = plotting.Plots(cv)
    • Number of mis-classifications at different principal components:

    • Cross validated score plot:


      For OPLS-DA, the predictive scores t_p vs the first orthogonal scores t_o are shown; for PLS, the first and second components are shown.

    • S-plot (only suitable for OPLS-DA).

    • Loading profile with Jack-knife confidence intervals (only suitable for OPLS-DA).

      means, intervals = plots.jackknife_loading_plot(alpha=0.05)

      where alpha is the significance level (default 0.05), means are the mean loadings, and intervals are the jack-knife confidence intervals.

    • Permutation test plot



      • "q2": Q2.
      • "error": Mis-classification error rate.

      Two subplots will be generated to show the permutation test results:

      • Correlation of permuted y to original y vs the model metric.
      • Distribution of the permutation model metric, which is used to calculate the p value.
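The x-axis quantity of the first subplot, the correlation of permuted y to original y, can be sketched as below. This assumes the Pearson correlation computed on labels coded as 0 and 1; whether the package uses exactly this measure is an assumption, and the helper `pearson` is written here for illustration only.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


random.seed(0)
y = [0] * 10 + [1] * 10  # original binary labels

correlations = []
for _ in range(5):
    y_perm = y[:]
    random.shuffle(y_perm)  # shuffling keeps the class sizes unchanged
    correlations.append(pearson(y, y_perm))
```

A correlation near 1 means the permuted labels are almost identical to the original ones, so the corresponding model metric is expected to be close to that of the original model.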

    For all of the above plots, setting save_plot=True and file_name=some_string.png saves the plot to some_string.png with dpi=1200.

  1. Model assessment.
    # R2X
    # Q2
    # R2y
    # Number of mis-classifications
    To check R2X and R2y at the optimal number of components, i.e., cv.optimal_component_num, use cv.R2X and cv.R2y.
  2. Access other metrics.
    • Cross validated predictive scores: cv.scores
    • Cross validated predictive loadings: cv.loadings_cv
    • Optimal number of components determined by cross validation: cv.optimal_component_num
  3. Prediction of new data.
    predicted_scores = cv.predict(X, return_scores=False)
    To predict the class, use
    predicted_groups = (predicted_scores >= 0).astype(int)
    This will output values of 0 and 1 to indicate the groups of samples submitted for prediction. cv object has the attribute groups storing the group names which were assigned in labels input for training. To access the group names after prediction, use
    print([cv.groups[g] for g in predicted_groups])
    Setting return_scores=True returns the predictive scores for OPLS.
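The thresholding step can be followed on concrete numbers. In this sketch the score array and the group names are hypothetical stand-ins for the output of cv.predict and for cv.groups, so it runs without a fitted model; it assumes group names can be looked up by the indices 0 and 1.

```python
import numpy as np

# Hypothetical predicted scores, standing in for the output of cv.predict(X).
predicted_scores = np.array([-0.8, 0.3, 0.0, -0.1, 1.2])
# Hypothetical group names, standing in for cv.groups.
groups = ["control", "treated"]

# Scores >= 0 map to index 1, negative scores map to index 0.
predicted_groups = (predicted_scores >= 0).astype(int)
predicted_names = [groups[g] for g in predicted_groups]
print(predicted_names)
```

Note that a score of exactly 0 falls into group 1 because the comparison is `>=`.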
  4. Other methods.
    cv provides the method reset_optimal_num_component to set the optimal number of components manually, instead of using the default, which is the number of components giving the minimal number of mis-classifications.


Nai-ping Dong
Email: [email protected]


This project is licensed under the Apache 2.0 License - see the LICENSE for details.


