modAL
Proof of concept for allowing non-sklearn estimators
Not sure if there is any desire for this feature, but in this PR I have sketched out a way to use virtually any estimator type with the `ActiveLearner` and `BayesianOptimizer` classes.
Motivation
This allows us to use other training and inference facilities, such as HuggingFace models trained via the `Trainer` class, AWS SageMaker `Estimator`s, etc. With this added flexibility, training and inference do not even need to run on the same hardware as the `modAL` code. This brings modAL's suite of sampling methods to many new applications, particularly resource-intensive deep learning models that don't fit well under the `sklearn` interface.
Implementation
Rather than call the classic `sklearn` estimator functions such as `fit`, `predict`, `predict_proba`, and `score` directly, this PR adds a layer of callables that can be overridden: `fit_func`, `predict_func`, `predict_proba_func`, and `score_func`.
```python
def __init__(self,
             estimator: BaseEstimator,
             query_strategy: Callable = uncertainty_sampling,
             X_training: Optional[modALinput] = None,
             y_training: Optional[modALinput] = None,
             bootstrap_init: bool = False,
             on_transformed: bool = False,
             force_all_finite: bool = True,
             fit_func: FitFunction = SKLearnFitFunction(),
             predict_func: PredictFunction = SKLearnPredictFunction(),
             predict_proba_func: PredictProbaFunction = SKLearnPredictProbaFunction(),
             score_func: ScoreFunction = SKLearnScoreFunction(),
             **fit_kwargs
             ) -> None:
```
I added SKLearn implementations of each as the defaults (and included their corresponding `Protocol` classes as well). Here's how `fit` works:
```python
class FitFunction(Protocol):
    def __call__(self, estimator: GenericEstimator, X, y, **kwargs) -> GenericEstimator:
        raise NotImplementedError

# ...

class SKLearnFitFunction(FitFunction):
    def __call__(self, estimator: BaseEstimator, X, y, **kwargs) -> BaseEstimator:
        return estimator.fit(X=X, y=y, **kwargs)
```
I'll also note that the changes in this PR don't break any of the existing tests.
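To show the pattern end to end without any sklearn dependency, here is a minimal, runnable sketch (all names below are illustrative toys, not part of the PR): the estimator exposes non-sklearn method names, and the learner reaches it only through the injected callables.

```python
from typing import Any, List, Protocol


class FitFunction(Protocol):
    def __call__(self, estimator: Any, X, y, **kwargs) -> Any: ...


class PredictFunction(Protocol):
    def __call__(self, estimator: Any, X, **kwargs) -> Any: ...


class MajorityClassModel:
    """Toy estimator that does NOT follow the sklearn API:
    it trains via train() and predicts via infer()."""

    def train(self, X: List, y: List) -> "MajorityClassModel":
        self.majority_ = max(set(y), key=y.count)
        return self

    def infer(self, X: List) -> List:
        return [self.majority_ for _ in X]


class MajorityFitFunction:
    def __call__(self, estimator, X, y, **kwargs):
        return estimator.train(X, y)


class MajorityPredictFunction:
    def __call__(self, estimator, X, **kwargs):
        return estimator.infer(X)


def fit_and_predict(estimator, X, y, X_new,
                    fit_func: FitFunction, predict_func: PredictFunction):
    # The "learner" only ever goes through the callables,
    # never through sklearn-specific method names.
    fit_func(estimator, X, y)
    return predict_func(estimator, X_new)


preds = fit_and_predict(MajorityClassModel(), [[0], [1], [2]], [1, 1, 0], [[3], [4]],
                        MajorityFitFunction(), MajorityPredictFunction())
print(preds)  # [1, 1]
```

This is the same dependency-injection shape the PR uses; swapping in the SKLearn defaults recovers the current behavior.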
Usage
When using SageMaker, we might implement `fit` and `predict_proba` in this manner:
```python
class CustomEstimator:
    hf_predictor: Union[HuggingFacePredictor, Predictor]
    hf_estimator: HuggingFace

    def __init__(self, hf_predictor: HuggingFacePredictor, hf_estimator: HuggingFace):
        self.hf_predictor = hf_predictor
        self.hf_estimator = hf_estimator


class CustomFitFunction(FitFunction):
    def __call__(self, estimator: CustomEstimator, X, y, **kwargs) -> CustomEstimator:
        # notice we don't use `y` -- the label is baked into the HuggingFace Dataset
        return estimator.hf_estimator.fit(X=X, **kwargs)


class CustomPredictProbaFunction(PredictProbaFunction):
    @staticmethod
    def hf_prediction_to_proba(predictions: Union[List[Dict], object],
                               positive_class_label: str = 'LABEL_1',
                               negative_class_label: str = 'LABEL_0') -> np.array:
        label_key: str = 'label'
        score_key: str = 'score'
        p = []
        for prediction in predictions:
            if positive_class_label == prediction[label_key]:
                score = prediction[score_key]
                p.append([score, 1.0 - score])
            if negative_class_label == prediction[label_key]:
                score = prediction[score_key]
                p.append([1.0 - score, score])
        return np.array(p)

    def __call__(self, estimator: CustomEstimator, X, **kwargs) -> np.array:
        return self.hf_prediction_to_proba(
            predictions=estimator.hf_predictor.predict(dict(inputs=X))
        )


estimator = CustomEstimator(hf_predictor=hf_predictor, hf_estimator=hf_estimator)
learner = ActiveLearner(
    estimator=estimator,
    fit_func=CustomFitFunction(),
    predict_proba_func=CustomPredictProbaFunction(),
    X_training=train_dataset,  # standard HuggingFace Dataset instead of your typical types for `X` in `sklearn`
)
```
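The label-to-probability conversion is easy to sanity-check in isolation. Below is a standalone, list-returning copy of `hf_prediction_to_proba` (the version above returns an `np.array`). One thing worth noting: the column order comes out as `[P(positive), P(negative)]`, so if any downstream code assumes sklearn's sorted-`classes_` ordering (`LABEL_0` first), the columns would need to be swapped.

```python
from typing import Dict, List


def hf_prediction_to_proba(predictions: List[Dict],
                           positive_class_label: str = 'LABEL_1',
                           negative_class_label: str = 'LABEL_0') -> List[List[float]]:
    """Plain-list copy of the conversion above, for sanity-checking."""
    p = []
    for prediction in predictions:
        score = prediction['score']
        if prediction['label'] == positive_class_label:
            p.append([score, 1.0 - score])   # column 0 holds P(positive)
        elif prediction['label'] == negative_class_label:
            p.append([1.0 - score, score])
    return p


proba = hf_prediction_to_proba([
    {'label': 'LABEL_1', 'score': 0.75},
    {'label': 'LABEL_0', 'score': 0.75},
])
print(proba)  # [[0.75, 0.25], [0.25, 0.75]]
```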
If you've made it this far, I'd ask that you forgive the clunkiness: this was a rough sketch of an idea I wanted to get written down before I forgot it. Anyway, I would love some feedback, and if you think this PR is worth finishing, let me know. For me, this would unlock a lot of really useful applications.
I found this section about using custom estimators in the documentation. Have you tried it?
I had not, so thanks for bringing that to my attention @mle-els. This at least suggests to me there is some desire to use custom estimators without Skorch (though Skorch is a fine option for many use cases).
> As long as your classifier follows the scikit-learn API, you can use it in your modAL workflow. (Really, all it needs is a `.fit(X, y)` and a `.predict(X)` method.) For instance, the ensemble model implemented in `Committee` can be given to an `ActiveLearner`.
I am not sure how accurate this is. Glancing through the `BaseLearner` class, I've tracked the following uses of the `estimator` attribute:

- `.fit()`
- `.predict()`
- `.predict_proba()`
- `.score()`
- `.estimators_`

And in `ActiveLearner`:

- `.classes_`
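Assuming the list above is exhaustive (committee-specific attributes like `.estimators_` aside), the documentation's duck-typing claim would mean a custom classifier only needs this surface. The sketch below uses hypothetical toy logic purely to exercise each method; it is not derived from `sklearn` at all:

```python
from typing import List, Sequence


class MinimalClassifier:
    """Toy classifier exposing only the surface that (per the list above)
    BaseLearner/ActiveLearner actually touch. Not sklearn-derived."""

    def fit(self, X: Sequence, y: Sequence) -> "MinimalClassifier":
        self.classes_ = sorted(set(y))                       # used by ActiveLearner
        self._majority = max(set(y), key=list(y).count)
        return self

    def predict(self, X: Sequence) -> List:
        return [self._majority for _ in X]

    def predict_proba(self, X: Sequence) -> List[List[float]]:
        # degenerate probabilities: all mass on the majority class
        return [[1.0 if c == self._majority else 0.0 for c in self.classes_]
                for _ in X]

    def score(self, X: Sequence, y: Sequence) -> float:
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


clf = MinimalClassifier().fit([[0], [1], [2]], [0, 1, 1])
print(clf.classes_)                    # [0, 1]
print(clf.predict_proba([[3]]))        # [[0.0, 1.0]]
print(clf.score([[0], [1]], [1, 1]))   # 1.0
```

Whether this actually survives a full `ActiveLearner` query loop is exactly what the proposed standardization would pin down.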
On top of this, I am wondering about adding some flexibility to the allowed types for `X`. Typically in `modAL`, `X` is `Union[np.ndarray, sp.csr_matrix]`, which works great as long as the estimator is SKLearn conformant. But maybe a generic type could replace it (e.g. to allow HuggingFace `Dataset` instances, among others).
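One possible direction, sketched here with purely hypothetical names (none of this exists in `modAL`), is to make the input container a `TypeVar` so that the protocol and the estimator agree on whatever `X` type they like:

```python
from typing import Protocol, TypeVar

# Hypothetical: the input type becomes a type variable instead of
# a fixed Union[np.ndarray, sp.csr_matrix].
XType = TypeVar("XType")


class FitFunction(Protocol[XType]):
    def __call__(self, estimator, X: XType, y=None, **kwargs): ...


class DatasetLike:
    """Stand-in for a non-array input, e.g. a HuggingFace Dataset."""
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)


class CountingModel:
    """Toy estimator whose fit() accepts a DatasetLike rather than an ndarray."""
    def fit(self, X: DatasetLike, y=None) -> "CountingModel":
        self.n_seen_ = len(X)
        return self


class DatasetFitFunction:
    def __call__(self, estimator: CountingModel, X: DatasetLike,
                 y=None, **kwargs) -> CountingModel:
        return estimator.fit(X, y)


fit_func: FitFunction[DatasetLike] = DatasetFitFunction()
model = fit_func(CountingModel(), DatasetLike(["a", "b", "c"]))
print(model.n_seen_)  # 3
```

A type checker could then flag mismatches between an estimator's expected `X` and what the learner is fed, without `modAL` hard-coding array types.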
I am proposing adding a bit of standardization around this interface to make custom estimators more reliable. I have tried making my estimator conform to the SKLearn spec, but 1) it was pretty difficult, and 2) the difficulty was mostly concentrated in implementing attributes and methods that aren't even needed here in `modAL`.