modAL
Proof of concept for allowing non-sklearn estimators
Not sure if there is any desire for this feature, but in this PR I have sketched out a way to use virtually any estimator type with the `ActiveLearner` and `BayesianOptimizer` classes.
Motivation
This allows us to use other training and inference facilities, such as HuggingFace models trained via the `Trainer` class, AWS SageMaker `Estimator`s, etc. With this added flexibility, training and inference do not even need to run on the same hardware as the `modAL` code. This brings modAL's suite of sampling methods to many new applications, particularly resource-intensive deep learning models that don't fit well under the `sklearn` interface.
Implementation
Rather than call the classic `sklearn` estimator functions such as `fit`, `predict`, `predict_proba`, and `score` directly, this PR adds a layer of callables that can be overridden: `fit_func`, `predict_func`, `predict_proba_func`, and `score_func`.
```python
def __init__(self,
             estimator: BaseEstimator,
             query_strategy: Callable = uncertainty_sampling,
             X_training: Optional[modALinput] = None,
             y_training: Optional[modALinput] = None,
             bootstrap_init: bool = False,
             on_transformed: bool = False,
             force_all_finite: bool = True,
             fit_func: FitFunction = SKLearnFitFunction(),
             predict_func: PredictFunction = SKLearnPredictFunction(),
             predict_proba_func: PredictProbaFunction = SKLearnPredictProbaFunction(),
             score_func: ScoreFunction = SKLearnScoreFunction(),
             **fit_kwargs
             ) -> None:
```
I added SKLearn implementations of each as the defaults (and included their corresponding `Protocol` classes as well). Here's how `fit` works:
```python
class FitFunction(Protocol):
    def __call__(self, estimator: GenericEstimator, X, y, **kwargs) -> GenericEstimator:
        raise NotImplementedError

# ...

class SKLearnFitFunction(FitFunction):
    def __call__(self, estimator: BaseEstimator, X, y, **kwargs) -> BaseEstimator:
        return estimator.fit(X=X, y=y, **kwargs)
```
I'll also note that the changes in this PR don't break any of the existing tests.
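To show the pattern end to end without any sklearn dependency, here is a minimal, runnable sketch (all names below are illustrative toys, not part of the PR): the estimator exposes non-sklearn method names, and the learner reaches it only through the injected callables.

```python
from typing import Any, List, Protocol


class FitFunction(Protocol):
    def __call__(self, estimator: Any, X, y, **kwargs) -> Any: ...


class PredictFunction(Protocol):
    def __call__(self, estimator: Any, X, **kwargs) -> Any: ...


class MajorityClassModel:
    """Toy estimator that does NOT follow the sklearn API:
    it trains via train() and predicts via infer()."""

    def train(self, X: List, y: List) -> "MajorityClassModel":
        self.majority_ = max(set(y), key=y.count)
        return self

    def infer(self, X: List) -> List:
        return [self.majority_ for _ in X]


class MajorityFitFunction:
    def __call__(self, estimator, X, y, **kwargs):
        return estimator.train(X, y)


class MajorityPredictFunction:
    def __call__(self, estimator, X, **kwargs):
        return estimator.infer(X)


def fit_and_predict(estimator, X, y, X_new,
                    fit_func: FitFunction, predict_func: PredictFunction):
    # The "learner" only ever goes through the callables,
    # never through sklearn-specific method names.
    fit_func(estimator, X, y)
    return predict_func(estimator, X_new)


preds = fit_and_predict(MajorityClassModel(), [[0], [1], [2]], [1, 1, 0], [[3], [4]],
                        MajorityFitFunction(), MajorityPredictFunction())
print(preds)  # [1, 1]
```

This is the same dependency-injection shape the PR uses; swapping in the SKLearn defaults recovers the current behavior.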
Usage
When using SageMaker, we might implement `fit` and `predict_proba` in this manner:
```python
class CustomEstimator:
    hf_predictor: Union[HuggingFacePredictor, Predictor]
    hf_estimator: HuggingFace

    def __init__(self, hf_predictor: HuggingFacePredictor, hf_estimator: HuggingFace):
        self.hf_predictor = hf_predictor
        self.hf_estimator = hf_estimator


class CustomFitFunction(FitFunction):
    def __call__(self, estimator: CustomEstimator, X, y, **kwargs) -> CustomEstimator:
        # notice we don't use `y` -- the label is baked into the HuggingFace Dataset
        return estimator.hf_estimator.fit(X=X, **kwargs)


class CustomPredictProbaFunction(PredictProbaFunction):
    @staticmethod
    def hf_prediction_to_proba(predictions: Union[List[Dict], object],
                               positive_class_label: str = 'LABEL_1',
                               negative_class_label: str = 'LABEL_0') -> np.array:
        label_key: str = 'label'
        score_key: str = 'score'
        p = []
        for prediction in predictions:
            if positive_class_label == prediction[label_key]:
                score = prediction[score_key]
                p.append([score, 1.0 - score])
            if negative_class_label == prediction[label_key]:
                score = prediction[score_key]
                p.append([1.0 - score, score])
        return np.array(p)

    def __call__(self, estimator: CustomEstimator, X, **kwargs) -> np.array:
        return self.hf_prediction_to_proba(
            predictions=estimator.hf_predictor.predict(dict(inputs=X))
        )


estimator = CustomEstimator(hf_predictor=hf_predictor, hf_estimator=hf_estimator)
learner = ActiveLearner(
    estimator=estimator,
    fit_func=CustomFitFunction(),
    predict_proba_func=CustomPredictProbaFunction(),
    X_training=train_dataset,  # standard HuggingFace Dataset instead of your typical types for `X` in `sklearn`
)
```
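The label-to-probability conversion is easy to sanity-check in isolation. Below is a standalone, list-returning copy of `hf_prediction_to_proba` (the version above returns an `np.array`). One thing worth noting: the column order comes out as `[P(positive), P(negative)]`, so if any downstream code assumes sklearn's sorted-`classes_` ordering (`LABEL_0` first), the columns would need to be swapped.

```python
from typing import Dict, List


def hf_prediction_to_proba(predictions: List[Dict],
                           positive_class_label: str = 'LABEL_1',
                           negative_class_label: str = 'LABEL_0') -> List[List[float]]:
    """Plain-list copy of the conversion above, for sanity-checking."""
    p = []
    for prediction in predictions:
        score = prediction['score']
        if prediction['label'] == positive_class_label:
            p.append([score, 1.0 - score])   # column 0 holds P(positive)
        elif prediction['label'] == negative_class_label:
            p.append([1.0 - score, score])
    return p


proba = hf_prediction_to_proba([
    {'label': 'LABEL_1', 'score': 0.75},
    {'label': 'LABEL_0', 'score': 0.75},
])
print(proba)  # [[0.75, 0.25], [0.25, 0.75]]
```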
If you've made it this far, I'd ask that you forgive the clunkiness: this was a rough sketch of an idea I wanted to get written down before I forgot it. Anyway, I would love some feedback, and if you think this PR is worth finishing, let me know. For me, this would unlock a lot of really useful applications.
I found this section about using custom estimators in the documentation. Have you tried it?
I had not, so thanks for bringing that to my attention @mle-els. This at least suggests to me there is some desire to use custom estimators without Skorch (though Skorch is a fine option for many use cases).
> As long as your classifier follows the scikit-learn API, you can use it in your modAL workflow. (Really, all it needs is a `.fit(X, y)` and a `.predict(X)` method.) For instance, the ensemble model implemented in `Committee` can be given to an `ActiveLearner`.
I am not sure how accurate this is. Glancing through the `BaseLearner` class, I've tracked the following uses of the `estimator` attribute:

- `.fit()`
- `.predict()`
- `.predict_proba()`
- `.score()`
- `.estimators_`

And in `ActiveLearner`:

- `.classes_`
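Assuming the list above is exhaustive (committee-specific attributes like `.estimators_` aside), the documentation's duck-typing claim would mean a custom classifier only needs this surface. The sketch below uses hypothetical toy logic purely to exercise each method; it is not derived from `sklearn` at all:

```python
from typing import List, Sequence


class MinimalClassifier:
    """Toy classifier exposing only the surface that (per the list above)
    BaseLearner/ActiveLearner actually touch. Not sklearn-derived."""

    def fit(self, X: Sequence, y: Sequence) -> "MinimalClassifier":
        self.classes_ = sorted(set(y))                       # used by ActiveLearner
        self._majority = max(set(y), key=list(y).count)
        return self

    def predict(self, X: Sequence) -> List:
        return [self._majority for _ in X]

    def predict_proba(self, X: Sequence) -> List[List[float]]:
        # degenerate probabilities: all mass on the majority class
        return [[1.0 if c == self._majority else 0.0 for c in self.classes_]
                for _ in X]

    def score(self, X: Sequence, y: Sequence) -> float:
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


clf = MinimalClassifier().fit([[0], [1], [2]], [0, 1, 1])
print(clf.classes_)                    # [0, 1]
print(clf.predict_proba([[3]]))        # [[0.0, 1.0]]
print(clf.score([[0], [1]], [1, 1]))   # 1.0
```

Whether this actually survives a full `ActiveLearner` query loop is exactly what the proposed standardization would pin down.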
On top of this, I am wondering about adding some flexibility to the allowed types for `X`. Typically in `modAL`, `X` is `Union[np.ndarray, sp.csr_matrix]`, which works great as long as the estimator is SKLearn conformant. But maybe a generic type could replace it (e.g. to allow HuggingFace `Dataset` instances, among others).
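One possible direction, sketched here with purely hypothetical names (none of this exists in `modAL`), is to make the input container a `TypeVar` so that the protocol and the estimator agree on whatever `X` type they like:

```python
from typing import Protocol, TypeVar

# Hypothetical: the input type becomes a type variable instead of
# a fixed Union[np.ndarray, sp.csr_matrix].
XType = TypeVar("XType")


class FitFunction(Protocol[XType]):
    def __call__(self, estimator, X: XType, y=None, **kwargs): ...


class DatasetLike:
    """Stand-in for a non-array input, e.g. a HuggingFace Dataset."""
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)


class CountingModel:
    """Toy estimator whose fit() accepts a DatasetLike rather than an ndarray."""
    def fit(self, X: DatasetLike, y=None) -> "CountingModel":
        self.n_seen_ = len(X)
        return self


class DatasetFitFunction:
    def __call__(self, estimator: CountingModel, X: DatasetLike,
                 y=None, **kwargs) -> CountingModel:
        return estimator.fit(X, y)


fit_func: FitFunction[DatasetLike] = DatasetFitFunction()
model = fit_func(CountingModel(), DatasetLike(["a", "b", "c"]))
print(model.n_seen_)  # 3
```

A type checker could then flag mismatches between an estimator's expected `X` and what the learner is fed, without `modAL` hard-coding array types.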
I am proposing adding a bit of standardization around this interface to make custom estimators more reliable. I have tried making my estimator conform to the SKLearn spec, but 1) it was pretty difficult, and 2) the difficulty was mostly concentrated in implementing attributes and methods that aren't even needed here in `modAL`.