huggingface_hub [Feature request] Scikit learn integration

As an initial step, a simple integration in the Inference API would be similar to what is done in this repo.

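import joblib
from huggingface_hub import cached_download, hf_hub_url

# REPO_ID is the Hub repository that holds the serialized model
model = joblib.load(cached_download(
    hf_hub_url(REPO_ID, "sklearn_model.joblib")
))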

We could do something similar to the Table QA widget and add a structured-data-classification task.

Unless there's a bigger plan at the moment, this could be a simple enough thing to add from our side to showcase simple classification/regression use cases. We could upload some of the example models from the documentation to a scikit-learn-examples org and let users test them directly in the browser.

WDYT @julien-c?

Jun 09 '21 12:06 osanseviero

sounds good, but let's also think of strategies targeted at distribution/usage growth for this in parallel

on the technical side I was wondering last week if the types (and number) of inputs are encoded in the model, which maybe we could use to populate some metadata (here, input column names for instance)

Maybe a huggingface_hub.ScikitHubMixin or similar would handle that?
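As a rough sketch of the metadata idea: fitted scikit-learn estimators record `n_features_in_`, and (from scikit-learn 1.0, when fitted on a DataFrame with string column names) `feature_names_in_`. The helper name `input_metadata` and the stand-in object below are hypothetical and only illustrate how such attributes could be read; this is not an existing `huggingface_hub` API.

```python
from types import SimpleNamespace


def input_metadata(model):
    """Collect input metadata from a fitted estimator, if it exposes any.

    Hypothetical helper: it relies only on the `n_features_in_` and
    `feature_names_in_` attributes that fitted scikit-learn estimators set.
    """
    n_features = getattr(model, "n_features_in_", None)
    names = getattr(model, "feature_names_in_", None)
    return {
        "n_features": None if n_features is None else int(n_features),
        # Cast to plain str so the result is JSON-serializable.
        "feature_names": None if names is None else [str(n) for n in names],
    }


# Stand-in for a fitted estimator so the sketch runs without a training step.
fitted = SimpleNamespace(n_features_in_=2, feature_names_in_=["age", "income"])
metadata = input_metadata(fitted)
```

With a real estimator, `feature_names_in_` is absent when the model was fitted on a plain array, which is why the helper falls back to `None` instead of raising.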

Jun 09 '21 15:06 julien-c

re: Inference API, I think we might be facing a security issue, similarly to what happened with spaCy.

From https://joblib.readthedocs.io/en/latest/persistence.html:

joblib.dump() and joblib.load() are based on the Python pickle serialization model, which means that arbitrary Python code can be executed when loading a serialized object with joblib.load().

joblib.load() should therefore never be used to load objects from an untrusted source or otherwise you will introduce a security vulnerability in your program.
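To make the warning concrete, here is a minimal and deliberately harmless sketch. It uses the standard-library `pickle` module directly (the mechanism the joblib docs say `joblib.load()` builds on), and the class name `Payload` is made up for illustration. The point is only that the author of the file, not the person loading it, decides what gets called.

```python
import pickle


class Payload:
    """Object whose pickled form asks the loader to call a function."""

    def __reduce__(self):
        # On load, pickle calls eval("6 * 7") and returns its result.
        # A malicious file could name any importable callable here instead.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading runs the embedded call; no Payload instance is ever rebuilt.
result = pickle.loads(blob)
```

Here `result` is the value produced by the embedded call rather than a `Payload` object, so loading an untrusted model file amounts to running untrusted code.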

Jun 09 '21 19:06 osanseviero

security-wise I think we'll be able to find ways to make it work!

Jun 10 '21 07:06 julien-c

@osanseviero I've seen the work done in https://github.com/huggingface/huggingface_hub/pull/98 and https://github.com/huggingface/huggingface_hub/pull/170, can I merge?

And if not, is there anything left to do?

Aug 17 '22 13:08 Wauplin

cc @adrinjalali

Aug 18 '22 07:08 LysandreJik

With the work being done in https://github.com/skops-dev/skops/ and on the api-inference-community side (e.g. https://github.com/huggingface/api-inference-community/pull/67, https://github.com/huggingface/api-inference-community/pull/79, https://github.com/huggingface/api-inference-community/pull/83), support for scikit-learn is improving. It's ongoing work, but basic support is now there, and this issue can be closed since the work is being done in other repos.

Aug 18 '22 07:08 adrinjalali