
A dtype converter transformer

Open · adrinjalali opened this issue 2 years ago • 5 comments

When uploading models to the hub and using widgets or the inference API, the data passed to the model does not have the same dtypes as the data used during training. Some of these issues can be fixed by work on the https://github.com/huggingface/api-inference-community side, but we could also provide an sklearn transformer which stores the dtypes and format of the data during fit and converts the input back to that format during transform. Users could then add it at the beginning of their pipeline to fix dtype mismatches during predict.
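A rough sketch of what such a transformer could look like (the class name is made up here, and it assumes pandas DataFrame-like input):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DtypeConverter(TransformerMixin, BaseEstimator):
    """Remember the dtypes seen during fit and re-apply them during transform."""

    def fit(self, X, y=None):
        # Store one dtype per column of the training data.
        self.dtypes_ = X.dtypes.to_dict()
        return self

    def transform(self, X):
        # Coerce incoming data (e.g. the all-string payloads coming from a
        # widget or the inference API) back to the training dtypes.
        X = pd.DataFrame(X, columns=list(self.dtypes_))
        return X.astype(self.dtypes_)

Since the fitted dtypes live on an attribute, the converter would travel with the rest of the pipeline when the model is persisted.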

adrinjalali · Jul 18 '22 10:07

From a user perspective, I would probably not want to change my model pipeline to accommodate potential dtype issues when the model is hosted on the hub. Ideally, I would like to keep my model pipeline as is, but it's also not super trivial to insert new steps into an existing pipeline.
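To illustrate that point, adding a step to an already-built pipeline currently means reconstructing it from its steps list and refitting; the dtype step below is just a stand-in:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# There is no in-place "insert step" API, so the pipeline has to be rebuilt
# with the extra step prepended (and then refit).
dtype_step = ("dtypes", FunctionTransformer())  # stand-in for a dtype converter
new_pipe = Pipeline([dtype_step] + list(pipe.steps))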

But I agree that it would be nice to provide some convenience for this. IIUC, to use inference, the user already writes a custom handler. Maybe we can provide a function that basically generates a code snippet for the handler? Just off the top of my head:

def convert_dtypes(X, dtypes):
    # Cast each column back to the dtype recorded at training time
    # (assuming X is a pandas DataFrame).
    X_conv = X.astype(dtypes)
    return X_conv

def suggest_dtypes(X):
    # Record the training-time dtypes so they can be re-applied at inference.
    dtypes = {col: str(dtype) for col, dtype in X.dtypes.items()}
    return f"Add this snippet to your handler: inputs = skops.utils.convert_dtypes(inputs, {dtypes})"

Maybe there are other possibilities as well, e.g. a wrapper (roughly sketched below); just throwing this out there.
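A wrapper could look something like this (again just a sketch, all names made up):

import pandas as pd


class DtypeWrapper:
    """Wrap a fitted estimator and coerce inputs to the training dtypes."""

    def __init__(self, estimator, dtypes):
        self.estimator = estimator
        self.dtypes = dtypes  # e.g. the mapping recorded by suggest_dtypes

    def predict(self, X):
        X = pd.DataFrame(X).astype(self.dtypes)
        return self.estimator.predict(X)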

BenjaminBossan · Jul 18 '22 12:07

I'd say the overhead of adding this, and making the backend understand it, is higher than asking the user to add a step at the beginning of their pipeline.

Of course we could also support this, but to me it seems a lot harder to write and much more code on the user side than a one-liner adding the transformer.
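For example, assuming a DtypeConverter transformer along the lines sketched above, the user-side change would amount to:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# DtypeConverter is the hypothetical transformer sketched earlier in the thread.
pipe = make_pipeline(DtypeConverter(), StandardScaler(), LogisticRegression())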

But I agree it'd also be nice to have a way that doesn't change the pipeline and still gives users' estimators the input they need.

adrinjalali · Jul 18 '22 12:07

Okay, I think we can agree on having both options (which should share most of the important code anyway). Just another reason for not modifying the pipeline: say I create the model for other purposes as well, e.g. to serve it by other means; then I don't want that extra step in the pipeline, so I would have to create two model artifacts that are 99% identical.

BenjaminBossan · Jul 18 '22 14:07

@adrinjalali Is this issue still something that needs to be done? I'm just asking since all the comments are from over a year ago. If so, just so I'm clear, the direction we want to move in is to have an sklearn transformer that automatically makes the dtypes the same as they were during training, right?

lazarust · Aug 26 '23 00:08

@lazarust we haven't had many requests for this feature since then. So I'd say we can let it be for now and only get back to it when we see more demand for it.

adrinjalali · Aug 29 '23 13:08