skops Have attributes of training dataset in the repository

The widget is cool and everything, but it's hard to see all the unique values of categorical variables, which variables are categorical, or the range of continuous columns. A couple of possible solutions:

Have attributes in the config or README file
Have these in a separate file

Ping @skops-dev/maintainers

Jan 16 '23 17:01 merveenoyan

I agree it would be useful to have this information.

Some questions I would have:

How would this information be collected? I don't think it's feasible to automatically derive it from the training data. Even if it's a pandas df, there is still room for ambiguity. Therefore, it sounds like the user would have to indicate the information.
What are all the different types that can exist? Categorical, ordinal, cardinal. How about time (at what resolution)? Text? Images? I don't think there is an agreed-upon standard for all feature types.
Is there a standard for how to represent these types? It would be good if we didn't have to invent something new.

Of course, we don't have to have everything right from the start, but we should have an idea of what this addition would entail. And to me, it looks like it's far from trivial.

Jan 17 '23 10:01 BenjaminBossan

I think it'd make sense to have this in the README as part of the model card. We could have some method that generates as much info as we can from a given input dataframe, for example.
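As a rough illustration of what such a method might look like, here is a minimal sketch using pandas. The helper name `describe_columns` and the output keys (`kind`, `min`, `max`, `values`) are made up for this example and are not part of skops. The dtype-based guess is exactly the kind of ambiguous inference raised above, so a user would probably still need a way to override it.

```python
import json

import pandas as pd
from pandas.api.types import is_numeric_dtype


def describe_columns(df: pd.DataFrame) -> dict:
    """Hypothetical helper: guess a coarse description of each column."""
    info = {}
    for name in df.columns:
        col = df[name].dropna()
        if is_numeric_dtype(col):
            # Assumption: numeric dtype means continuous. Integer-coded
            # categories would be misclassified by this rule.
            info[name] = {
                "kind": "continuous",
                "min": float(col.min()),
                "max": float(col.max()),
            }
        else:
            # Everything else is treated as categorical; values are
            # stored as strings so the result is JSON-serializable.
            info[name] = {
                "kind": "categorical",
                "values": sorted(str(v) for v in col.unique()),
            }
    return info


df = pd.DataFrame({"age": [22, 38, 26], "embarked": ["S", "C", "S"]})
data_info = describe_columns(df)
print(json.dumps(data_info, indent=2))
```

The resulting dict could be rendered into the model card or dumped to a file, depending on where the information ends up living.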

Jan 19 '23 16:01 adrinjalali

I think the reason why Merve wanted to have them in the config.json or a separate file is that this information could be used to improve the UI on the Hub. E.g. in the inference widget, if we know the distinct values of a categorical feature, the widget could let the user choose the value from a list. If this information is added to the README, it would be more difficult to extract.

Jan 20 '23 10:01 BenjaminBossan

I see. For that, I'm happy for it to be in a data-info.yml/json kinda file. We probably don't want to make the config file too large, I guess?

Jan 20 '23 13:01 adrinjalali

@adrinjalali I agree.

Jan 20 '23 16:01 merveenoyan

@merveenoyan I'm happy to take this if it still needs to be done!

Sep 03 '23 01:09 lazarust

@BenjaminBossan I'm happy to take this one but had a few thoughts/questions:

When should the file be generated?
Is there a list of data types that we want to support initially? You mentioned a couple above and I agree it would be pretty hard to have all of them since there isn't an agreed-upon standard.

Sep 08 '23 00:09 lazarust

Thanks for taking an interest in the issue. I think there is no definite answer to your questions. The initial motivation is to know in advance what options exist for categorical data to improve the widget, but I think Adrin made a good point about file size, which can easily get large if we just record all distinct values, so some kind of compromise would need to be found.
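One possible compromise, sketched here purely as an illustration: keep only the most frequent values up to a cap and flag when the list was cut off. The function name `capped_values`, the `max_values` default, and the `truncated` key are all hypothetical, not an agreed design.

```python
from collections import Counter


def capped_values(values, max_values=20):
    """Hypothetical helper: keep at most `max_values` distinct values."""
    # Count each distinct non-missing value.
    counts = Counter(v for v in values if v is not None)
    # Keep the most frequent values so the widget can still offer
    # the common choices.
    kept = [value for value, _ in counts.most_common(max_values)]
    # Record whether anything was dropped, so a consumer knows the
    # list is not exhaustive.
    return {"values": kept, "truncated": len(counts) > max_values}


result = capped_values(["a", "b", "a", "c"], max_values=2)
print(result)
```

With a flag like this, a widget could fall back to a free-text input whenever the list is marked as truncated.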

Also, for this feature to make sense, we would need to do work on the widget side as well, for which there is currently no capacity AFAIK, so I would rather not work on this feature right now.

Sep 08 '23 09:09 BenjaminBossan

@BenjaminBossan Sounds good! Is there another issue I could help out with?

Sep 08 '23 14:09 lazarust

If this is something you're willing to jump into, I think we have some room to improve the skops.io persistence format. For instance, support for more external libraries could be added, like scikeras (#388) or skorch :)

Sep 08 '23 14:09 BenjaminBossan
