[Feature request] Add a possibility to persist artifacts besides the model itself

Open benjamin-work opened this issue 7 years ago • 6 comments

At the moment, only the model can be persisted and loaded. However, there are scenarios that necessitate saving and loading additional data.

E.g., assume that we have a regression problem. We want to normalize the targets to a certain range during training but when calling the predict service, data should be mapped back to the original range. Touching the targets is not part of an sklearn pipeline, so we may do it during data loading. However, when we start the prediction service, we need to have access to the mapping. Currently, we would have to load the data again to generate the mapping, or try to save the mapping as an attribute of the model.

Ideally, we would be able to just save and load the mapping using palladium tools. The solution should not be too specific to the example above, but be a more general solution to how to persist additional artifacts.

benjamin-work, Aug 29 '17 12:08

Another way to deal with this is to move the normalization into a model wrapper (or "meta-estimator" in scikit-learn). A NormalizeTarget wrapper would normalize on the way in and out. The model is somewhat more self-contained this way, which may be good regardless.
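A minimal sketch of the wrapper idea, for illustration only. The `NormalizeTarget` class body and the `MeanRegressor` stand-in are hypothetical and written in plain Python so the snippet runs without scikit-learn; a real meta-estimator would normally build on scikit-learn's base classes.

```python
class NormalizeTarget:
    """Hypothetical wrapper: scale targets to [0, 1] for fitting and
    map predictions back to the original range."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Remember the original target range; it travels with the wrapper.
        self.y_min_ = min(y)
        span = max(y) - self.y_min_
        self.y_span_ = span if span else 1.0  # avoid division by zero
        y_scaled = [(value - self.y_min_) / self.y_span_ for value in y]
        self.estimator.fit(X, y_scaled)
        return self

    def predict(self, X):
        # Undo the scaling on the way out.
        scaled = self.estimator.predict(X)
        return [value * self.y_span_ + self.y_min_ for value in scaled]


class MeanRegressor:
    """Stand-in for a real estimator: always predicts the mean target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


model = NormalizeTarget(MeanRegressor()).fit([[0], [1], [2]], [10.0, 20.0, 30.0])
predictions = model.predict([[5]])
```

Because the range is stored on the wrapper, pickling the wrapper persists the mapping together with the model. Newer scikit-learn releases also provide `sklearn.compose.TransformedTargetRegressor`, which covers the same use case.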

dnouri, Aug 29 '17 13:08

Yes, for this specific case, that would work. For other cases, that could be an awkward solution. I could imagine that a more general solution would have a "cache" that is just stored together with the model, so that there is no need for handling separate files.

benjamin-work, Aug 29 '17 14:08

There's this utility called `palladium.interfaces.annotate` which is used by Palladium to store the model version along with the model pickle. It's a glorified way of sticking an attribute onto the object before it's pickled.

To stick something in you would call `annotate(model, {'useful': 'stuffs'})`, and to get it out again (say in production, after loading): `stuffs = annotate(model)['useful']`.
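The round trip can be sketched as follows. The snippet deliberately does not import Palladium: `annotate_sketch` is a hypothetical stand-in that imitates the behaviour described above (an attribute stuck onto the object before pickling), and the real `palladium.interfaces.annotate` may differ in its details. `SimpleNamespace` plays the role of a trained model.

```python
import pickle
from types import SimpleNamespace


def annotate_sketch(obj, metadata=None):
    """Hypothetical stand-in for the helper described above.

    Merges `metadata` into a dict stored on the object and returns
    the resulting annotations.
    """
    annotations = dict(getattr(obj, "__metadata__", {}))
    if metadata is not None:
        annotations.update(metadata)
        obj.__metadata__ = annotations
    return annotations


# SimpleNamespace stands in for a trained model object.
model = SimpleNamespace(coef=[0.5, 1.5])

# Training side: attach the extra data, then pickle as usual.
annotate_sketch(model, {"useful": "stuffs"})
blob = pickle.dumps(model)

# Production side: unpickle and read the extra data back.
restored = pickle.loads(blob)
stuffs = annotate_sketch(restored)["useful"]
```

The annotation survives because it is an ordinary attribute of the pickled object, so no separate file is involved.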

dnouri, Aug 29 '17 14:08

Okay, so you would suggest to use this if extra data needs to be saved?

benjamin-work, Aug 29 '17 15:08

> Okay, so you would suggest to use this if extra data needs to be saved?

Hmm, just had another look and it seems that at least `palladium.persistence.Database` assumes it can call `json.dumps` on the annotations. (It then stores the annotations in a separate column.) So this won't work for all types of data.
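To make the limitation concrete, here is a small standard-library check of which annotation values survive `json.dumps`. The helper name `is_json_serializable` and the sample dictionaries are made up for illustration; nothing here touches Palladium itself.

```python
import json


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False otherwise."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


# Plain numbers and strings are fine as annotations.
plain = {"target_min": 10.0, "target_max": 30.0}

# Sets (like bytes or arbitrary objects) are rejected by json.dumps.
with_set = {"seen_labels": {"a", "b"}}

plain_ok = is_json_serializable(plain)
with_set_ok = is_json_serializable(with_set)
```

A simple min/max mapping would therefore fit into JSON-backed annotations, while richer objects such as a fitted transformer would not.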

Which leaves us with what you already did, I assume, which is sticking attributes on the model object. Not too nice, but probably nicer than having to worry about storing extra data somewhere else and having to support that in all persisters.

If you prefer to use something like `annotate`, then we could make a trivial change and add other keys, besides `__metadata__`, to `annotate`. (`__metadata__` is what it's trying to be clever about when persisting.)

dnouri, Aug 29 '17 15:08

But isn't the model just a blob? Instead of persisting the model, could we not persist something like `{'model': model, 'cache': cache}`? That way, we don't need to store something extra and worry about keeping the model and the extra data in sync.
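A rough sketch of that bundling idea using only `pickle`. The helpers `dump_bundle` and `load_bundle` are hypothetical names and not part of Palladium; how its persisters would need to change to unwrap such a bundle is left open here.

```python
import io
import pickle
from types import SimpleNamespace


def dump_bundle(model, cache, fileobj):
    """Pickle the model and its extra data as a single blob."""
    pickle.dump({"model": model, "cache": cache}, fileobj)


def load_bundle(fileobj):
    """Load the blob and hand back the model and cache separately."""
    bundle = pickle.load(fileobj)
    return bundle["model"], bundle["cache"]


# SimpleNamespace stands in for a trained model object.
model = SimpleNamespace(coef=[0.5, 1.5])
cache = {"target_range": (10.0, 30.0)}

# An in-memory buffer stands in for a file or database column.
buffer = io.BytesIO()
dump_bundle(model, cache, buffer)
buffer.seek(0)
loaded_model, loaded_cache = load_bundle(buffer)
```

Since both parts are written in one `pickle.dump` call, they cannot drift out of sync, and unlike JSON-backed annotations the cache can hold any picklable object.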

benjamin-work, Aug 29 '17 15:08