
Safe persistence solutions

Open BenjaminBossan opened this issue 3 years ago • 12 comments

Let's discuss the different options and their pros and cons.

Problem

sklearn models are typically persisted as pickle files. This can be done directly through pickle.dump or indirectly through joblib.dump (though there seem to be few advantages of using joblib over pure pickle these days). No other formats are officially supported by sklearn. Since loading pickle files can result in arbitrary Python code being executed, it is inherently dangerous to load pickle files from untrusted sources. This in turn can prevent adoption of sklearn models on sharing platforms such as HF hub.
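
To make the risk concrete, here is the classic (generic, not sklearn-specific) demonstration of why unpickling untrusted data is dangerous:

import os
import pickle

# An object's __reduce__ method tells pickle how to reconstruct it.
# A malicious pickle abuses this to run an arbitrary callable on load.
class Evil:
    def __reduce__(self):
        return (os.system, ("echo arbitrary code executed on load",))

payload = pickle.dumps(Evil())
pickle.loads(payload)  # runs the shell command during deserialization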

(All of this is part of a bigger problem of safe software "supply chains", e.g. when malicious packages are uploaded to pypi.)

Below I list a few possible solutions. I deliberately didn't include any deeper discussion of them, since there has probably been a lot of discussion in the past which you can hopefully share.

Possible solutions

Vetting

Still use pickle but introduce some kind of process that ensures the safety of the file. E.g. tying the file to specific code which has been approved by some authority.

ONNX

Use ONNX (see here for sklearn and here for more general ML frameworks) to persist and serve sklearn models.
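
For a fitted model, the conversion is short with skl2onnx's documented to_onnx helper (a minimal sketch):

import numpy as np
from sklearn.linear_model import LogisticRegression
from skl2onnx import to_onnx

X = np.random.rand(20, 4).astype(np.float32)
y = (X.sum(axis=1) > 2).astype(int)
model = LogisticRegression().fit(X, y)

# infer the ONNX input signature from a sample batch
onx = to_onnx(model, X[:1])
with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())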

Custom sklearn storage format

Sklearn could theoretically support a custom storage format of its own, as, say, Keras does.

BenjaminBossan avatar Jul 14 '22 10:07 BenjaminBossan

cc @McPatate since he has been working on scanning our existing pickle files on the hub for us to better understand what people are putting in those files. But it's far from a trivial task.

This also has implications for our backend and these internal issues touch on that: https://github.com/huggingface/api-inference/issues/689 , https://github.com/huggingface/api-inference/issues/453

There are issues with relying on ONNX at this stage:

  • One cannot convert an ONNX model back to a sklearn model.
  • ONNX supports only a subset of sklearn models, and it's not clear how one can add a custom estimator to the supported set, which means we couldn't support third-party estimators

Our conclusion has been that we might need a custom persistence format which wouldn't necessarily support everything, but would be easy to start with and easily extensible.
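
To make that concrete, here is a very rough sketch of what such a format could record (illustrative only, not the format skops ended up implementing): the estimator's class path, its constructor params, and the fitted numpy arrays as plain data, with nothing executable in the file.

import json
import numpy as np
from sklearn.linear_model import LogisticRegression

def to_state(est):
    # decompose an estimator into JSON-serializable pieces
    state = {
        "class": f"{type(est).__module__}.{type(est).__name__}",
        "params": est.get_params(),
        "arrays": {},
    }
    for attr, value in vars(est).items():
        if isinstance(value, np.ndarray):
            state["arrays"][attr] = value.tolist()
    return state

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
print(json.dumps(to_state(model), indent=2))

Loading would resolve the "class" entry against an allowlist of known estimators and reverse the process, which is what would make such a scheme extensible.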

There's also an issue on sklearn: https://github.com/scikit-learn/scikit-learn/issues/22759 which touches on the same issues.

From my conversations with people, it seems the way forward for now is for us to experiment with a format in this library, and if it turns out to be reliable, we can then talk about maybe moving it upstream into sklearn.

@Narsil has also been interested in this topic. @koaning has also done some work in this space: https://github.com/koaning/icepickle

adrinjalali avatar Jul 18 '22 11:07 adrinjalali

For enabling more safety on pickle:

  • It's not possible to guarantee safety with pickle (being safe is not the module's intention, so we cannot expect anything from Python on that front).
  • We can make it much safer by allowing only specific ops (read: the Python functions that get called when loading the file). With a blacklist (typically of import and eval) we can make unpickling safer, but any installed package that re-exposes the same functionality becomes a vector of attack. With a whitelist we can make sure the whitelisted functions don't re-expose the unsafeness of eval and import (see the sketch after this list); but a whitelist will not be future proof, and it might become harder to save/export custom-made models with custom code. None of these are inherently safe (we might have gotten something wrong, and Python C code might still be buggy), but they are still a few orders of magnitude safer than unchecked pickling.
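
A minimal sketch of the whitelist idea, along the lines of the restricted unpickler shown in Python's own pickle documentation (the allowed names below are purely illustrative):

import io
import pickle

# Only these (module, name) pairs may be resolved during unpickling;
# everything else is rejected. The entries are just examples.
ALLOWED_GLOBALS = {
    ("numpy", "ndarray"),
    ("numpy.core.multiarray", "_reconstruct"),
    ("sklearn.linear_model._base", "LinearRegression"),
}

class WhitelistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"blocked global during unpickling: {module}.{name}"
        )

def safe_loads(data: bytes):
    return WhitelistUnpickler(io.BytesIO(data)).load()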

I don't know enough about sklearn to discuss other formats in depth.

Narsil avatar Jul 18 '22 12:07 Narsil

@Narsil has "PoC-ed" (if I may) a custom format of his own: https://github.com/Narsil/safetensors/

On the topic of vetting pickles, I feel that what I found could guarantee safe unpickling. That being said, that's in theory, and I know that in practice hackers are very creative; there's probably something I'm missing (cf. Narsil's answer!). At the very least, the features we propose with our file scanner on the hub make it harder for people to exploit unpickling.

Context

I have managed to implement a piece of code that reads pickle opcodes and extracts imports (a rough sketch follows below). The opcodes are read sequentially by the unpickler, and each opcode triggers a given action. Functions and imports are what is dangerous during unpickling.

To give an example, I'll write some Python code mimicking what happens when unpickling:

# the unpickler maintains a stack; REDUCE calls a function with arguments
opcodes_stack = [myfunc, "malicious argument", "REDUCE"]
opcode = opcodes_stack.pop()
if opcode == "REDUCE":
    arg = opcodes_stack.pop()
    callable = opcodes_stack.pop()
    # the callable runs with attacker-controlled arguments
    opcodes_stack.append(callable(arg))
# ...

Now, the function cannot be defined in the pickled file itself, as only references to functions can be serialised (meaning they have to be in scope during unpickling); also, lambdas cannot be pickled. The reasons why imports should be monitored are the following:

  • when importing a python module, arbitrary code can be executed
  • you can import builtin functions like eval or exec, which can be used to execute arbitrary code
  • when instantiating an object, the constructor is called
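
A rough sketch of such an import scanner, built on the standard pickletools module (my simplification, not the actual hub scanner):

import pickletools

def list_imports(data: bytes):
    # Walk the opcode stream and collect the globals the pickle would
    # import on load. STACK_GLOBAL handling is simplified here: it
    # actually resolves module/name from the two most recently pushed
    # strings on the stack.
    imports = []
    strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("GLOBAL", "INST"):
            module, name = arg.split(" ", 1)
            imports.append((module, name))
        elif "UNICODE" in opcode.name or "STRING" in opcode.name:
            strings.append(arg)
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            imports.append((strings[-2], strings[-1]))
    return imports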

McPatate avatar Jul 18 '22 12:07 McPatate

I looked at icepickle, and from what I see it saves the coefficients, and every time you want to load the model it initializes the same type of model with the loaded coefficients. It's really neat, and it's something we can build on for the Hub: e.g. store the type of the model in the model card (maybe in the metadata for convenience?), store the coefficients in a separate file, and not actually put the pickled model itself on the Hub. When someone calls the repo, we can initialize the model with the weights under the hood by reading the type spec from the model card and finding the coefficients file, instead of letting the user deal with it; saving would work the same way. (kudos @koaning)

It only works for linear models though:

from sklearn.linear_model import (
    SGDClassifier,
    SGDRegressor,
    LinearRegression,
    LogisticRegression,
    PassiveAggressiveClassifier,
    PassiveAggressiveRegressor,
)
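
A hand-rolled sketch of the same idea (illustrative only, not icepickle's actual API): persist nothing but the learned arrays, then rebuild a fresh estimator on load.

import numpy as np
from sklearn.linear_model import LogisticRegression

def save_coefs(model, path):
    # only plain numpy arrays are written; no code is serialized
    np.savez(path, coef=model.coef_, intercept=model.intercept_,
             classes=model.classes_)

def load_coefs(path):
    # re-create the estimator and attach the fitted state
    data = np.load(path)
    model = LogisticRegression()
    model.coef_ = data["coef"]
    model.intercept_ = data["intercept"]
    model.classes_ = data["classes"]
    return model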

merveenoyan avatar Jul 18 '22 12:07 merveenoyan

I believe we should also consider what we want to achieve in the end.

We can make it much safer

This would probably help when it comes to "accidental" insecurity, e.g. when someone puts an eval into their code without regard to security concerns. But attackers will exploit any backdoor we overlook, so "safer" would not be good enough there; the persistence format would have to be safe by design. For users working in organizations with strict security requirements (say, healthcare, banking, etc.), safer would probably not suffice.

There are issues with relying on ONNX at this stage:

  • One cannot convert an ONNX model back to a sklearn model.
  • ONNX supports only a subset of sklearn models, and it's not clear how one can add a custom estimator to the supported set, which means we couldn't support third-party estimators

These are some valid concerns. However, I can imagine that whatever road we take, we will start out with only supporting a subset of sklearn. Regarding custom estimators, I believe the situation will improve in the future. Maybe there will even be a possibility to leverage PyTorch's scripting/tracing or something along those lines? So despite the shortcomings of ONNX (especially not being able to convert back to sklearn), it could still be a partial solution (but never the only one).

Mid to long term, my preference would be something supported on the sklearn side, but of course that's a huge undertaking. Going the icepickle road could be a transitional step. I wonder if it would be possible for sklearn to define an API/protocol for this persistence solution, e.g. as mentioned by Adrin in https://github.com/scikit-learn/scikit-learn/issues/22759, even without providing any implementation. That way, external libraries could start implementing against that API.

As that issue also mentions, one difficulty would be custom code, e.g. in FunctionTransformer. However, it shouldn't be too hard to have a safe FunctionTransformer if the function is already defined, e.g., FunctionTransformer(func=np.sqrt), right? Of course that's only a subset, but perhaps not such a small subset.
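
For instance (an illustrative sketch, not an actual skops mechanism), the function could be persisted as a plain "module:name" string and resolved against an allowlist on load:

import importlib
from sklearn.preprocessing import FunctionTransformer

# only functions on this allowlist may be restored; entries are examples
ALLOWED_FUNCS = {"numpy:sqrt", "numpy:log1p"}

def resolve_func(ref: str):
    if ref not in ALLOWED_FUNCS:
        raise ValueError(f"function {ref!r} is not on the allowlist")
    module, name = ref.split(":")
    return getattr(importlib.import_module(module), name)

# the persisted representation is just the string "numpy:sqrt"
ft = FunctionTransformer(func=resolve_func("numpy:sqrt"))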

BenjaminBossan avatar Jul 18 '22 14:07 BenjaminBossan

@merveenoyan note that these linear models work very well with stateless featurizers.

One thing that icepickle won't solve though: stateful featurizers. So very fancy pipelines won't be saved with icepickle.

koaning avatar Jul 18 '22 14:07 koaning

That said, sklearn-onnx is honestly quite awesome, but it doesn't support every model out there. In particular, it doesn't allow for the partial_fit trick, and it also misses some featurizers.

koaning avatar Jul 18 '22 14:07 koaning

What about providing a checksum on a pickle file? Still won't solve many of the issues. But you could at least implement a loader that forces the user to provide a checksum.

koaning avatar Jul 18 '22 14:07 koaning

What about providing a checksum on a pickle file? Still won't solve many of the issues. But you could at least implement a loader that forces the user to provide a checksum.

I like that idea, similar to what some packaging solutions provide. Maybe it could be enforced via a (potential) protocol?

BenjaminBossan avatar Jul 18 '22 14:07 BenjaminBossan

I don't know about protocols, but something like this should suffice:

from skops.io import load_pickle_safely

load_pickle_safely("just/a/filepath.pickle")
# ValueError("Please add an md5 hash that you verified upfront.")
load_pickle_safely("just/a/filepath.pickle", md5="a")
# ValueError("Alarm! The md5 hash does not match!")

koaning avatar Jul 18 '22 14:07 koaning

I guess you could make it even fancier by having a (Huggingface) backend also provide an endpoint that can verify the md5 hash for the pickle file from a specific org/modelname/version number. But the base behavior can just be a simple md5 check.

koaning avatar Jul 18 '22 14:07 koaning

If the attack vector implies that the attacker has access to both the file and the checksum, then checksumming is useless.

Checksumming only protects against man-in-the-middle attacks, and only when the checksum itself cannot be man-in-the-middled.

Not saying it has no merit, but it's definitely not the attack vector I have in mind when talking about pickle attacks.

Narsil avatar Jul 18 '22 17:07 Narsil

Now that we have our persistence solution, and that we'll be working on ONNX, I think this issue can be closed and we can track progress on more detailed issues. (Feel free to re-open if I'm missing something.)

adrinjalali avatar Nov 29 '22 15:11 adrinjalali