skops
skops copied to clipboard
Secure persistence: add extensibility
A mechanism should be implemented that allows library authors to add secure persistence to their library, and for users to opt those libraries in.
Thinking more about this, what do you think of supporting only __getstate__
/__setstate__
, and allowing certain types as we do now, like lists, dicts, function, etc.
This would mean we don't need to do much on our side, and they only need to implement a sensible __getstate__
/__setstate__
pair.
If we manage to move all sklearn models to the same path, then we can completely remove __reduce__
.
That sounds good, we can start with __getstate__
& __setstate__
and hopefully that's good enough. If it isn't, we can later think about providing more. If we can eventually stop supporting __reduce__
, it should really lower our burden as well.
Still, the user should have to opt in explicitly. Do you already have an API in mind?
I'm not sure what you mean by user in this context. I'm thinking the save
would automatically work if the object of third party libraries provide these methods, and during load, we'd fail if we don't know the library, and user can say they trust the source kinda thing.
I'm still not sure if we should load simple objects if the user has the corresponding library installed or not, if they have it installed, they trust it?
and during load, we'd fail if we don't know the library, and user can say they trust the source kinda thing.
That's what I meant. How would the API for this look like?
I'm still not sure if we should load simple objects if the user has the corresponding library installed or not, if they have it installed, they trust it?
Previously, I would have said that if you installed a malicious library, you're probably already compromised, so there is nothing to be gained from not trusting it at this point. But IIUC, Python is moving in a direction of not allowing arbitrary code to be run during installation (avoiding setup.py
), so just installing a malicious package wouldn't mean you're compromised. But for the time being, there is of course still a lot of code that relies on setup.py
that needs to be supported and I'm not sure if that's planned to be deprecated. (This is all from memory, maybe I get some details wrong.)
Packages can run arbitrary code through .pth
files. We don't need setup.py
to run arbitrary code :D
discussions about removing .pth
files: https://bugs.python.org/issue33944
rejected PEP about an alternative mechanism: https://peps.python.org/pep-0648/
While saving models, we save everything, it's only while loading that security issues come in play. Therefore I think simply relying on __getstate__
/__setstate__
is enough, and there doesn't need to be an extra user facing API for it.
Packages can run arbitrary code through
.pth
files. We don't needsetup.py
to run arbitrary code :D
Oh man, it never stops, does it? Okay, so we can basically assume that if a malicious package is installed, the user is already compromised.
While saving models, we save everything, it's only while loading that security issues come in play. Therefore I think simply relying on
__getstate__
/__setstate__
is enough, and there doesn't need to be an extra user facing API for it.
Okay, let's do it this way and hopefully it'll be sufficient.