skops
skops copied to clipboard
[WIP] Persist: Add auditing feature
Addresses #138
Description
The goal of this PR is to add an auditing feature which is used while an object is being loaded. Before the actual object is loaded, the state is inspected and some audits are performed. If at least one audit fails, by default, the loading process is interrupted and the user is informed about what went wrong.
Functionally, this does more or less the same as Adrin's work on https://github.com/adrinjalali/skops/tree/audit only with much more code ;)
Implementation
The reason why much more code is used here is that I aimed at making the auditing extensible. This way, it would be possible in the future for each user to add their own auditing functions on top of the ones used in skops if they require additional security.
Also, in contrast to Adrin's branch, this implementation does not traverse the nodes of the state recursively. Instead, each get_instance
function is decorated by an auditing object which is responsible for calling each audit function before the respective object (or sub-attribute) is being loaded.
This is better than traversing the nodes because here, we can re-use the node traversing that's already implemented, making absolutely sure that no node is being missed.
At the moment, the actual auditing functions are very rough and do not actually help much with security. This is okay, as this PR should serve to discuss the general implementation. Implementing the actual auditing functions can be done later.
Furthermore, we have "false positives", e.g. certain sklearn functions which are marked as insecure. Therefore, our current test suite would fail with the given auditing functions. However, I configured the tests so as to not raise an error when encountering an insecure node, instead just logging the results.
I will also add more unit tests (there are currently only 2 very rudimentary ones) if and when we decide to go ahead with this implementation.
RFC @skops-dev/maintainers
It seems to me this is calling the same audit method for all state types.
What I was thinking, was to pass a state to load functions and let them to the audit, since the audit is specific to those methods, not something general. This would also significantly reduce the diff. It can be extended by users and libs through what we provide they're able to pass/modify the "load_state" object we'd be passing the load functions.
However, the issue with my approach, or yours, is that we'd fail at the first instance of finding something insecure.
From the users' perspective, it'd be much nicer for us to scan somehow, and tell them what's unsafe, and ask if they're okay with it, instead of failing one by one, which can be frustrating for them.
It seems to me this is calling the same audit method for all state types.
Yes, but of course it's easy to implement an audit function that checks the type and only audits if it is responsible for that type. I see that there is some use in having it inside the get_instance
methods for locality reasons, but that also means that those methods could, over time, grow quite big because they do multiple things. In fact, it would already be possible to do checks inside get_instance
, right? Do they need any information on top of the state
for that?
Edit: Maybe we could keep the proposed setup but make it easy to register audit functions for specific types? Something like audit_chain.register(my_audit_func, module="numpy", cls="ndarray")
. End Edit
Regardless of that, I believe that even if we had this, we should also have some general checks which we don't want each get_instance
method to repeat.
they're able to pass/modify the "load_state" object we'd be passing the load functions.
I wonder if that could reduce security?
However, the issue with my approach, or yours, is that we'd fail at the first instance of finding something insecure.
I don't think that's the case, unless I misunderstand. Take the test that I added:
def test_insecure_function(self, audit_logs, tmp_path):
estimator = FunctionTransformer(dummy)
save_load_round(estimator, tmp_path / "file.skops")
expected = (
"The following 2 security violations have been found: Loading function"
" 'dummy' of module 'test_persist' is considered insecure. Untrusted"
" module 'test_persist' found"
)
assert len(audit_logs) == 1
assert audit_logs[0] == expected
If we change this line:
- estimator = FunctionTransformer(dummy)
+ estimator = FunctionTransformer(dummy, inverse_func=dummy)
we get:
- assert len(audit_logs) == 1
+ assert len(audit_logs) == 2
Is that what you mean?
The audit message is the same in both cases. Maybe we can improve it by indicating what attribute exactly causes the issue (func
and inverse_func
in that case).
I think this approach is over-complicating what we need to do. It might be very extensible, but it reminds me of the comic where the requirement is a bike and engineers deliver a truck 😁
on the auditing part, the issue is in this design we're actually loading objects and creating instances no matter what the result of the audit is. We probably shouldn't be doing that. Let me think more about it.
I think this approach is over-complicating what we need to do. It might be very extensible, but it reminds me of the comic where the requirement is a bike and engineers deliver a truck
I'm a fan of YAGNI but also of "don't dig yourself into a whole you can't get out of". So if I can anticipate a feature requirement, I'll try to make sure it can be accommodated with the existing design. That said, I'm very open to different suggestions, that's why I only did a draft.
on the auditing part, the issue is in this design we're actually loading objects and creating instances no matter what the result of the audit is. We probably shouldn't be doing that. Let me think more about it.
I guess it's half true, we audit each node and load it if auditing is successful. We do not load any node that fails the audit. However, we do not check each node first and only then start loading then, so it could happen that we load 99% of an object and fail on the last 1%.
Probably, if we don't want to do two passes through the tree, this cannot be avoided. If we decide to do two passes, we have to be extra careful that the traversing logic is 100% identical.
Closing in favor of #204