skops
skops copied to clipboard
Add Method type in dispatch calls
Fixes https://github.com/skops-dev/skops/issues/184
PR which adds MethodType and related functions to the skops.io dispatch functions.
I can expand on the test if more cases/checks are needed, but for now this shows the functionality works as expected.
Well done figuring out where the error occurs.
Before we continue, I really want to understand what is going on here and if the suggested solution is the best one.
So first, it's true that we don't have any special case for objects whose attributes are methods bound to another object. E.g. this fails for now:
class Foo(BaseEstimator):
def fit(self, X, y=None, **fit_params):
self.bound_method_ = FunctionTransformer().transform
return self
def transform(self, X):
return self.bound_method_(X)
def test_bound_method(tmp_path):
estimator = Foo().fit(None)
f_name = tmp_path / "file.skops"
loaded = save_load_round(estimator, f_name)
gives:
def object_get_instance(state, src):
if state.get("is_json", False):
return json.loads(state["content"])
cls = gettype(state)
> instance = cls.__new__(cls)
E TypeError: method expected 2 arguments, got 0
I think we should add a test along these lines to our test suite.
How does this relate to the original issue? From what I can tell, the problem stems from these lines:
https://github.com/scipy/scipy/blob/1023d9207fdc1430a8ba196f1a1616ac3c264acf/scipy/stats/_distn_infrastructure.py#L3183-L3186
The vecentropy attribute is the one causing the issue. It is a numpy.vectorize of a bound method, but it's not actually bound to another instance, it's bound to itself! Still, it is recognized as a bound method, hence dispatches to the function that @E-Aho implemented.
What is strange to me is that the same issue does not apply to self._cdfvec, even though it's treated the same as vecentropy.
Another thing that surprised me is the following: I removed the whole vecentropy object from the state (I did this by going inside of dict_get_state and if the key is "vecentropy", I just skip it). I would have thought that this cannot work, but it does, the test passes without requiring method_get_state/instance, and the vecentropy attribute exists after loading.
I dug into this and I think it's because _attach_methods is called during __setstate__:
https://github.com/scipy/scipy/blob/1023d9207fdc1430a8ba196f1a1616ac3c264acf/scipy/stats/_distn_infrastructure.py#L636
If that's true, we should be able to drop those methods entirely from the state and let scipy restore them during __setstate__. WDYT?
Regardless of that, as mentioned at the top, we should have a way to save and load methods bound to other instances.
I also wondered whether we can add a special case for the numpy.vectorize class, similar to what we have with partial. Maybe that can solve the initial problem, but it wouldn't solve the general bound method issue.
If that's true, we should be able to drop those methods entirely from the state and let scipy restore them during setstate. WDYT?
I fully agree, I ran into this as well while testing, but with "dummy functions" that returned nothing, a few other tests that checked on the dispatch functions started failing too.
I think in this case, it could definitely be solved by dropping these methods and adding a special condition to check for this during loading.
However, in a general case, I think there still needs to be a solution for bound methods more generally, which I think would be along the lines of this PR. E.g:
class MethodHolder:
def __init__(self, variant):
if variant == "sqrt":
self._method = np.sqrt
elif variant == "log":
self._method = np.log
else:
self._method = np.exp
def func(self, x):
return self._method(x)
a = MethodHolder("log")
b = a.func
I feel like something along these lines isn't unlikely to exist somewhere in SKLearn/SciPy (given we've already found one that happens to get bounded during init), and it would probably be a good idea to provide support for serializing it correctly, which shouldn't be too hard to do.
I think there still needs to be a solution for bound methods more generally, which I think would be along the lines of this PR
:+1: This is exactly what I meant when I wrote
we should have a way to save and load methods bound to other instances
We have to be precise with the wording here, we already persist objects with bound methods without issue, the problem are attributes that are methods bound to another object than the one we're persisting.
Once we have this general issue solved, if it doesn't solve the scipy issue, we can think about a more specific solution there.
We have to be precise with the wording here, we already persist objects with bound methods without issue, the problem are attributes that are methods bound to another object than the one we're persisting.
Ah sorry, you're entirely correct. I neglected it in my comment, but I was meaning to show that if you used that b in a different object or function, then serialized that, that would be where this bug exists (to my understanding).
e.g if we had
a = MethodHolder("log")
b = a.func
c = SomeOtherObject(func=b)
And tried to serialise c, this would happen I believe.
I think one other place this might occur for users is when trying to serialize something that can take a function as an input (e.g DistanceMetrics with a user defined distance).
If that function is a method bound to a different object, it wouldn't work right now AFAIK
And tried to serialise
c, this would happen I believe.
Yes, I think it would fail. Is there a qualitative difference to the other example I posted:
class Foo(BaseEstimator): def fit(self, X, y=None, **fit_params): self.bound_method_ = FunctionTransformer().transform return self def transform(self, X): return self.bound_method_(X)
No, both would error, I was trying to think through a case where you would need to support bound methods that get manipulated during class creation/use and couldn't be static
(e.g depends on the "variant" param in my example)
No, both would error, I was trying to think through a case where you would need to support bound methods that get manipulated during class creation/use and couldn't be static
Ah okay, got it, I thought maybe I was missing something.
CI runner is failing with a 404 error on uploading to CodeCov 🤔
CI runner is failing with a 404 error on uploading to CodeCov thinking
Most of the time, it's just some flaky behavior that can be fixed by rerunning the CI.
Should be ready for review when you have a moment @BenjaminBossan :) let me know if you have any thoughts!
Overall this looks good, thanks a lot for investigating and proposing a solution. I made a few comments, please have a look.
Thanks for the comments! And it was a fun problem to investigate :)
And just to be clear, the initial issue reported in #184 is not solved yet, right? When I test it, I get an error (infinite recursion). Since we changed the skope (haha) of this PR, that's fine, I just want to make sure that this is expected.
😨
I definitely changed my focus from the original issue to a more general case, but I'm surprised it hasn't addressed the original issue, given it was working with just empty function calls for method_get_state and method_get_instance...
I'll look into this and see if there's a way to address the original bug here too. I don't love the idea of merging in a bug fix if it introduces a new bug, especially one like an infinite recursion 😨
So after some digging, I think the problem with the original issue being an infinite loop is as follows:
- We try to persist
zipfas an object - We get the states of the attributes of
zipfwhich includesself.vecentropy, which is justnumpy.vectorize(self._entropy) - We try to persist the bound method,
self._entropy, which depends on the state of the original zipf object, so try to persist statezipf - goto 1)
This results in the loop.
I don't think this is a common pattern, so could add a special case for any rv_discrete objects, but I'm gonna try to see if I can think of a way to work around this for general cases of circular bound methods like this.
Great that you figured out the cause. I think for this PR, it's not necessary to solve the whole circular reference problem, that might be best left for another day or else the PR will become too big.
As for potential solutions, we might be able to use the memoization mechanism that we already added here. There is an example of its usage for numpy arrays (which I now notice could be optimized...).
Sounds good, I can raise an issue once this gets merged in to track the issue that's occuring with zipf right now 👍
Huh, looks like given the changes in #200, neither FunctionType nor MethodType ever get returned by gettype.
Huh, looks like given the changes in #200, neither FunctionType nor MethodType ever get returned by
gettype.
Ah yes, this is not because gettype was changed in any way, this is because we no longer call gettype to determine kind of get_instance function to dispatch to.
@BenjaminBossan if you have a minute could you rerun the CI?
if you have a minute could you rerun the CI?
Sure, but feel free to ignore those "Error: Codecov: Failed to properly upload" messages. For some reason, those errors occur more frequently since the past week or so, but they're unrelated to anything actually being wrong.
Oh for sure, I just wanted to check Codecov was actually happy now, the MethodType return not getting tested was making it upset
I just wanted to check Codecov was actually happy now, the MethodType return not getting tested was making it upset
For sure, but you can take any of the passing CI runs to check that, no need to wait for each one to confirm.
@BenjaminBossan, let me know if you have a moment to check the changes since your last review :)
Ok, I've reverted this back to prior to the LoadState work.
I'll open an issue once this merges in and try to keep working on a LoadState that can load single instances across any object type :)
Awesome, thanks for all the support guys!
I've reworded the xfail reason to be a bit clearer, I'm happy to merge this in if you're happy @BenjaminBossan
@E-Aho Excellent, thanks, really great job.
Will you be working on the deduplication issue?
@E-Aho Excellent, thanks, really great job.
Will you be working on the deduplication issue?
🤗 Thanks for all the support, it really was a huge help.
And yes, I'll keep working on the dupe issue, I'll raise an issue later today once I've got time, and start working on implementing a proper LoadState based on the comments in this PR.