cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[DO NOT MERGE] Make isinstance check pass for proxy ndarrays

Open Matt711 opened this issue 1 year ago • 6 comments

Description

Closes #14537. Do not merge yet, I need to run these changes on the cudf.pandas integration test suite.

Checklist

  • [ ] I am familiar with the Contributing Guidelines.
  • [ ] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

Matt711 avatar Aug 19 '24 18:08 Matt711

/ok to test

Matt711 avatar Aug 26 '24 22:08 Matt711

This PR fixes the two issues with #16286 (ie. the previous version of this PR that we reverted). The two issues were:

  1. In the __array_finalize__(self, obj), the _fsproxy_wrapped attribute is set equal to obj, which could be None. In this PR, there's a check if (obj is None): return to prevent that. I ran the cudf-pandas-integration tests, and it seems to fix the tests that were failing from the previous version of this PR. Once #16645 is merged , we'll be able to see the third-party integration tests passing in cudf.

  2. In __new__, there was an initial DtoH transfer when a cp.ndarray was passed to the proxy array class. This PR doesn't include the DtoH transfer. Instead, in __new__, we're creating a "garbage" np.ndarray with the correct shape and dtype, but with incorrect/garbage data.

The fix for 2. leads to a problem which the only way I can think of solving is via monkeypatching. The problem is that now that our proxy array passes the check isinstance(proxy, np.ndarray), there are many numpy functions (which will call NumPy C functions) that will assume that they can use the array's buffer instead instead of going through a double underscore method like __array__. This is a problem because our proxy array's buffer has garbage data, so the function will produce garbage data.

Take np.asarray for an example. It will eventually lead to some like code like this. Our proxy array passes the PyArray_Check but gives incorrect data in the PyArray_DATA call.

if (PyArray_Check(obj)) {
    void* buffer = PyArray_DATA(obj) // access the buffer directly
}

I think the only way to avoid this problem from Python is to monkeypatch every function like np.asarray. So far I've found several functions we'll need to patch: np.asarray, np.where, np.outer, np.inner, and several other functions in the np.linalg module.

There is at least one other function in a third-party library which we'll need to patch (torch.from_numpy). This function function could be patched in the same way the numpy functions are.

Numba CPU dispatched functions also access the proxy arrays buffer. The good news is that eventually numba will support compiling objects which implement __array__ (see numba/9584). This bad news is that this probably won't be ready before 24.10. So in the meantime, we would have to let the user know that using proxy arrays directly in CPU dispatched functions will return incorrect results. If there's not a way to fail loudly in this case, then it should at least be documented. FYI, we currently fail when using proxy arrays in CPU dispatched functions because we fail isinstance check. So if its possible to continue failing, that's not the worst thing to until the next Numba release.

What do you all think of the monkey patching approach?

Another idea would be pay the cost of the DtoH transfer upfront (like in the previous PR). The DtoH is only done on instance creation. And then after that, the fast-slow proxy mechanism is used. The pro here is that the buffer will be set correctly, and thus no monkeypatching. The con is obviously the DtoH transfer.

Matt711 avatar Aug 27 '24 00:08 Matt711

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Aug 27 '24 14:08 copy-pr-bot[bot]

/ok to test

Matt711 avatar Aug 27 '24 14:08 Matt711

Summary of offline discussion: monkey-patching numpy to make this work is a bridge too far without much stronger motivations, and we will probably just go with the eager D2H copy instead.

vyasr avatar Aug 27 '24 18:08 vyasr

/ok to test

Matt711 avatar Aug 28 '24 00:08 Matt711

/ok to test

Matt711 avatar Aug 29 '24 13:08 Matt711

/ok to test

Matt711 avatar Aug 29 '24 21:08 Matt711

Ping @vyasr for a review next week

Matt711 avatar Aug 31 '24 03:08 Matt711

@Matt711 The PR description says "do not merge" but there is no "DO NOT MERGE" label. Can you make this consistent?

Also for team knowledge, the "Description" section of the PR body is used in the final commit message when the PR is merged. Temporary information like the PR state or benchmarks are better to put in comments rather than the "Description" section.

bdice avatar Sep 03 '24 15:09 bdice

/ok to test

Matt711 avatar Sep 04 '24 22:09 Matt711

/okay to test

galipremsagar avatar Sep 04 '24 22:09 galipremsagar

/okay to test

galipremsagar avatar Sep 05 '24 13:09 galipremsagar

/merge

Matt711 avatar Sep 05 '24 13:09 Matt711