cudf
cudf copied to clipboard
[REVIEW] Refactor the `Buffer` class
Description
This PR introduces factory functions to create Buffer instances, which makes it possible to change the returned buffer type based on a configuration option in a follow-up PR.
Beside simplifying the code base a bit, this is motivated by the spilling work in https://github.com/rapidsai/cudf/pull/10746. We would like to introduce a new spillable Buffer class that requires minimal changes to the existing code and is only used when enabled explicitly. This way, we can introduce spilling in cuDF as an experimental feature with minimal risk to the existing code.
@shwina and I discussed the possibility to let Buffer.__new__ return different class type instances instead of using factory functions but we concluded that having Buffer() return anything other than an instance of Buffer is simply too surprising :)
Notice, this is breaking because it removes unused methods such as Buffer.copy() and Buffer.nbytes.
~~However, we still support creating a buffer directly by calling Buffer(obj). AFAIK, this is the only way Buffer is created outside of cuDF, which a github search seems to confirm.~~
This PR doesn't change the signature of Buffer.__init__() anymore.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Codecov Report
:exclamation: No coverage uploaded for pull request base (
branch-22.10@e1a4e03). Click here to learn what that means. The diff coverage isn/a.
:exclamation: Current head ff3c948 differs from pull request most recent head 19114a8. Consider uploading reports for the commit 19114a8 to get more accurate results
@@ Coverage Diff @@
## branch-22.10 #11447 +/- ##
===============================================
Coverage ? 86.48%
===============================================
Files ? 145
Lines ? 22844
Branches ? 0
===============================================
Hits ? 19756
Misses ? 3088
Partials ? 0
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
I generally like the changes in this PR. Overall I might prefer to make more of the factories classmethods so that 1) any access to internals of the Buffer class is restricted to methods, not free functions, and 2) subclasses can inherit and modify the factories as needed. Also relevant in principle (although we are dealing with Python and not C++ here) is item 23 from Scott Meyer's Effective C++: prefer free functions rather than member functions because free functions provide better encapsulation (much more directly relevant in C++ where attributes can actually be made private). As a general guideline I would use classmethods for any factory that uses internal methods or attributes (as_buffer using Buffer.__new__ is a good example). I would use free functions if all they're doing is preprocessing arguments and then forwarding to a constructor, unless we anticipate needing to customize the behavior for subclasses.
buffer_from_pointer currently looks like a good example of the latter, but I think that's misleading. It seems like the sole purpose of buffer_from_pointer is to provide an API-level statement that it Never copies any data and ptr must represents device memory, right? However, that's not really true right now since it just forwards arguments to the constructor, so the type hints are really the only thing indicating that a user shouldn't pass an arbitrary object here. It might make sense to implement this as a classmethod that directly sets the members (essentially just the pointer branch of __init__) so that there's no ambiguity where the function works for arbitrary objects in practice despite being documented otherwise.
I started writing this comment then had to step out. Looks like @wence- had similar comments!
Overall, I agree with both of you, my concern with classmethods arise when we want to switch to other Buffer types seamlessly.
For example, in #10746 we want as_bufer() to return a SpilleableBuffer when the spilling option is enabled globally (e.g. by setting CUDF_SPILLING=True).
Would it be OK if calling Buffer.as_buffer(obj) returns a SpilleableBuffer instance?
Another design issue I like to discuss is how we support Buffer sub-classes like SpilleableBuffer that will be implemented in Cython since Cython classes cannot inherent from Python classes.
I see two approaches:
- Implement
Bufferin Cython - Rename the
Bufferclass to something likeNonSpilleableBufferand then define atyping.ProtocolnamedBufferthatNonSpilleableBufferandSpilleableBuffer(and others) implements.
Initially, I thought that using a typing.Protocol would be a great solution but it has the side effect that calling Buffer() becomes illegal. A fair mount of downstream projects calls Buffer(obj) to convert obj to a buffer thus we would force them to use as_buffer() instead.
So maybe implementing Buffer in Cython is the best solution? Or is there a third option I haven't considered?
Overall, I agree with both of you, my concern with
classmethodsarise when we want to switch to other Buffer types seamlessly. For example, in #10746 we wantas_bufer()to return aSpilleableBufferwhen the spilling option is enabled globally (e.g. by settingCUDF_SPILLING=True). Would it be OK if callingBuffer.as_buffer(obj)returns aSpilleableBufferinstance?
I think this needs some thought. Do you envisage that SpillableBuffer vs Buffer will be a global configuration switch, or do you think that you'll want buffer creation to make spillable objects on a per-buffer basis?
Initially, I thought that using a
typing.Protocolwould be a great solution but it has the side effect that callingBuffer()becomes illegal. A fair mount of downstream projects callsBuffer(obj)to convertobjto a buffer thus we would force them to useas_buffer()instead.
On the whole I like interfaces over inheritance (which is what Protocol provides). In terms of migration and downstream projects, one could
- Add a deprecation warning in
Bufferpointing atas_bufferfor 22.10 (removing in 22.12) - introduce the
Protocol-based approach in 22.12
Note that this is multiply-breaking because you can't isinstance(foo, SomeProtocol) unless you ask for a @runtime_checkable protocol.
I think this needs some thought. Do you envisage that SpillableBuffer vs Buffer will be a global configuration switch, or do you think that you'll want buffer creation to make spillable objects on a per-buffer basis?
We want a global configuration switch. I hope at some point we can remove the old Buffer functionality and use SpillableBuffer exclusively.
On the whole I like interfaces over inheritance
Agree, we could also embrace this fully and introduce a DeviceBufferLike Protocol that cuDF uses in combination with as_buffer()? (maybe as_device_buffer_like() is a better name?).
Then we could keep Buffer as is for now?
Thanks all for your suggestions.
I am moving forward with the Protocol approach, which means:
- We now have a new protocol
DeviceBufferLike, whichBufferimplements. as_buffer()has been renamed toas_device_buffer_like(). It is used in cuDF when we don’t really care about the actual buffer type as long as it implementsDeviceBufferLike.- I rolled back the changes to the signature of
Buffer.__init__().
Thanks for the reviews, I think I have addressed all of them?
Thanks for the reviews, I think I have addressed all of them?
Sorry, I had some still in-flight
@shwina What do think of creating a utils/config.py module instead utils/string.py https://github.com/rapidsai/cudf/pull/11447#discussion_r942147374 ?
Thanks for the reviews, I think I have addressed all of them?
Sorry, I had some still in-flight
No worries, I think I have addressed them now?
@vyasr do you have anything?
@brandon-b-miller do you have anything more?
@gpucibot merge