Custom spilling handler
This PR enables objects to register a spilling handler for a specific spillable buffer. For now, this is used by RangeIndex to delete rather than spill its cached data.
Motivation
Normally, a RangeIndex, which consists of a start, stop, step, and a dtype, doesn’t use any device memory and spilling it should be a no-op. However, if the RangeIndex has been materialized, the spill manager might decide to spill the underlying buffer. This can degrade performance and increase memory usage significantly.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Codecov Report
Base: 85.69% // Head: 85.69% // No change to project coverage :thumbsup:
Coverage data is based on head (
9a5a814) compared to base (9a5a814). Patch has no changes to coverable lines.
:exclamation: Current head 9a5a814 differs from pull request most recent head dfb0983. Consider uploading reports for the commit dfb0983 to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## branch-23.02 #12287 +/- ##
=============================================
Coverage 85.69% 85.69%
=============================================
Files 155 155
Lines 24798 24798
=============================================
Hits 21251 21251
Misses 3547 3547
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
It might be possible to do in a decorator that could replace cached_property. Would that be sufficient?
On Fri, 2 Dec 2022 at 17.58, Ashwin Srinath @.***> wrote:
@.**** commented on this pull request.
In python/cudf/cudf/core/index.py https://github.com/rapidsai/cudf/pull/12287#discussion_r1038341803:
self._start, self._stop, self._step, dtype=self.dtype)
manager = get_global_manager()I'm concerned about knowledge of the spilling manager leaking into other cuDF types like RangeIndex. Is there any way we can do this totally outside of RangeIndex?
— Reply to this email directly, view it on GitHub https://github.com/rapidsai/cudf/pull/12287#pullrequestreview-1202796309, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAH6FQB2C46KAPJGHXF7QR3WLITD5ANCNFSM6AAAAAASR7ESWE . You are receiving this because you authored the thread.Message ID: @.***>
It might be possible to do in a decorator that could replace cached_property. Would that be sufficient?
Yes, I think it would be a significant improvement
It might be possible to do in a decorator that could replace cached_property. Would that be sufficient?
Yes, I think it would be a significant improvement
I have created a decorator instead, do you think this is acceptable @shwina?
Thanks! I do think this is an improvement. Some questions/thoughts:
-
cached_property_delete_column_when_spilledneeds the property to return aColumn, but what about cached properties that possibly return buffers or arrays? Would each of those need a customcached_propertyspecialization that does things slightly differently with the return value? -
Can we aim to make the changes to
RangeIndex(and potentially other classes) even more minimal? For example:- Instead of using
cached_property_delete_when_spilled, can we universally use a custom cached property in cuDF that knows about spilling? That is, everywhere that we currently usefunctools.cached_property, can we usecudf.utils.cached_property? Or are there situations we don't want to delete cached device objects when spilled? - Instead of using a decorator, can we monkeypatch methods like
RangeIndex._valueswhen spilling is enabled? This allowsRangeIndexto be truly untouched.
Admittedly I don't think either of those are fantastic suggestions; curious what you think?
- Instead of using
cached_property_delete_column_when_spilledneeds the property to return aColumn, but what about cached properties that possibly return buffers or arrays? Would each of those need a customcached_propertyspecialization that does things slightly differently with the return value?
No, I think we can generalize the decorator to support any type. Do you know, if we are caching columns, series, arrays, or dataframes anywhere else? I am still seeing a difference between the memory usage of JIT-unspill and cudf-spilling, which could be because of caching. Note, JIT-unspill circumvent this issue because it uses .serialize() for spilling, which doesn't serialize cached data already.
Instead of using
cached_property_delete_when_spilled, can we universally use a custom cached property in cuDF that knows about spilling? That is, everywhere that we currently usefunctools.cached_property, can we usecudf.utils.cached_property? Or are there situations we don't want to delete cached device objects when spilled?
~No, this should be doable.~
There might be some cases where it is a problem, I will investigate.
@shwina, I have generalized and renamed the decorator. Now, it can be used everywhere with no ill effect.
Retargeted to 23.04
@madsbk is this still something that we want? If so, at this point should it still wait until after you've done the final merging of COW/spilling to avoid conflicting code?
@madsbk is this still something that we want? If so, at this point should it still wait until after you've done the final merging of COW/spilling to avoid conflicting code?
Yes and yes :)
The materialization of RangeIndex is something we want to avoid but let's wait until COW and spilling has been unified.
Note, the above unification is done now so this work is unblocked if desired.
While I think we may still want something like this feature eventually, as pylibcudf and cudf-polars mature and as we develop more spilling functionality in rapidsmpf, I think we are going to have to reenvision our spilling strategies in general and this approach is probably not going to be the right one since it is so specific to the spilling logic used in cudf. I think we need more general solutions in the long run.