Expand spill logging

Open pentschev opened this issue 3 years ago • 11 comments

Lately there has been growing interest from users in gathering information about data spilled by Dask-CUDA. Initially, https://github.com/rapidsai/dask-cuda/pull/442 added the ability to log spilling times, which the user can query at will to get information on all spilling operations that happened. However, this is limited to the "default" spilling and is not available for on-demand/JIT-unspill. There is also no information other than the total time spent per operation, and there are no examples of how to use it.

I believe it would be useful to have the following added:

  • [ ] Support for on-demand/JIT-unspill;
  • [ ] Information on how much data is being spilled;
  • [ ] Examples of using spill logging;
    • [ ] Bonus points for an example using spill logging with PeriodicCallback;
  • [ ] Add tests.
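
To make the request concrete, here is a minimal sketch of what a spill log carrying both size and timing could look like. It is not Dask-CUDA's implementation: `SpillEvent`, `SpillLog`, `record`, `drain`, and `summarize` are hypothetical names invented for illustration, and the "spill" is just a callable that the log times.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SpillEvent:
    """One spill operation (hypothetical record, not a Dask-CUDA type)."""
    key: str        # identifier of the spilled object
    nbytes: int     # size of the spilled object in bytes
    seconds: float  # wall time spent moving the data


@dataclass
class SpillLog:
    """Collects spill events until someone drains them."""
    events: list = field(default_factory=list)

    def record(self, key, nbytes, spill_fn):
        # Time the callable that performs the actual spill and store one event.
        start = time.perf_counter()
        spill_fn()
        elapsed = time.perf_counter() - start
        self.events.append(SpillEvent(key, nbytes, elapsed))

    def drain(self):
        # Return the events collected since the last call and reset the log.
        events, self.events = self.events, []
        return events


def summarize(events):
    """Aggregate a batch of events into counts, bytes, and seconds."""
    return {
        "count": len(events),
        "total_bytes": sum(e.nbytes for e in events),
        "total_seconds": sum(e.seconds for e in events),
    }
```

A periodic reporter would then only need to call `summarize(log.drain())` on a timer, for example from a function scheduled with Tornado's `PeriodicCallback`. Draining on each tick keeps every report limited to the spills that happened since the previous one.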

cc @Matt711

pentschev avatar Feb 16 '22 14:02 pentschev

Also pinging @ayushdg @jnke2016 @randerzander who may have other feature requests in mind.

pentschev avatar Feb 16 '22 14:02 pentschev

FYI: I'm looking into the related problem of visualizing GPU spilling.

shwina avatar Feb 16 '22 15:02 shwina

> FYI: I'm looking into the related problem of visualizing GPU spilling.

You mean you want to visualize it but there's no way to do that, or there's a problem with the current visualizer (assuming there's one, TBH I don't know if there is)?

pentschev avatar Feb 16 '22 16:02 pentschev

Keeping the conversation going. Hey @shwina, I talked with @pentschev about this issue. If I can assist you with a similar issue, I'd love to.

Matt711 avatar Feb 22 '22 03:02 Matt711

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Mar 24 '22 04:03 github-actions[bot]

The issue is still in progress. I will begin working actively on it next week.

Matt711 avatar Mar 24 '22 08:03 Matt711

Will start working on this issue next week. I was busy with getting the Dask Operator ready for release.

Matt711 avatar Apr 15 '22 16:04 Matt711

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar May 15 '22 17:05 github-actions[bot]

We depend on JIT-unspill in most workflows now.

In trying to determine the right amount of GPU memory for a given workload, we'd like to know how often we spill, and how much time is spent spilling. There's not a good way to gather this information currently without manually looking at workflow profiles.

Since our profiles cover a great many jobs, that becomes an inordinately time-consuming process. It would be very useful for dask-cuda to log something like: timestamp, worker_id, memory request size, spilled object size, and time elapsed during the spill.

The above field names probably imply a misunderstanding about how spilling actually works, but I hope it conveys that with such information, we can programmatically find workloads that could be optimized to avoid spilling.
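
As a rough illustration of the programmatic analysis this would enable, the sketch below assumes a comma-separated log with exactly those five fields. The column names, the CSV layout, and the sample rows are all made up for the example; dask-cuda does not emit this format.

```python
import csv
import io
from collections import defaultdict

# Assumed column order for the hypothetical log format.
FIELDS = ["timestamp", "worker_id", "request_bytes", "spilled_bytes", "elapsed_s"]


def per_worker_totals(lines):
    """Count spills and sum spilled bytes and spill time for each worker."""
    totals = defaultdict(
        lambda: {"spills": 0, "spilled_bytes": 0, "elapsed_s": 0.0}
    )
    for row in csv.DictReader(lines, fieldnames=FIELDS):
        worker = totals[row["worker_id"]]
        worker["spills"] += 1
        worker["spilled_bytes"] += int(row["spilled_bytes"])
        worker["elapsed_s"] += float(row["elapsed_s"])
    return dict(totals)


# Made-up rows, for illustration only; these are not real measurements.
SAMPLE = """\
2022-07-18T19:00:00,worker-0,1048576,2097152,0.25
2022-07-18T19:00:01,worker-1,524288,524288,0.5
2022-07-18T19:00:02,worker-0,1048576,1048576,0.25
"""

totals = per_worker_totals(io.StringIO(SAMPLE))
for worker_id, stats in sorted(totals.items()):
    print(worker_id, stats)
```

Running the same aggregation over the logs of many jobs would flag the workloads where spill counts or spill time are high enough to justify more GPU memory.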

randerzander avatar Jul 18 '22 19:07 randerzander

I have been planning to implement this for JIT-unspill for some time, but now that we are introducing spilling in cuDF, might it be sufficient to include spill logging in cuDF?

madsbk avatar Aug 01 '22 08:08 madsbk

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Aug 31 '22 09:08 github-actions[bot]