
[DISCUSSION] Consider increasing default host memory limit per dask-cuda-worker

Open randerzander opened this issue 6 years ago • 9 comments

Several users have reported problems where dask-cuda-worker processes die in unexpected ways. After some debugging, they find the cause is exceeding the host memory limit, particularly when loading large training sets into GPU memory.

This is surprising to users, as it's not clear when or how a significant amount of host memory might be used, especially since RAPIDS projects focus on running as much as possible on GPUs.
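
For context, the limit in question is the per-worker host memory limit enforced by the worker nanny. A minimal sketch of where that knob is exposed in Python (the `32GB` figure below is purely illustrative, not dask-cuda's default):

```python
# Minimal sketch: the host memory limit under discussion is set per worker.
# The value here is illustrative only; it is not dask-cuda's default.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(memory_limit="32GB")  # host (not GPU) memory per worker
client = Client(cluster)
```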

randerzander avatar Nov 06 '19 17:11 randerzander

@randerzander we started the conversation about this offline, could you add a bit more context/examples of when things fail for you? Also, based on your experience, how did you set up the memory limit so that it generally worked?

cc @mrocklin for visibility

pentschev avatar Nov 08 '19 17:11 pentschev

cc @quasiben for visibility

mrocklin avatar Nov 22 '19 19:11 mrocklin

@randerzander @beckernick @VibhuJawa is this still relevant? Is there any additional information you could share as to what better defaults would look like?

pentschev avatar May 05 '20 22:05 pentschev

Friendly nudge @randerzander @beckernick @VibhuJawa 😉

jakirkham avatar Jul 02 '20 00:07 jakirkham

Thanks for the bump John. Anecdotally, we find that the most effective setup includes setting the host memory limit to the maximum available system memory (`free -m | awk '/^Mem:/{print $2}'`). I'm interested to hear if folks think the system maximum is too high for a default.
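
A rough sketch of that setup in Python, assuming `psutil` is available to read total system memory (the same total that `free -m` reports). Note that the limit applies per worker, so on a multi-GPU machine every worker is allowed the full machine's RAM:

```python
# Sketch of the setup described above: pin each worker's host memory limit to
# the machine's total RAM. memory_limit is per worker, so with multiple GPUs
# every worker is allowed the full system memory.
import psutil
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

total_system_memory = psutil.virtual_memory().total  # bytes, same total `free` reports

cluster = LocalCUDACluster(memory_limit=total_system_memory)
client = Client(cluster)
```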

beckernick avatar Jul 02 '20 02:07 beckernick

IMO, this would be too dangerous for a default. It seems that this was the best setup for TPCx-BB, which was running in an exclusive environment, but that won't be the case for every dask-cuda user. For instance, running such a setup on a desktop shared with other running applications may render the system very unstable due to main memory filling up completely.

pentschev avatar Jul 02 '20 07:07 pentschev

IMO, this would be too dangerous for a default. ...

I agree with Peter here. What's most effective for a given workflow doesn't necessarily translate to what's most effective for a default. A quick thought, though:

Naively, I'd expect Dask to start spilling at 60-70% of host memory capacity, and then terminate at 95%. This feels to me like a good default for termination. We've made a lot of changes since last November. Is exceeding host memory while reading large files still as big of an issue? Is it possible this was related to spilling issues rather than host memory capacity issues?
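
For reference, those fractions correspond to the worker memory thresholds in the dask.distributed config; a sketch of setting them explicitly (the values shown are the distributed defaults as I understand them):

```python
# Sketch: the spill/terminate fractions referenced above map onto
# dask.distributed's worker memory thresholds, shown here with their defaults.
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause new task execution
    "distributed.worker.memory.terminate": 0.95,  # nanny terminates the worker
})
```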

beckernick avatar Jul 02 '20 14:07 beckernick

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions[bot] avatar Feb 16 '21 19:02 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar May 17 '21 19:05 github-actions[bot]