add section about very large arrays
This is my attempt to distill my experience with very large arrays. I got my numbers from Matt's blog post on Dask scaling limits.
I have probably said some inaccurate things that need to be corrected; this is just a first draft.
Thanks @rabernat!
Two thoughts:
- A lot of this seems general beyond dask arrays, we might want a general core best practices section that talks a bit about overhead and such.
- This makes claims that people shouldn't use Dask for datasets over several TB, and that they should use for loops instead. This was definitely true in your case (and I'm hoping that other Dask maintainers can help you resolve those problems in the near future), but it may not be true in general. I hesitate to tell people not to use Dask at this scale. Other groups do use Dask at this scale quite happily, they just have different parameters than you do. For example, they might use much larger chunk sizes, or they may not care about interactivity, and may instead be submitting batch jobs.
Instead, I wonder if we might pull out some parts of this, like encouraging people to use larger chunk sizes as graphs get large, to help users get to these larger scales more smoothly.
Both your comments make sense.
I understand that you might want to revise and moderate the way I describe Dask's scaling limitations. I am totally fine with that; you have a much broader view of the landscape than I do.