Developer documentation
In an offline discussion about technical debt and code complexity, the valid concern was raised that many of our internal systems are not properly documented.
One example that came up is the current/new state machine (https://github.com/dask/distributed/issues/4413 https://github.com/dask/distributed/pull/5046), which is documented to some extent (https://distributed.dask.org/en/stable/scheduling-state.html and https://distributed.dask.org/en/stable/worker.html#internal-scheduling) but likely not well enough for another developer to make educated judgment calls about code changes.
I would like to collect topics, mostly for dask/dask and dask/distributed, where more extensive developer documentation would help either to onboard new developers or to familiarize existing developers with other areas of the code.
cc @jcrist @jrbourbeau @gjoseph92 @ncclementi
- [x] https://github.com/dask/distributed/issues/5413
- [ ] https://github.com/dask/distributed/issues/5414
- [ ] https://github.com/dask/distributed/issues/5415
- [ ] https://github.com/dask/dask/issues/7755
- [ ] https://github.com/dask/distributed/issues/5416
- [ ] https://github.com/dask/distributed/issues/5417
Thanks for opening this @fjetter!
A few topics that come to mind:
- Task states and valid state transitions, and how those are handled in the scheduler
- The worker state machine and how it relates to the above
- The path from dask collection -> HLG -> low level graph -> scheduler -> tasks (we have some docs on this already, but again probably not enough, and not easily discovered)
- Networking in distributed. What talks to what, and in what direction? Are multiple interfaces supported? What are the different comm types? Any security implications?
- Disk spilling/memory management. When does data move on the worker, and how is this configured?
- Cythonization in the scheduler. How is this project going, how is it configured and applied, ... (perhaps this is in an active issue?)
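To illustrate what the first two bullets are asking for, here is a minimal sketch of a table of valid task-state transitions. The state names are taken from the existing scheduling-state docs, but the `ALLOWED` table, `transition`, and `run` are hypothetical and cover only a simplified subset; the real scheduler handles more states and side effects.

```python
# Illustrative only: a simplified subset of transitions,
# not the scheduler's actual transition table.
ALLOWED = {
    "released": {"waiting", "forgotten"},
    "waiting": {"processing", "released"},
    "processing": {"memory", "erred", "released"},
    "memory": {"released", "forgotten"},
    "erred": {"released"},
    "forgotten": set(),
}


def transition(state: str, new_state: str) -> str:
    """Return new_state if the move is allowed, otherwise raise ValueError."""
    if new_state not in ALLOWED.get(state, set()):
        raise ValueError(f"invalid transition: {state} -> {new_state}")
    return new_state


def run(path):
    """Walk a sequence of states, validating every step along the way."""
    state = path[0]
    for nxt in path[1:]:
        state = transition(state, nxt)
    return state
```

Calling `run(["released", "waiting", "processing", "memory"])` walks the happy path, while `transition("memory", "processing")` raises `ValueError`. Developer docs for the real state machine would ideally explain why each edge exists, which a table like this cannot do on its own.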
I would add implementing `Cluster` classes to that list. Maybe custom adaptive classes too.
High level graphs are another area that has been mentioned as needing better developer docs. There is a tracking issue here: https://github.com/dask/dask/issues/7755
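As a rough illustration of the concept such docs would need to explain: a high level graph groups tasks into named layers, and the low-level graph is the flat mapping of keys to tasks obtained by merging them. The sketch below uses plain dicts and the documented Dask graph convention of `(callable, *args)` task tuples. The names `layers`, `flatten`, `resolve`, and `get` are hypothetical stand-ins, not the `HighLevelGraph` API, and the evaluation is deliberately naive.

```python
from operator import add, mul

# Hypothetical layers: layer name -> {key: task or literal value}.
layers = {
    "load": {("x", 0): 1, ("x", 1): 2},
    "double": {("y", 0): (mul, ("x", 0), 2), ("y", 1): (mul, ("x", 1), 2)},
    "sum": {"total": (add, ("y", 0), ("y", 1))},
}


def flatten(layers):
    """Merge all layers into one flat low-level graph."""
    graph = {}
    for layer in layers.values():
        graph.update(layer)
    return graph


def resolve(graph, arg):
    """Replace an argument by its computed value if it is a key in the graph."""
    try:
        if arg in graph:
            return get(graph, arg)
    except TypeError:
        # Unhashable arguments (e.g. lists) cannot be keys; pass them through.
        pass
    return arg


def get(graph, key):
    """Naively evaluate one key by recursion (no caching, no scheduling)."""
    value = graph[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        return func(*(resolve(graph, a) for a in args))
    return value


graph = flatten(layers)
total = get(graph, "total")
```

Here `graph` has five keys and `total` evaluates to `(1 * 2) + (2 * 2)`. Real high level graphs also track dependencies between layers and can defer materializing tasks, which this sketch leaves out.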
> Disk spilling/memory management. When does data move on the worker, and how is this configured?
https://distributed.dask.org/en/stable/worker.html#memory-management
Is this sufficient? Should I create a ticket to restructure/move this?
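For context, the linked page describes a set of thresholds expressed as fractions of the worker memory limit. A minimal sketch of how they stack up, assuming the default values given there; the `crossed` helper is hypothetical, and it collapses managed and process memory into a single number, which the real worker does not do.

```python
# Config keys and default fractions as described in the worker memory docs.
# Assumption: defaults may differ between versions; check your own config.
THRESHOLDS = {
    "distributed.worker.memory.target": 0.60,     # start spilling managed data
    "distributed.worker.memory.spill": 0.70,      # spill based on process memory
    "distributed.worker.memory.pause": 0.80,      # stop accepting new work
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
}


def crossed(fraction, thresholds=THRESHOLDS):
    """Return the short names of all thresholds at or below the given fraction."""
    return [
        key.rsplit(".", 1)[-1]
        for key, limit in thresholds.items()
        if fraction >= limit
    ]
```

For example, `crossed(0.75)` returns `["target", "spill"]`, and `crossed(0.5)` returns an empty list.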
I created dedicated issues for the topics you mentioned. We can move the discussion about the individual items to the respective tickets.
Apart from collecting further topics, I would be curious how we want to structure these new or already existing sections. While researching this in our current docs, I noticed that some of the information asked for here is already partially documented under "Developer Documentation" while other parts are under "Build understanding". This might be a judgment call for individual topics, but if there are general best practices to follow, we can discuss them here as well.