AlphaZero.jl
Visualizing the timeline of the inference server
To profile and optimize the current inference server architecture and to tune its hyper-parameters for various applications, it would be very useful for AlphaZero.jl to have a mode that outputs a debugging timeline. Such a timeline would make it easy to visualize when each worker submits an inference request, when it gets an answer, and when inference actually runs on the GPU (along with the batch size that is used).
More concretely, I imagine adding a profile_inference_timeline option to the simulate function. When this option is used:
- Every time a worker sends an inference request or gets an answer, the current worker id and wall-clock time are recorded (along with the id of the thread hosting the current worker?)
- We also record the wall-clock time every time the inference server sends a batch to the GPU and when it gets an answer (logging the batch size that was used would also be useful).
- This data could be dumped into a big JSON file and then visualized using another tool.
- One possible visualization would be a profiling timeline similar to the ones shown in "chrome://tracing", with one track per worker and a separate track for the inference server. Maybe we could even generate JSON output that is directly compatible with "chrome://tracing" (which is what the PyTorch profiler does, for example).
In particular, such a tool would be invaluable to investigate issues such as this one.
References on the Chrome profiler format:
- https://www.gamedeveloper.com/programming/in-depth-using-chrome-tracing-to-view-your-inline-profiling-data
- https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview
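To make the proposal above more concrete, here is a minimal sketch of such a recorder in plain Julia (standard library only). Everything in it is hypothetical and not part of AlphaZero.jl: the names `TraceEvent`, `Timeline`, `record!` and `write_trace`, the convention that track 0 is the inference server, and the hand-written JSON encoding (which assumes event names contain no characters that need escaping). Each request/answer pair is stored as one "complete" event (`"ph": "X"`, with `ts` and `dur` in microseconds), wrapped in a `traceEvents` array.

```julia
# One recorded span, stored in the shape of a Chrome trace "complete" event.
struct TraceEvent
    name::String                    # e.g. "inference_request" or "gpu_batch"
    track::Int                      # hypothetical: worker id, 0 = inference server
    thread::Int                     # id of the Julia thread hosting the worker
    start_us::Int                   # wall-clock start time, in microseconds
    dur_us::Int                     # duration, in microseconds
    batch_size::Union{Nothing,Int}  # only meaningful for GPU batches
end

# Shared, lock-protected event buffer so several workers can record concurrently.
struct Timeline
    events::Vector{TraceEvent}
    lock::ReentrantLock
end
Timeline() = Timeline(TraceEvent[], ReentrantLock())

now_us() = round(Int, time() * 1e6)  # wall-clock time in microseconds

# Run `f`, record how long it took on the given track, and return its result.
function record!(f, tl::Timeline, name::AbstractString, track::Integer;
                 batch_size=nothing)
    t0 = now_us()
    result = f()
    t1 = now_us()
    ev = TraceEvent(name, track, Threads.threadid(), t0, t1 - t0, batch_size)
    lock(tl.lock) do
        push!(tl.events, ev)
    end
    return result
end

# Encode one event as JSON by hand (assumes `name` needs no escaping).
function to_json(ev::TraceEvent)
    args = "{\"thread\":$(ev.thread)"
    if ev.batch_size !== nothing
        args *= ",\"batch_size\":$(ev.batch_size)"
    end
    args *= "}"
    return "{\"name\":\"$(ev.name)\",\"ph\":\"X\",\"pid\":0,\"tid\":$(ev.track)," *
           "\"ts\":$(ev.start_us),\"dur\":$(ev.dur_us),\"args\":$args}"
end

# Dump the whole timeline as a JSON object with a `traceEvents` array.
function write_trace(io::IO, tl::Timeline)
    events = lock(() -> copy(tl.events), tl.lock)
    print(io, "{\"traceEvents\":[")
    join(io, (to_json(ev) for ev in events), ",")
    print(io, "]}")
end

# Usage: one worker request and one GPU batch, with sleeps as stand-ins.
tl = Timeline()
record!(tl, "inference_request", 1) do
    sleep(0.01)   # stands in for waiting on the inference server
end
record!(tl, "gpu_batch", 0; batch_size=32) do
    sleep(0.005)  # stands in for the actual GPU call
end
open(io -> write_trace(io, tl), joinpath(mktempdir(), "trace.json"), "w")
```

Using the worker id as `tid` gives one track per worker in the viewer, and the hosting thread id and batch size show up in each event's `args` panel. A real implementation would more likely rely on a JSON package and on proper string escaping than on manual string building.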
What we need is a Julia implementation of https://opentelemetry.io/ (it's on my todo list ;)
I just implemented a basic version of this using chrome://tracing.
A JSON trace file can be automatically generated using the tracing logger defined in src/prof_utils.jl.
An example of how to do this can be found in scripts/profile/self_play.jl.
The generated trace can be visualized using Google Chrome's trace visualizer.