
Visualizing the timeline of the inference server

Open jonathan-laurent opened this issue 3 years ago • 4 comments

In order to profile and optimize the current inference server architecture and tune its hyperparameters for various applications, it would be very useful for AlphaZero.jl to have a mode that outputs a debugging timeline. This timeline would make it easy to visualize when each worker submits an inference request, when it receives an answer, and when inference actually runs on the GPU (along with information on the batch size that was used).

More concretely, I imagine adding a profile_inference_timeline option to the simulate function. When this option is used:

  • Every time a worker sends an inference request or gets an answer, the current worker id and the wall-clock time are recorded (along with the id of the thread hosting the current worker?)
  • We also record the wall-clock time every time the inference server sends a batch to the GPU and every time it gets an answer back (logging the batch size that was used would also be useful).
  • This data could be dumped into a big JSON file and then visualized using another tool.
  • One possible visualization would be a profiling timeline similar to the ones rendered by "chrome://tracing", with one track per worker and a separate track for the inference server. We could perhaps even generate JSON output that is directly compatible with "chrome://tracing" (which is what the PyTorch profiler does, for example).
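The "chrome://tracing" Trace Event Format is simple enough to emit directly. Below is a minimal Python sketch of the idea (the event names such as `inference_request` and `gpu_batch`, and the helper `complete_event`, are illustrative assumptions, not part of AlphaZero.jl):

```python
import json

# Chrome's Trace Event Format: a JSON object with a "traceEvents" list.
# A complete event ("ph": "X") has a name, a category, a start timestamp
# "ts" and a duration "dur" (both in microseconds), plus "pid"/"tid"
# fields that the viewer uses to assign events to tracks.

def complete_event(name, cat, start_us, dur_us, pid, tid, args=None):
    """Build one complete ("X") trace event."""
    ev = {"name": name, "cat": cat, "ph": "X",
          "ts": start_us, "dur": dur_us, "pid": pid, "tid": tid}
    if args:
        ev["args"] = args  # free-form metadata, e.g. the batch size
    return ev

# Hypothetical recording of one inference round: two workers wait on a
# request while the server runs a batch of size 2 on the GPU.
events = [
    complete_event("inference_request", "worker", 0, 500, pid=1, tid=1),
    complete_event("inference_request", "worker", 50, 450, pid=1, tid=2),
    complete_event("gpu_batch", "server", 100, 300, pid=1, tid=0,
                   args={"batch_size": 2}),
]

trace = {"traceEvents": events, "displayTimeUnit": "ms"}
print(json.dumps(trace, indent=2))
```

The resulting file can be loaded via the "Load" button on the chrome://tracing page; each `tid` shows up as its own track, which matches the one-track-per-worker layout described above.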

In particular, such a tool would be invaluable to investigate issues such as this one.

jonathan-laurent avatar Sep 10 '21 16:09 jonathan-laurent

References on the chrome profiler format.

  • https://www.gamedeveloper.com/programming/in-depth-using-chrome-tracing-to-view-your-inline-profiling-data
  • https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview
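For quick orientation, a minimal trace file in that format might look as follows (all labels and values are illustrative). Metadata events (`"ph": "M"`) with the name `thread_name` let the viewer label each track:

```json
{
  "traceEvents": [
    {"name": "thread_name", "ph": "M", "pid": 1, "tid": 1,
     "args": {"name": "worker 1"}},
    {"name": "thread_name", "ph": "M", "pid": 1, "tid": 0,
     "args": {"name": "inference server"}},
    {"name": "inference_request", "cat": "worker", "ph": "X",
     "ts": 0, "dur": 500, "pid": 1, "tid": 1},
    {"name": "gpu_batch", "cat": "server", "ph": "X",
     "ts": 100, "dur": 300, "pid": 1, "tid": 0,
     "args": {"batch_size": 16}}
  ]
}
```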

jonathan-laurent avatar Sep 10 '21 16:09 jonathan-laurent

What we need is a Julia implementation of https://opentelemetry.io/ (it's on my todo list ;)

findmyway avatar Sep 10 '21 17:09 findmyway

I just implemented a basic version of this using chrome://tracing. A JSON trace file can be generated automatically using the tracing logger defined in src/prof_utils.jl; an example of how to do this can be found in scripts/profile/self_play.jl. The generated trace can be visualized with Google Chrome's trace viewer.

jonathan-laurent avatar Sep 23 '21 15:09 jonathan-laurent

Screenshot:

[Screenshot from 2021-09-23 12-49-25]

jonathan-laurent avatar Sep 23 '21 17:09 jonathan-laurent