RCCL Replayer update
Details
Revises the RCCL Replayer and adds structured logging.
Work item: Internal
What were the changes?
- Adds the replayer as part of the RCCL workload; it captures the RCCL API calls made, the parameters used, and GPU context information (e.g. graph, device, timestamp).
- Supports binary and JSON output formats.
- Makes rudimentary modifications to parsing in the replayer; the replay logic still requires further changes.
- Supports recording of collective calls and of group and communicator operations, except comm split.
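As a rough illustration of the recording idea listed above, the sketch below shows how one captured call could be written in both a binary and a JSON form. It is not the code in this PR: `CallRecord`, `CallKind`, `writeBinary`, `readBinary`, and `toJson` are hypothetical names, and the field set and byte layout are assumptions chosen only to mirror the kind of data described (API called, parameters, device, graph state, timestamp).

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical call kinds; illustrative only, not the set used by RCCL.
enum class CallKind : std::uint8_t {
  AllReduce = 0,
  GroupStart = 1,
  GroupEnd = 2,
  CommInitRank = 3
};

// Hypothetical record for one captured API call (assumed fields).
struct CallRecord {
  CallKind kind;              // which API was called
  std::int32_t device;        // GPU device index
  std::uint8_t graphCaptured; // 1 if the call was made during graph capture
  std::uint64_t count;        // element count parameter (0 if not applicable)
  std::uint64_t timestampNs;  // capture time in nanoseconds
};

inline const char* kindName(CallKind k) {
  switch (k) {
    case CallKind::AllReduce:    return "AllReduce";
    case CallKind::GroupStart:   return "GroupStart";
    case CallKind::GroupEnd:     return "GroupEnd";
    case CallKind::CommInitRank: return "CommInitRank";
  }
  return "Unknown";
}

// Fields are written one by one so struct padding never reaches the file.
// Values are stored in host byte order in this sketch.
template <typename T>
void putField(std::ostream& os, const T& value) {
  os.write(reinterpret_cast<const char*>(&value), sizeof(T));
}

template <typename T>
bool getField(std::istream& is, T& value) {
  is.read(reinterpret_cast<char*>(&value), sizeof(T));
  return static_cast<bool>(is);
}

void writeBinary(std::ostream& os, const CallRecord& r) {
  putField(os, r.kind);
  putField(os, r.device);
  putField(os, r.graphCaptured);
  putField(os, r.count);
  putField(os, r.timestampNs);
}

// Returns false if the stream ends before a full record has been read.
bool readBinary(std::istream& is, CallRecord& r) {
  return getField(is, r.kind) && getField(is, r.device) &&
         getField(is, r.graphCaptured) && getField(is, r.count) &&
         getField(is, r.timestampNs);
}

// Emits one JSON object per record; all values are numbers, booleans, or
// fixed names, so no string escaping is needed here.
std::string toJson(const CallRecord& r) {
  std::ostringstream out;
  out << "{\"call\":\"" << kindName(r.kind) << "\""
      << ",\"device\":" << r.device
      << ",\"graphCaptured\":" << (r.graphCaptured ? "true" : "false")
      << ",\"count\":" << r.count
      << ",\"timestampNs\":" << r.timestampNs << "}";
  return out.str();
}
```

Writing fixed-width fields individually keeps the binary form compact and independent of compiler padding, while the JSON form stays readable for inspection; a parser on the replay side would need the matching `readBinary` logic.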
Why were the changes made?
The current RCCL Replayer relies on NCCL DEBUG output, which can be hard to parse depending on the workload, given its many levels and subsystems. It also supports only collectives and does not accurately replay other RCCL operations and context. More generally, we needed a more detailed RCCL logging option than the different levels of the NCCL DEBUG log provide, as well as a more powerful and efficient RCCL Replayer.
How was the outcome achieved?
Instead of relying on the NCCL log, RCCL now has its own tailored replayer structures and objects, which capture the required information as needed. The replaying part already existed in tools/rccl_replayer; further work is required to parse and replay the new log format.