
RCCL Replayer update

Open AtlantaPepsi opened this issue 7 months ago • 1 comments

Details

Revised RCCL Replayer, also adding structured logging.

Work item: Internal

What were the changes?

  • adds the replayer as part of the RCCL workload; it captures the RCCL API called, the parameters used, and GPU context info (e.g. graph, device, timestamp)
  • supports binary and JSON output formats
  • makes some rudimentary modifications to parsing in the replayer; the replay logic still requires further changes
  • supports recording of collective calls, group operations, and communicator operations, except comm split
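
To illustrate what a captured call record with both output formats might look like, here is a minimal Python sketch. The `CallRecord` class, its field names (`api`, `count`, `device`, `graph_id`, `timestamp_ns`), and the fixed-width binary layout are all hypothetical and chosen only for illustration; the actual record layout introduced by this change lives in the RCCL sources and may differ.

```python
import json
import struct
from dataclasses import dataclass, asdict

# Hypothetical binary layout (NOT RCCL's actual format), little-endian, no padding:
# 16-byte API name, uint64 count, int32 device, int32 graph id, uint64 timestamp.
_LAYOUT = struct.Struct("<16sQiiQ")


@dataclass
class CallRecord:
    api: str           # name of the recorded API call, e.g. "ncclAllReduce"
    count: int         # element count passed to the call
    device: int        # GPU device index
    graph_id: int      # graph identifier, -1 when the call is not part of a graph
    timestamp_ns: int  # time of the call in nanoseconds

    def to_json(self) -> str:
        """Serialize the record as a single JSON object."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "CallRecord":
        """Rebuild a record from its JSON form."""
        return cls(**json.loads(text))

    def to_binary(self) -> bytes:
        """Serialize the record into the fixed-width binary layout."""
        name = self.api.encode("ascii")
        if len(name) > 16:
            raise ValueError("API name does not fit the 16-byte field")
        # struct pads the name with NUL bytes up to 16 bytes.
        return _LAYOUT.pack(name, self.count, self.device,
                            self.graph_id, self.timestamp_ns)

    @classmethod
    def from_binary(cls, blob: bytes) -> "CallRecord":
        """Rebuild a record from its binary form."""
        name, count, device, graph_id, ts = _LAYOUT.unpack(blob)
        return cls(name.rstrip(b"\0").decode("ascii"), count, device, graph_id, ts)
```

A fixed-width binary form keeps the per-call logging overhead small, while the JSON form is easier to inspect and to feed into other tools; both round-trip to the same record in this sketch.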

Why were the changes made?
The current RCCL Replayer relies on NCCL DEBUG output, which can be hard to parse depending on the workload (given its many levels and subsystems). It also supports only collectives and does not accurately replay other RCCL operations and context. In general, we needed a more detailed logging option for RCCL beyond the different levels of the NCCL DEBUG log, as well as a more powerful and efficient RCCL Replayer.

How was the outcome achieved?
Instead of relying on the NCCL log, we now have our own tailored replayer structures and objects as part of RCCL, which capture the required info as needed. The replaying part already exists in tools/rccl_replayer; further work is required to parse and replay the new log format.
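
As a rough sketch of the parsing and replay work that remains, the snippet below reads a structured log and dispatches each recorded call to a handler. It assumes, purely for illustration, that the JSON output holds one object per line with an `api` field; the `replay` function and the handler table are hypothetical and are not the existing code in tools/rccl_replayer.

```python
import json
from typing import Callable, Dict, Iterable, List


def replay(lines: Iterable[str],
           handlers: Dict[str, Callable[[dict], None]]) -> List[str]:
    """Dispatch each logged call to its handler.

    Returns the API names that had no handler, so unsupported
    operations are reported instead of being silently dropped.
    """
    skipped: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        handler = handlers.get(record["api"])
        if handler is None:
            skipped.append(record["api"])
        else:
            handler(record)
    return skipped
```

For example, calling `replay(log_lines, {"ncclAllReduce": run_all_reduce})` with a hypothetical `run_all_reduce` function would replay the all-reduce calls and return the names of any other recorded operations, such as `ncclCommSplit`, that the replayer does not yet handle.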

AtlantaPepsi, Mar 14 '25 19:03