Do we have to trace a real system in order to analyze a 10,000-GPU training system?

Open basicmi opened this issue 1 year ago • 3 comments

According to the ASTRA-sim 2.0 paper, the simulator runs on Chakra traces in order to "decouple parallelization strategies from the ASTRAsim implementation". Does that mean we have to trace a real 10,000-GPU AI training system before we can simulate and analyze a system at that scale?

Thanks

basicmi, Oct 31 '24 01:10

I have the same question

191220042, Nov 25 '24 02:11

I have the same question. In the Chakra paper, Section 6.2.3 (Target ML Training Tasks), you mention that MLPs are synthetically generated, while Transformers and DLRM are modeled on real-world models and converted into the Chakra schema by the converter. Does this mean that, for the experiments shown in Fig. 6, you had to extract ETs by running the workloads in production on 4, 16, and 64 GPUs (three separate runs, one for each GPU count)?

TomasGadea, Feb 13 '25 09:02

Same question. I can only test with the provided examples.

Lychee-ysy, Mar 20 '25 08:03