Do we have to trace a real system when we want to analyze a 10,000-GPU training system?
According to the ASTRA-sim 2.0 paper, the simulator runs on Chakra traces in order to "decouple parallelization strategies from the ASTRAsim implementation". Does that mean we have to trace a real 10,000-GPU AI training system before we can simulate and analyze a system at that scale?
Thanks
I have the same question
Also the same question. In the Chakra paper, Section 6.2.3 (Target ML Training Tasks) says that MLPs are synthetically generated, while Transformers and DLRM are modeled on real-world models and converted into the Chakra schema by the converter. Does this mean that for the experiments shown in Fig. 6 you had to extract ETs by running the workloads in production on 4, 16, and 64 GPUs (three times, once for each GPU count)? For the synthetic case, I have something like the sketch below in mind.
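To make my question concrete, here is a minimal sketch of what I imagine "synthetically generated" means: building per-NPU execution traces out of compute and collective-communication nodes without ever running on a real cluster. This is not the actual Chakra API or schema; all class, field, and label names here are hypothetical and only illustrate the idea.

```python
# Illustrative sketch only, NOT the real Chakra API: compose a synthetic
# per-NPU execution trace from compute and collective nodes, with no real
# cluster run required. All names and numbers are made up.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TraceNode:
    node_id: int
    name: str
    node_type: str          # e.g. "COMP" or "COMM_COLL" (hypothetical labels)
    runtime_us: int = 0     # compute duration for COMP nodes
    comm_bytes: int = 0     # payload size for COMM_COLL nodes
    deps: List[int] = field(default_factory=list)  # data dependencies


def synthetic_mlp_trace(num_layers: int, npu_id: int) -> List[TraceNode]:
    """One NPU's trace for a data-parallel MLP: each layer is a compute
    node followed by an all-reduce on its gradients."""
    nodes: List[TraceNode] = []
    for layer in range(num_layers):
        comp = TraceNode(
            node_id=2 * layer,
            name=f"npu{npu_id}_layer{layer}_fwd_bwd",
            node_type="COMP",
            runtime_us=500,                      # made-up per-layer cost
            deps=[2 * layer - 1] if layer else [],
        )
        allreduce = TraceNode(
            node_id=2 * layer + 1,
            name=f"npu{npu_id}_layer{layer}_allreduce",
            node_type="COMM_COLL",
            comm_bytes=4 * 1024 * 1024,          # made-up gradient size
            deps=[comp.node_id],
        )
        nodes.extend([comp, allreduce])
    return nodes


if __name__ == "__main__":
    # One trace per simulated NPU; nothing here needs a real 10,000-GPU run.
    traces = {npu: synthetic_mlp_trace(num_layers=4, npu_id=npu) for npu in range(8)}
    print(f"generated {len(traces)} per-NPU traces, {len(traces[0])} nodes each")
```

If the synthetic path really works like this, then only the real-workload cases (Transformers, DLRM) would need traces collected from actual GPU runs, which is what I'd like to confirm.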
Same question here. So far I can only test with the provided examples.