Report Total Number of Accelerators for multi-host submissions

Open yeandy opened this issue 1 year ago • 0 comments

During the submission process, the summary CSV generated by the https://github.com/mlcommons/inference/blob/master/tools/submission/generate_final_report.py script reports two columns, Nodes and a#. Nodes comes from the number_of_nodes field (https://github.com/mlcommons/inference/blob/master/tools/submission/generate_final_report.py#L35) and a# comes from the accelerators_per_node field (https://github.com/mlcommons/inference/blob/master/tools/submission/generate_final_report.py#L38).

The intent of the summary script is to show the total number of accelerators. The current logic is fine for single-node submissions, but can be confusing for multi-node submissions. For example, a submission that uses 4 VMs/Nodes, each with 4 accelerator chips, will report Nodes as 4 and a# as 4. That is not wrong, but it is confusing, because the total of 16 accelerators does not appear anywhere in the summary.

Suggestion: We can add another column, total_a, or redefine a# so that it reports 16 accelerators total, where the script calculates 16 by multiplying number_of_nodes * accelerators_per_node.
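
To make the suggestion concrete, here is a minimal sketch of the calculation. It is not the code in generate_final_report.py; the helper names `total_accelerators` and `add_total_column` are hypothetical, and it assumes each row is a plain dict carrying the number_of_nodes and accelerators_per_node fields, possibly as strings.

```python
def total_accelerators(system: dict) -> int:
    """Return number_of_nodes * accelerators_per_node for one system.

    Hypothetical helper: assumes both fields are present and hold
    integers or integer-like strings.
    """
    nodes = int(system["number_of_nodes"])
    per_node = int(system["accelerators_per_node"])
    return nodes * per_node


def add_total_column(rows: list) -> list:
    """Return copies of the rows with an extra total_a entry."""
    return [{**row, "total_a": total_accelerators(row)} for row in rows]


# The example from this issue: 4 nodes with 4 accelerators each.
rows = [{"number_of_nodes": 4, "accelerators_per_node": 4}]
print(add_total_column(rows)[0]["total_a"])  # 16
```

Adding total_a as a separate column, instead of redefining a#, would keep existing consumers of the summary CSV working, at the cost of one more column.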

yeandy, Aug 20 '24 16:08