Does DLRM_v2 support H100?

Open xyyintel opened this issue 1 year ago • 2 comments

Does DLRM_v2 support H100? If it is supported, what environment did you use? I have tried CUDA 11.8 with PyTorch 1.14.0 or 2.1, torchrec 0.3.2 or 0.4.0, and fbgemm_gpu 0.3.2 or 0.4.1. None of these environments works.

xyyintel avatar Apr 11 '23 03:04 xyyintel

We never got to test this on H100 I think. cc @janekl if you've tried on H100.

erichan1 avatar Apr 28 '23 22:04 erichan1

Right, the development and testing involved only A100.

To get this working, you would at least need CUDA 12 and would have to compile FBGEMM for the Hopper architecture (SM90). But I have never tried this myself.
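
A quick way to sanity-check those two prerequisites is sketched below. This is only an illustration, not something that was run against the reference: the helper names `hopper_ready` and `check_current_env` are hypothetical, and the check covers only the installed PyTorch build (CUDA version, device compute capability, compiled architectures). It says nothing about whether fbgemm_gpu itself was built for SM90, which still has to be verified separately.

```python
def parse_version(text):
    """Turn a version string such as '12.1' into a tuple (12, 1)."""
    return tuple(int(part) for part in text.split(".")[:2])


def hopper_ready(cuda_version, device_capability, arch_list):
    """Return a list of problems; an empty list means the basic
    prerequisites mentioned above (CUDA 12, SM90) look satisfied.

    cuda_version      -- e.g. '12.1', or None for a CPU-only build
    device_capability -- (major, minor) tuple, e.g. (9, 0) for H100
    arch_list         -- architectures the PyTorch build targets, e.g. ['sm_80', 'sm_90']
    """
    problems = []
    if cuda_version is None or parse_version(cuda_version) < (12, 0):
        problems.append("PyTorch was not built against CUDA 12 or newer")
    if tuple(device_capability) < (9, 0):
        problems.append("GPU is not Hopper class (compute capability below 9.0)")
    if not any(arch.startswith("sm_90") for arch in arch_list):
        problems.append("PyTorch build does not include sm_90 kernels")
    return problems


def check_current_env():
    """Run the check against the local PyTorch install (requires a CUDA GPU)."""
    import torch  # imported lazily so the helpers above work without PyTorch

    if not torch.cuda.is_available():
        return ["CUDA is not available to PyTorch"]
    return hopper_ready(
        torch.version.cuda,
        torch.cuda.get_device_capability(0),
        torch.cuda.get_arch_list(),
    )
```

Calling `check_current_env()` on the target machine returns an empty list when all three checks pass. With the CUDA 11.8 setups from the original question, the first check would be reported as a problem.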

janekl avatar May 04 '23 14:05 janekl

Closing, as the reference was not tested on H100s. Note that there were multiple H100 DLRMv2 submissions in the MLPerf Training v4.0 round, as shown in the results table.

Training v4.0 implementations are in this repo

ShriyaPalsamudram avatar Jul 31 '24 20:07 ShriyaPalsamudram