[DLRM v2] Using the model for the inference reference implementation

Open pgmpablo157321 opened this issue 1 year ago • 6 comments

I am currently writing the reference implementation and am stuck deploying the model on multiple GPUs.

Here is a link to the PR: https://github.com/mlcommons/inference/pull/1373

Here is a link to the file where the model is: https://github.com/mlcommons/inference/blob/7c64689b261f97a4fc3410bff584ac2439453bcc/recommendation/dlrm_v2/pytorch/python/backend_pytorch_native.py

Currently this works for a debugging model and a single GPU, but fails when I try to run it on multiple GPUs. Here are the issues I have:

  1. If I run the benchmark, it gets stuck at this line. That line needs to run on each rank, but I have not found a way to run it per rank, load the model into a variable, and keep it there so it can be queried.
  2. Running the benchmark on CPU, I get one of the following errors:
[...]'fbgemm' object has no attribute 'jagged_2d_to_dense' (this happens when importing torchrec)

or

[...]fbgemm object has no attribute 'bounds_check_indices' (this happens when making a prediction)
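One way to narrow down errors like the two above is to check which operators actually resolve on the fbgemm namespace before running the benchmark. The following is a minimal sketch, not part of the reference implementation: `missing_ops` is a hypothetical helper, and a `SimpleNamespace` stands in for the real namespace so the example runs without torch installed.

```python
from types import SimpleNamespace


def missing_ops(namespace, op_names):
    """Return the names in op_names that cannot be resolved on namespace."""
    missing = []
    for name in op_names:
        try:
            getattr(namespace, name)
        except (AttributeError, RuntimeError):
            # Both exception types are caught because the exact type raised
            # for an unknown operator may differ between versions (assumption).
            missing.append(name)
    return missing


# Stand-in namespace that only provides one of the two operators.
fake_fbgemm = SimpleNamespace(jagged_2d_to_dense=lambda *args: None)
print(missing_ops(fake_fbgemm, ["jagged_2d_to_dense", "bounds_check_indices"]))
# -> ['bounds_check_indices']
```

In a real environment the assumption is that, after `import torch` and `import fbgemm_gpu`, the same helper can be called with `torch.ops.fbgemm` as the namespace. A non-empty result would suggest the installed fbgemm build does not provide those operators.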

This may be because I am trying to load a sharded model on a different number of ranks. Do you know if that could be related?
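For the first issue (running the loading step on each rank and keeping the model around to query), one possible pattern is a long-lived worker process per rank that loads its part of the model once and then answers requests from a queue. The sketch below only illustrates that pattern with the standard library: `load_shard`, `rank_worker`, and `RankPool` are hypothetical names, plain `multiprocessing` stands in for `torch.multiprocessing`, and sending every rank the same batch is a simplification rather than how the sharded model necessarily consumes input.

```python
import multiprocessing as mp


def load_shard(rank, world_size):
    """Hypothetical stand-in for loading the model shard owned by this rank."""
    return lambda batch: [x * world_size + rank for x in batch]


def rank_worker(rank, world_size, requests, responses):
    # Each rank runs the loading step itself, so the model stays resident in
    # this process. In a real setup, device selection and process-group
    # initialization would presumably also happen here (assumption).
    model = load_shard(rank, world_size)
    responses.put((rank, "ready", None))
    while True:
        item = requests.get()
        if item is None:  # shutdown sentinel
            break
        query_id, batch = item
        responses.put((rank, query_id, model(batch)))


class RankPool:
    """Keeps one worker per rank alive and forwards queries to all of them."""

    def __init__(self, world_size, start_method="spawn"):
        ctx = mp.get_context(start_method)
        self.world_size = world_size
        self.responses = ctx.Queue()
        self.requests = [ctx.Queue() for _ in range(world_size)]
        self.procs = [
            ctx.Process(
                target=rank_worker,
                args=(rank, world_size, self.requests[rank], self.responses),
            )
            for rank in range(world_size)
        ]
        for proc in self.procs:
            proc.start()
        # Block until every rank reports that its model is loaded.
        for _ in range(world_size):
            self.responses.get(timeout=60)

    def predict(self, query_id, batch):
        # Send the query to every rank, then collect one answer per rank.
        for queue in self.requests:
            queue.put((query_id, batch))
        results = {}
        for _ in range(self.world_size):
            rank, _, output = self.responses.get(timeout=60)
            results[rank] = output
        return results

    def close(self):
        for queue in self.requests:
            queue.put(None)
        for proc in self.procs:
            proc.join()
```

With the default `"spawn"` start method, the code that creates the pool has to sit under an `if __name__ == "__main__":` guard. A typical use would be `pool = RankPool(2)`, then `pool.predict("q1", [1, 2])` for each query, and `pool.close()` at the end. The point of the design is that the benchmark process never holds the model itself; it only talks to the ranks that do.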

I have tried PyTorch versions 1.12, 1.13, 2.0.0, and 2.0.1, and fbgemm versions 0.3.2 and 0.4.1.

pgmpablo157321 avatar May 17 '23 00:05 pgmpablo157321

Hi Pablo, you need to install fbgemm-gpu-cpu==0.3.2 to avoid this error.

yuankuns avatar May 17 '23 00:05 yuankuns

I already have this version, but the error persists:

Name: fbgemm-gpu-cpu
Version: 0.3.2
Summary: 
Home-page: https://github.com/pytorch/fbgemm
Author: FBGEMM Team
Author-email: [email protected]
License: BSD-3
Location: /opt/conda/lib/python3.7/site-packages
Requires: 
Required-by:

pgmpablo157321 avatar May 17 '23 04:05 pgmpablo157321

Have you tried removing fbgemm-gpu as well?

yuankuns avatar May 17 '23 14:05 yuankuns

@yuankuns When I remove fbgemm-gpu, I get the following import error:

ModuleNotFoundError: No module named 'fbgemm_gpu'

I managed to run the CPU version with fbgemm-gpu-cpu==0.3.2, fbgemm-gpu==0.4.1, and pytorch==1.13.1 on a machine with a GPU. Without a GPU I get an fbgemm error like the ones I posted before.
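Since the outcome here seems to depend on which of these packages are installed side by side, it can help to record the exact combination on each machine. Below is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the distribution name `torch` is an assumption about how PyTorch was installed. `importlib.metadata` requires Python 3.8 or newer, so the Python 3.7 environment shown earlier would need the `importlib_metadata` backport instead.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Prints whatever is installed locally; the result differs per machine.
print(installed_versions(["torch", "fbgemm-gpu", "fbgemm-gpu-cpu"]))
```

Comparing this output between the machine with a GPU and the one without would show whether the two environments really have the same packages.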

pgmpablo157321 avatar May 17 '23 21:05 pgmpablo157321

@pgmpablo157321 That's interesting: there is no GPU on our server, and installing only fbgemm-gpu-cpu==0.3.2 works in our case.

yuankuns avatar May 17 '23 21:05 yuankuns

@pgmpablo157321 is this still an issue?

ShriyaPalsamudram avatar Jul 31 '24 20:07 ShriyaPalsamudram