
`dcgmproftester11` does not work with MIG instances on A100

Open bergentruckung opened this issue 1 year ago • 3 comments

Hello,

We have some A100 GPUs, each partitioned into three 2g.20gb MIG instances. When we run dcgmproftester11 as a non-root user, it fails with the following error:

> dcgmproftester11 --no-dcgm-validation -t 1001 -d 300
CacheManager Init Failed. Error: -29

When we run the same command as root, it works:

> sudo dcgmproftester11 --no-dcgm-validation -t 1001 -d 300
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
^C

Have you seen this before? Note that it works fine as a non-root user on our GPUs that don't support MIG (T4s).

Thanks in advance.

bergentruckung · May 03 '23 14:05

I forgot to mention in the previous comment that we're on DCGM 3.1.7.

bergentruckung · May 03 '23 14:05

Tagging @nikkon-dev since you've responded to similar questions before. Feel free to loop in someone else if that's more appropriate.

Thanks in advance.

bergentruckung · May 03 '23 19:05

Hi team, could you please take a look at this issue? Thanks.

iprakhar22 · May 08 '23 13:05