DCGM
`dcgmproftester11` does not work with MIG instances on A100
Hello,
We have some A100 GPUs that are each split into three 2g.20gb MIG instances (3x2g.20gb). When we try to run dcgmproftester11
as a non-root user, it fails with the following error:
> dcgmproftester11 --no-dcgm-validation -t 1001 -d 300
CacheManager Init Failed. Error: -29
When we run the same command as root, it works:
> sudo dcgmproftester11 --no-dcgm-validation -t 1001 -d 300
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
^C
Have you seen this before? Note that it works fine as a non-root user on our GPUs that do not support MIG (T4s).
Thanks in advance.
I forgot to mention this in my previous comment: we're on DCGM 3.1.7.
Tagging @nikkon-dev since you've responded to previous queries. Feel free to loop in someone else if you feel that's better.
Thanks in advance.
Hi team, kindly have a look at this issue. Thanks.