aishwaryaraimule21

Results: 4 comments by aishwaryaraimule21

Hi team, the total memory reported by nvidia-smi is greater than DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE. I see that we are missing DCGM_FI_DEV_FB_RESERVED. I understand we can leverage https://github.com/NVIDIA/gpu-operator/pull/949/files to configure a list of custom metrics...

@andreyvelich I have signed the DCO. Please check. Thanks.

@andreyvelich I have tested the distributed training workflow using an older trainer image from the `release-1.9` branch. With the latest trainer package, I am running into an OOM error for the...

> Do you want to try to update other packages and try it again @aishwaryaraimule21?

Yes, @andreyvelich. Let me try updating other packages. I tried running this example for...