Nic Eggert

29 comments by Nic Eggert

Any reason this PR was never merged?

We're still seeing this issue when running the latest Merlin image (`nvcr.io/nvidia/merlin/merlin-pytorch:22.06`), which includes CUDA 11.7, `dask-cuda==22.04`, and `pynvml==11.4.1`. Happens on both driver `515.48.07` and `510.47.03` if that makes any...

That didn't work, but setting the environment variable `export DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False` did. Thanks for pointing me in the right direction.
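For anyone who would rather apply the same workaround from Python than from the shell, here is a minimal sketch. It assumes the variable above maps to the Dask config key `distributed.diagnostics.nvml` under Dask's usual naming convention (`DASK_` prefix, upper case, double underscore between nesting levels). The helper `dask_env_var` is a hypothetical name used only for illustration; it is not part of Dask.

```python
import os


def dask_env_var(config_key: str) -> str:
    """Return the environment variable name for a dotted Dask config key.

    Hypothetical helper, not part of Dask. It follows Dask's documented
    convention: "DASK_" prefix, upper case, and "__" between nesting levels.
    """
    # Hyphens are not valid in environment variable names, so map them to "_".
    return "DASK_" + config_key.upper().replace(".", "__").replace("-", "_")


# Assumed config key behind the variable quoted in the comment above.
NVML_KEY = "distributed.diagnostics.nvml"

# Equivalent to `export DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False`,
# but only for this process and any child processes it spawns.
os.environ[dask_env_var(NVML_KEY)] = "False"
```

As far as I understand, Dask reads these variables when its configuration is loaded, so this should run before `dask` or `distributed` is imported, and the variable also has to be visible to the worker processes.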

Here are examples for 2g.20gb and 3g.40gb instances: ``` nvidia-smi Mon Dec 4 21:04:13 2023 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 | |-------------------------------+----------------------+----------------------+ | GPU Name...

Had to log into the node to run this, since normal Kubernetes containers don't seem to have the necessary permissions. ``` sudo nvidia-smi mig -lgi +-------------------------------------------------------+ | GPU instances: |...

These values are consistent across all 2g.20gb and 3g.40gb instances across a 6-node, 48 GPU Kubernetes cluster. The numbers in the original post are derived by querying prometheus for the...

For what it's worth, here's the mig-parted-config that we're providing via the GPU operator. ``` version: v1 mig-configs: "a100-80gb-x8-balanced": - devices: [0, 1, 2, 3, 4, 5] mig-enabled: true mig-devices:...

@nvidia-aalsudani Any idea what's going on here? Do you need more information?

@nikkon-dev Sorry for the slow response. Took us a while to free up a machine I could access bare metal on. I get the same results when running dcgmproftool in...