gpu-operator
Unsupported MIG Device specified 7g.80gb, expected 7g.79gb instead
I'm running into a strange error when trying to set up a 7g.80gb MIG config on my nodes. The nodes have 8x NVIDIA-A100-SXM4-80GB. I've installed the GPU operator from the Helm chart (v1.10.1) with these values:
- name: "nfd.enabled"
  value: "true"
- name: "mig.strategy"
  value: "mixed"
- name: "driver.version"
  value: "510.47.03"
- name: "driver.rdma.enabled"
  value: "false"
- name: "toolkit.enabled"
  value: "true"
- name: "toolkit.version"
  value: "v1.9.0-centos7"
- name: "migManager.enabled"
  value: "true"
Then, I'm labeling a node with the all-7g.80gb profile:
kubectl label nodes g660018773 nvidia.com/mig.config=all-7g.80gb
This gives a strange error in the nvidia-mig-manager pod logs: "Unsupported MIG Device specified 7g.80gb, expected 7g.79gb instead". Full logs are below.
If I create my own configmap and specify 7g.79gb, things seem to work fine. In this case, the device still shows up as 7g.80gb in the node information. So I guess there's a workaround, but it definitely seems like there's a bug somewhere.
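For anyone trying the same workaround, here is a rough sketch of what such a custom config could contain, rendered by a small Python helper. It assumes the usual mig-parted config layout (version, mig-configs, devices, mig-enabled, mig-devices). The config name all-7g.79gb and the helper render_mig_config are made up for illustration and do not come from the original post.

```python
def render_mig_config(config_name: str, profile: str, count: int = 1) -> str:
    """Render a single-entry mig-parted style config as YAML text.

    Assumption: the layout below mirrors the mig-parted config format;
    the config name passed in is hypothetical.
    """
    lines = [
        "version: v1",
        "mig-configs:",
        f"  {config_name}:",
        "    - devices: all",
        "      mig-enabled: true",
        "      mig-devices:",
        f'        "{profile}": {count}',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # One 7g.79gb instance per GPU, applied to every GPU on the node.
    print(render_mig_config("all-7g.79gb", "7g.79gb"), end="")
```

The rendered text would go into the custom configmap, and the node label would then reference the new config name instead of all-7g.80gb.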
Full log:
time="2022-05-11T16:53:16Z" level=info msg="Updating to MIG config: all-7g.80gb"
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=true'
Asserting that the requested configuration is present in the configuration file
Selected MIG configuration is valid
Getting current value of the 'nvidia.com/mig.config.state' node label
Current value of 'nvidia.com/mig.config.state=success'
Checking if the selected MIG config is currently applied or not
time="2022-05-11T16:53:16Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Checking if the MIG mode setting in the selected config is currently applied or not
If the state is 'rebooting', we expect this to always return true
time="2022-05-11T16:53:16Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Changing the 'nvidia.com/mig.config.state' node label to 'pending'
node/g660018773 labeled
Shutting down all GPU clients in Kubernetes by disabling their component-specific nodeSelector labels
node/g660018773 labeled
Waiting for the device-plugin to shutdown
pod/nvidia-device-plugin-daemonset-qvxzw condition met
Waiting for gpu-feature-discovery to shutdown
pod/gpu-feature-discovery-gsk5d condition met
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Applying the MIG mode change from the selected config to the node
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2022-05-11T16:53:30Z" level=debug msg="Parsing config file..."
time="2022-05-11T16:53:30Z" level=debug msg="Selecting specific MIG config..."
time="2022-05-11T16:53:30Z" level=debug msg="Running apply-start hook"
time="2022-05-11T16:53:30Z" level=debug msg="Checking current MIG mode..."
time="2022-05-11T16:53:30Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-05-11T16:53:30Z" level=debug msg=" GPU 0: 0x20B210DE"
time="2022-05-11T16:53:30Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-05-11T16:53:30Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:53:30Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-05-11T16:53:30Z" level=debug msg="Running pre-apply-mode hook"
time="2022-05-11T16:53:30Z" level=debug msg="Applying MIG mode change..."
time="2022-05-11T16:53:30Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-05-11T16:53:30Z" level=debug msg=" GPU 0: 0x20B210DE"
time="2022-05-11T16:53:30Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:53:30Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-05-11T16:53:30Z" level=debug msg=" Updating MIG mode: Enabled"
time="2022-05-11T16:53:35Z" level=debug msg=" Mode change pending: false"
time="2022-05-11T16:53:35Z" level=debug msg=" GPU 1: 0x20B210DE"
time="2022-05-11T16:53:35Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:53:35Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-05-11T16:53:35Z" level=debug msg=" Updating MIG mode: Enabled"
time="2022-05-11T16:53:39Z" level=debug msg=" Mode change pending: false"
time="2022-05-11T16:53:39Z" level=debug msg=" GPU 2: 0x20B210DE"
time="2022-05-11T16:53:39Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:53:39Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-05-11T16:53:39Z" level=debug msg=" Updating MIG mode: Enabled"
time="2022-05-11T16:53:43Z" level=debug msg=" Mode change pending: false"
time="2022-05-11T16:53:43Z" level=debug msg=" GPU 3: 0x20B210DE"
time="2022-05-11T16:53:43Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:53:43Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-05-11T16:53:43Z" level=debug msg=" Updating MIG mode: Enabled"
time="2022-05-11T16:53:48Z" level=debug msg=" Mode change pending: false"
time="2022-05-11T16:53:48Z" level=debug msg=" GPU 4: 0x20B210DE"
time="2022-05-11T16:53:48Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:53:48Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-05-11T16:53:48Z" level=debug msg=" Updating MIG mode: Enabled"
time="2022-05-11T16:53:52Z" level=debug msg=" Mode change pending: false"
time="2022-05-11T16:53:52Z" level=debug msg=" GPU 5: 0x20B210DE"
time="2022-05-11T16:53:52Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:53:52Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-05-11T16:53:52Z" level=debug msg=" Updating MIG mode: Enabled"
time="2022-05-11T16:53:56Z" level=debug msg=" Mode change pending: false"
time="2022-05-11T16:53:56Z" level=debug msg=" GPU 6: 0x20B210DE"
time="2022-05-11T16:53:56Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:53:56Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-05-11T16:53:56Z" level=debug msg=" Updating MIG mode: Enabled"
time="2022-05-11T16:54:01Z" level=debug msg=" Mode change pending: false"
time="2022-05-11T16:54:01Z" level=debug msg=" GPU 7: 0x20B210DE"
time="2022-05-11T16:54:01Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:54:01Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-05-11T16:54:01Z" level=debug msg=" Updating MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" Mode change pending: false"
time="2022-05-11T16:54:05Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Applying the selected MIG config to the node
time="2022-05-11T16:54:05Z" level=debug msg="Parsing config file..."
time="2022-05-11T16:54:05Z" level=debug msg="Selecting specific MIG config..."
time="2022-05-11T16:54:05Z" level=debug msg="Running apply-start hook"
time="2022-05-11T16:54:05Z" level=debug msg="Checking current MIG mode..."
time="2022-05-11T16:54:05Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 0: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:54:05Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 1: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:54:05Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 2: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:54:05Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 3: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:54:05Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 4: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:54:05Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 5: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:54:05Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 6: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:54:05Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 7: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:54:05Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-05-11T16:54:05Z" level=debug msg="Checking current MIG device configuration..."
time="2022-05-11T16:54:05Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 0: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG config: map[7g.80gb:1]"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 1: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG config: map[7g.80gb:1]"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 2: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG config: map[7g.80gb:1]"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 3: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG config: map[7g.80gb:1]"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 4: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG config: map[7g.80gb:1]"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 5: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG config: map[7g.80gb:1]"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 6: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG config: map[7g.80gb:1]"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 7: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" Asserting MIG config: map[7g.80gb:1]"
time="2022-05-11T16:54:05Z" level=debug msg="Running pre-apply-config hook"
time="2022-05-11T16:54:05Z" level=debug msg="Applying MIG device configuration..."
time="2022-05-11T16:54:05Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-05-11T16:54:05Z" level=debug msg=" GPU 0: 0x20B210DE"
time="2022-05-11T16:54:05Z" level=debug msg=" MIG capable: true\n"
time="2022-05-11T16:54:05Z" level=debug msg=" Updating MIG config: map[7g.80gb:1]"
time="2022-05-11T16:54:06Z" level=error msg="Unsupported MIG Device specified 7g.80gb, expected 7g.79gb instead"
time="2022-05-11T16:54:06Z" level=debug msg="Running apply-exit hook"
time="2022-05-11T16:54:06Z" level=fatal msg="Error setting MIGConfig: error attempting multiple config orderings: all orderings failed"
Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/g660018773 unlabeled
Changing the 'nvidia.com/mig.config.state' node label to 'failed'
node/g660018773 unlabeled
time="2022-05-11T16:54:06Z" level=error msg="Error: exit status 1"
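Since the log is long, a small filter helps pull out the lines that matter. The sketch below assumes the structured lines follow the time="..." level=... msg="..." shape seen above; the function name problem_lines is invented here, and plain-text lines (such as the kubectl output) are simply skipped.

```python
import re

# Matches lines shaped like: time="..." level=<level> msg="..."
LOG_LINE = re.compile(
    r'^time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"$'
)

# A few lines copied from the log above.
SAMPLE = """\
time="2022-05-11T16:54:05Z" level=debug msg=" Updating MIG config: map[7g.80gb:1]"
time="2022-05-11T16:54:06Z" level=error msg="Unsupported MIG Device specified 7g.80gb, expected 7g.79gb instead"
time="2022-05-11T16:54:06Z" level=debug msg="Running apply-exit hook"
time="2022-05-11T16:54:06Z" level=fatal msg="Error setting MIGConfig: error attempting multiple config orderings: all orderings failed"
node/g660018773 unlabeled
"""


def problem_lines(log_text, levels=("error", "fatal")):
    """Return (time, level, msg) tuples for structured lines at the given levels."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") in levels:
            found.append(
                (match.group("time"), match.group("level"), match.group("msg"))
            )
    return found


if __name__ == "__main__":
    for when, level, msg in problem_lines(SAMPLE):
        print(f"{when} [{level}] {msg}")
```

Run against the full log, this would surface the "Unsupported MIG Device" error together with the fatal lines around it.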
This could happen if this calculation gives back the wrong value (in this case 79): https://github.com/NVIDIA/mig-parted/blob/main/pkg/types/mig_profile.go#L48
We would need to dig into why this happens. Unfortunately, there is no way to query this name from the driver, but I just reconfirmed that this is the calculation the driver uses, so we need to look deeper into what is going on.
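To illustrate how a size calculation can land on 79 instead of 80, here is a minimal sketch. It is not the mig-parted or driver calculation; it only shows that the same memory size yields different profile names depending on the rounding rule. The 81152 MiB figure (79.25 GiB) and the function profile_name are assumptions picked for the example.

```python
import math


def profile_name(compute_slices: int, memory_mib: int, round_up: bool) -> str:
    """Build a '<slices>g.<size>gb' name from an instance memory size in MiB.

    Illustrative only: the real calculation lives in mig-parted and the driver.
    """
    gib = memory_mib / 1024
    size = math.ceil(gib) if round_up else math.floor(gib)
    return f"{compute_slices}g.{size}gb"


# Hypothetical instance size, chosen so the value falls between 79 and 80 GiB.
MEMORY_MIB = 81152

if __name__ == "__main__":
    print(profile_name(7, MEMORY_MIB, round_up=False))  # 7g.79gb
    print(profile_name(7, MEMORY_MIB, round_up=True))   # 7g.80gb
```

If the tool that names the profile and the tool that validates it disagree on the rounding step, a mismatch like "specified 7g.80gb, expected 7g.79gb" is the kind of error you would expect to see.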
Okay, any additional information you need that would help debug?
Actually, I was mistaken. When configured as 7g.79gb, it does consistently show up everywhere as 7g.79gb.
Hi @neggert. We have just released v0.13.0. This includes a fix to ensure that the resource names are generated consistently for MIG profiles. Please give it a go and let us know if you have any additional issues.
Thank you!
Hi, did you solve the problem?