
Fatal error: concurrent map read and map write - CrashLoopBackOff

ujjwal opened this issue

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: 5.15.0-1045-gke
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): GKE
  • GPU Operator Version: 22.9.1

2. Issue or feature description

The GPU Operator has been reporting fatal error: concurrent map read and map write and crash looping. This happens sporadically and prevents new GPU nodes from being added to the cluster.

{"level":"info","ts":1711652616.7035823,"logger":"controllers.ClusterPolicy","msg":"Reconciliate ClusterPolicies after node label update","nb":1}
{"level":"info","ts":1711652616.703655,"logger":"controllers.ClusterPolicy","msg":"Kubernetes version detected","version":"v1.27.10-gke.1055000"}
fatal error: concurrent map read and map write

goroutine 216 [running]:
k8s.io/apimachinery/pkg/runtime.(*Scheme).New(0xc0002401c0, {{0x1d7bcdf, 0xa}, {0x1d762e6, 0x2}, {0x1905e27, 0xd}})
	/workspace/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:296 +0x65
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).objectTypeForListObject(0xc00049d710, {0x2073490?, 0xc0002cbb90})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:119 +0x3dd
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).List(0xc00049d710, {0x206a408, 0xc00024cdc0}, {0x2073490, 0xc0002cbb90}, {0x2f8bbc0, 0x0, 0x0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:75 +0x65
sigs.k8s.io/controller-runtime/pkg/client.(*client).List(0xc0004b86c0, {0x206a408, 0xc00024cdc0}, {0x2073490?, 0xc0002cbb90?}, {0x2f8bbc0, 0x0, 0x0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:365 +0x4c5
github.com/NVIDIA/gpu-operator/controllers.addWatchNewGPUNode.func1({0x206a408, 0xc00024cdc0}, {0xc001a09e20?, 0x424f05?})
	/workspace/controllers/clusterpolicy_controller.go:264 +0x8c
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue(0xc00160db40?, {0x206a408?, 0xc00024cdc0?}, {0x2073cc0, 0xc0007463a0}, {0x20821a8?, 0xc000e2a440?}, 0xc00160dbc8?)
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:81 +0x59
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create(0x206a408?, {0x206a408, 0xc00024cdc0}, {{0x20821a8?, 0xc000e2a440?}}, {0x2073cc0, 0xc0007463a0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:58 +0xe5
sigs.k8s.io/controller-runtime/pkg/internal/source.(*EventHandler).OnAdd(0xc0003c4140, {0x1d402e0?, 0xc000e2a440})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/event_handler.go:88 +0x27c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
	/workspace/vendor/k8s.io/client-go/tools/cache/controller.go:243
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x13e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005ddf38?, {0x204fec0, 0xc001602000}, 0x1, 0xc001600000)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x3a6b222c7d7d7b3a?, 0x3b9aca00, 0x0, 0x69?, 0x227b3a227d225c67?)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0005d4990)
	/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:967 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 181
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73

goroutine 1 [select]:
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start(0xc000622820, {0x206a408, 0xc000482aa0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:509 +0x825
main.main()
	/workspace/main.go:176 +0xea8

ujjwal avatar Mar 28 '24 20:03 ujjwal

+1, seeing this error with v23.9.1 & v23.9.2 in Vanilla K8s

age9990 avatar Mar 29 '24 02:03 age9990

+1, also seeing this error with v23.9.1

2024-04-03 16:29:36.737Z fatal error: concurrent map read and map write
2024-04-03 16:29:36.740Z 
2024-04-03 16:29:36.740Z goroutine 451 [running]:
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/runtime.(*Scheme).New(0xc0003642a0, {{0x1d7bcdf, 0xa}, {0x1d762e6, 0x2}, {0x1905e27, 0xd}})
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:296 +0x65
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).objectTypeForListObject(0xc00080a340, {0x2073490?, 0xc00259f7a0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:119 +0x3dd
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).List(0xc00080a340, {0x206a408, 0xc002523d60}, {0x2073490, 0xc00259f7a0}, {0x2f8bbc0, 0x0, 0x0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:75 +0x65
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/client.(*client).List(0xc0005345a0, {0x206a408, 0xc002523d60}, {0x2073490?, 0xc00259f7a0?}, {0x2f8bbc0, 0x0, 0x0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:365 +0x4c5
2024-04-03 16:29:36.740Z github.com/NVIDIA/gpu-operator/controllers.getClusterPoliciesToReconcile({0x206a408, 0xc002523d60}, {0x2075e60, 0xc0005345a0})
2024-04-03 16:29:36.740Z        /workspace/controllers/upgrade_controller.go:321 +0xaf
2024-04-03 16:29:36.740Z github.com/NVIDIA/gpu-operator/controllers.(*UpgradeReconciler).SetupWithManager.func1({0x206a408, 0xc002523d60}, {0xc001a66060?, 0x424f05?})
2024-04-03 16:29:36.740Z        /workspace/controllers/upgrade_controller.go:248 +0x3e
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue(0xc007a9bb40?, {0x206a408?, 0xc002523d60?}, {0x2073cc0, 0xc0004bbe00}, {0x20821a8?, 0xc005380dc0?}, 0xc007a9bbc8?)
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:81 +0x59
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create(0x206a408?, {0x206a408, 0xc002523d60}, {{0x20821a8?, 0xc005380dc0?}}, {0x2073cc0, 0xc0004bbe00})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:58 +0xe5
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/internal/source.(*EventHandler).OnAdd(0xc00001fae0, {0x1d402e0?, 0xc005380dc0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/event_handler.go:88 +0x27c
2024-04-03 16:29:36.740Z k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/client-go/tools/cache/controller.go:243
2024-04-03 16:29:36.740Z k8s.io/client-go/tools/cache.(*processorListener).run.func1()
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x13e
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000acef38?, {0x204fec0, 0xc007a82000}, 0x1, 0xc007a80000)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.Until(...)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
2024-04-03 16:29:36.740Z k8s.io/client-go/tools/cache.(*processorListener).run(0xc0079e4090)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:967 +0x69
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
2024-04-03 16:29:36.740Z created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 178
2024-04-03 16:29:36.741Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73

CecileRobertMichon avatar Apr 03 '24 16:04 CecileRobertMichon

How many workers are in the cluster? From the error log, it looks similar to the large-scale issue.

Devin-Yue avatar Apr 05 '24 07:04 Devin-Yue

> How many workers are in the cluster? From the error log, it looks similar to the large-scale issue.

About 400 GPUs

ujjwal avatar Apr 05 '24 19:04 ujjwal

@ujjwal @age9990 @CecileRobertMichon thanks for reporting this issue. A fix for this has been merged and will be included in our next release: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/commit/59802314ef1bb947ff45978c9163db5b7c9f7e93

If you would like to test the fix out beforehand, you can use the gpu-operator image built from this commit: registry.gitlab.com/nvidia/kubernetes/gpu-operator/staging/gpu-operator:59802314-ubi8
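For anyone on a Helm-based install who wants to try that staging image, one way is to override the operator image in the chart values. This is a sketch only: it assumes the standard chart value names (operator.repository, operator.image, operator.version) and your own release name and namespace.

```yaml
# Hypothetical Helm values override for testing the staging build.
# Apply with e.g.: helm upgrade <release> nvidia/gpu-operator -n <namespace> --reuse-values -f values-staging.yaml
operator:
  repository: registry.gitlab.com/nvidia/kubernetes/gpu-operator/staging
  image: gpu-operator
  version: 59802314-ubi8
```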

cdesiniotis avatar Apr 10 '24 16:04 cdesiniotis

Hi all -- GPU Operator 24.3.0 has been released and contains a fix for this issue. https://github.com/NVIDIA/gpu-operator/releases/tag/v24.3.0

I am closing this issue, but please re-open if you are still encountering this with 24.3.0.

cdesiniotis avatar May 02 '24 20:05 cdesiniotis