[bug] StorageError: invalid object
Hi All,
We are seeing a lot of error logs when running our operator in production, and we can't figure out what is happening. The basic flow of what we do is as follows:
func (r *SparkReconciler) createSparkSubmitterPod(ctx context.Context, spark *runtimev1.Spark) error {
    var pod corev1.Pod
    err := r.Get(ctx, client.ObjectKey{
        Name:      submitterPodName(spark),
        Namespace: spark.Namespace,
    }, &pod)
    if apierrors.IsNotFound(err) {
        r.Log.V(1).Info("Creating the spark submitter pod")
        return r.doCreateSparkSubmitterPod(ctx, spark)
    }
    // err is nil when the pod was found in the cache, so Wrap returns nil here.
    return errors.Wrap(err, "Failed getting the pod")
}
As you can see in the code snippet, we only create the pod when it does not exist. Because we read from the cache, we know the Get call can return NotFound even when the pod actually exists. So we do the following in doCreateSparkSubmitterPod:
    if err := r.Client.Create(ctx, pod); err != nil {
        // The cached Get above can be stale, so tolerate AlreadyExists.
        if apierrors.IsAlreadyExists(err) {
            return nil
        }
        return errors.Wrap(err, "Failed creating the pod")
    }
So we tolerate an AlreadyExists error there. However, we still see a lot of errors like this in production:
"error":"Operation cannot be fulfilled on pods \"255f70af-5699-46c6-8002-a4df45af5209\": StorageError: invalid object, Code: 4, Key: /registry/pods/addatatest2/255f70af-5699-46c6-8002-a4df45af5209, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: f8104e60-1a9d-40ca-847e-52bc0f556844, UID in object meta: "
We have no clue how to proceed with this. I think it might happen when the pod already exists, but I'm not entirely sure. Perhaps someone can shed some light on this?
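(For context, one way we could avoid acting on a stale cached read would be to use the manager's uncached reader for this particular Get; a rough sketch, assuming we keep a reference to the manager, which our real code doesn't necessarily do:)

    // Sketch only: read with the API reader (bypasses the cache), so
    // NotFound really means the pod does not exist on the API server.
    apiReader := mgr.GetAPIReader() // client.Reader backed by the live API server

    var pod corev1.Pod
    err := apiReader.Get(ctx, client.ObjectKey{
        Name:      submitterPodName(spark),
        Namespace: spark.Namespace,
    }, &pod)
    if apierrors.IsNotFound(err) {
        return r.doCreateSparkSubmitterPod(ctx, spark)
    }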
I also had the same problem
I also ran into this problem when operating on a CRD.
Looks like an error when the API server tried to store the object in etcd. Maybe you should create an issue in k/k for sig-api-machinery.
This may be due to the ResourceVersion field.
Is there any update on this?
Same problem, any update?
Same problem here, any update on this?
I'm not that familiar with the problem, but I can offer an opinion. Pods have a graceful termination period: during that time the pod is updated with a DeletionTimestamp but still exists.
You might guard against that by checking pod.ObjectMeta.DeletionTimestamp.IsZero()
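A rough sketch of that guard, reusing the names from the snippet above (the early-return behaviour is just illustrative):

    // Sketch: if the cached Get finds the pod but it is terminating, skip the
    // create for now; the delete event will trigger another reconcile later.
    var pod corev1.Pod
    err := r.Get(ctx, client.ObjectKey{
        Name:      submitterPodName(spark),
        Namespace: spark.Namespace,
    }, &pod)
    switch {
    case apierrors.IsNotFound(err):
        return r.doCreateSparkSubmitterPod(ctx, spark)
    case err != nil:
        return errors.Wrap(err, "Failed getting the pod")
    case !pod.ObjectMeta.DeletionTimestamp.IsZero():
        // Pod still exists but is being gracefully terminated; don't try to
        // create a replacement with the same name yet.
        return nil
    }
    return nil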
I hope this helps :relaxed:
This isn't a controller runtime bug or issue
/close
@vincepri: Closing this issue.
I'd suggest opening an issue in kubernetes/kubernetes instead if the problem persists.
Same problem, any update?
Hey @Akiqqqqqqq, I'll try to help here
When you set up your controller, try using: WithEventFilter(predicate.ResourceVersionChangedPredicate{})
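For example, wired into the controller setup it would look roughly like this (a sketch; the watched type and the import path for your API package are placeholders):

    import (
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/predicate"

        runtimev1 "example.com/your-operator/api/v1" // placeholder for your API package
    )

    func (r *SparkReconciler) SetupWithManager(mgr ctrl.Manager) error {
        return ctrl.NewControllerManagedBy(mgr).
            For(&runtimev1.Spark{}).
            // Drop events whose ResourceVersion has not changed (e.g. cache resyncs).
            WithEventFilter(predicate.ResourceVersionChangedPredicate{}).
            Complete(r)
    }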
@nikola-jokic Can you elaborate on how that could help?
As per https://github.com/kubernetes-sigs/controller-runtime/issues/1740#issuecomment-992062994 this is meant to filter out the resync events (which come with objects with the same ResourceVersion as already observed).
I'm running into this as well, and with FlakeAttempts(5) added, that test passes. I can't seem to figure out what the race condition is. I'm deleting and recreating my object JustBefore each test, and I think I'm being pretty thorough about it:
JustAfterEach(func() {
    Expect(k8sClient.Delete(ctx, fooObj)).To(Succeed())
    Eventually(func() bool {
        return errors.IsNotFound(
            k8sClient.Get(ctx, client.ObjectKeyFromObject(fooObj), &foov1alpha1.Foo{}),
        )
    }, timeout, interval).Should(BeTrue())
    Consistently(func() bool {
        return errors.IsNotFound(
            k8sClient.Get(ctx, client.ObjectKeyFromObject(fooObj), &foov1alpha1.Foo{}),
        )
    }, duration, interval).Should(BeTrue())
    By("Deleted the foo object")
})
Thus, the object should be entirely deleted before the next test runs. However, if I run the test by itself or with that FlakeAttempts flag, it works; if I don't, it fails with this error. This definitely feels like a bug somewhere deeper down and not necessarily in the controller.
While I agree it might not be a bug in controller-runtime itself, it definitely seems to be a flaw in the basic testing setup and documented best practices, so any help in debugging this would be hugely helpful.
Just kidding. Turns out wrapping my assertions in an Eventually fixed this. I admit it's still a little disturbing that I'm getting this error in these tests, but I think I've gotten past the main issue.
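In case it helps anyone else, the change was essentially this shape (same Gomega helpers as in the snippet above; the object is a placeholder):

    // Poll until the cluster reaches the expected state instead of asserting
    // on a single Get; this absorbs cache/API-server races in envtest.
    Eventually(func() error {
        return k8sClient.Get(ctx, client.ObjectKeyFromObject(fooObj), &foov1alpha1.Foo{})
    }, timeout, interval).Should(Succeed())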
@futurist @pmalek @wreed4 @vincepri @stijndehaes I also encountered a similar issue: "name":"image-describe","namespace":"heros-user"}, "namespace": "heros-user", "name": "image-describe", "reconcileID": "0dd05b6b-1e4d-42be-ab64-3b6bc7b44b37", "error": "Operation cannot be fulfilled on servers.sy.hedd.ai \"image-describe\": StorageError: invalid object, Code: 4, Key: /registry/sy.hedd.ai/servers/heros-user/image-describe, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: c8e7ef3b-cdf8-4e3c-a725-f42dc4a61568, UID in object meta: ". It seems this issue has not been resolved yet. Can someone tell me what's going on? I have no clue, and sometimes the same state is reconciled twice, even though there are no error messages and requeue is false. It feels very strange.
The closest upstream issue I could find after a quick search is https://github.com/kubernetes/kubernetes/issues/82130 which was fixed/closed a while back.
If this continues, please include the Kubernetes and controller-runtime versions, and more information on when this error occurs. To be clear, so far this issue seems to be outside of controller-runtime, and we haven't been able to reproduce it.
One thing to note: all of these preconditions seem to be failing with an empty UID in object meta (the UID after "UID in object meta:" is blank), which might suggest a stale api-server cache, or a race condition between client and server.
If someone can provide a minimal repo which can reproduce this issue we have a chance to further investigate :)
Just hit this issue this week, and I don't know if it's the same problem everyone else had, but mine seems to be a misleading error message being propagated. I'm unsure whether it's a controller-runtime problem or Kubernetes not propagating the proper error.
Unfortunately, when I created a test case to replicate it, the error changed to something more meaningful (NotFound). Let me try to explain what I did.
Given 2 CRDs:
- Book
- Chapter
When a Book custom resource is created, the controller attempts to do 2 things:
- Create a new namespace
- Create a chapter in the newly created namespace
If you try to set an OwnerReference on the Chapter that points at the Book (which shouldn't be allowed, since they're in different namespaces), you get the UID precondition failure.
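Roughly what the offending part looks like (a sketch, not the actual test case; authorv1 is an assumed alias for the author.tutorial.kubebuilder.io types, and the real code is in the repo linked below):

    // Sketch: the Book lives in namespace A, while the Chapter goes into the
    // freshly created namespace B. Cross-namespace owner references are not
    // allowed, and this is where the UID precondition failure surfaced.
    chapter := &authorv1.Chapter{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "chapter-sample",
            Namespace: newNamespace.Name, // different from book.Namespace
        },
    }
    // Book as owner of a Chapter in another namespace:
    if err := controllerutil.SetControllerReference(book, chapter, r.Scheme); err != nil {
        return err
    }
    if err := r.Create(ctx, chapter); err != nil {
        // Observed: the "UID in precondition ... UID in object meta:" StorageError
        // instead of a clearer validation error.
        return err
    }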
One thing to note, though, is that even if the Chapter isn't created in the test case, the controller for the Chapter custom resource is still triggered with a valid custom resource name.
manager 2024-04-17T11:48:53Z ERROR Reconciler error {"controller": "chapter", "controllerGroup": "author.tutorial.kubebuilder.io", "controllerKind": "Chapter", "Chapter": {"name":"chapter-6hrnr","namespace":"break-frtjw"}, "namespace": "break-frtjw", "name": "chapter-6hrnr", "reconcileID": "a6f9c22a-7bd7-4685-8c2d-97d3bc759b61", "error": "Chapter.author.tutorial.kubebuilder.io \"chapter-6hrnr\" not found"}
manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
manager /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
manager /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
manager /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
Here's the test case if someone is interested: https://github.com/pier-oliviert/uid-precondition-issue
I hope this helps someone.
Folks, this is guaranteed not to be a controller-runtime issue but an issue in main Kubernetes. Please, open a bug report against the kubernetes/kubernetes repository.
Perhaps, but my interface with this problem is controller-runtime. Also, there's a valid argument to be made that people land here looking for a resolution on this specific issue. Whether or not this is a Kubernetes issue isn't relevant to them (or me).
I'll make a test case for the main repo when I have time.
/transfer kubernetes/kubernetes
@alvaroaleman: Something went wrong or the destination repo kubernetes-sigs/kubernetes/kubernetes does not exist.
Opened https://github.com/kubernetes/kubernetes/issues/124347 for you guys. Please continue the conversation there.