
[bug] StorageError: invalid object

stijndehaes opened this issue 3 years ago

Hi All,

We are seeing a lot of error logs when running our operator in production, and we can't figure out what is happening. The basic flow of what we do goes as follows:

func (r *SparkReconciler) createSparkSubmitterPod(ctx context.Context, spark *runtimev1.Spark) error {
	var pod corev1.Pod
	err := r.Get(ctx, client.ObjectKey{
		Name:      submitterPodName(spark),
		Namespace: spark.Namespace,
	}, &pod)
	if apierrors.IsNotFound(err) {
		r.Log.V(1).Info("Creating the spark submitter pod")
		return r.doCreateSparkSubmitterPod(ctx, spark)
	}
	// errors.Wrap returns nil when err is nil, i.e. when the pod was found.
	return errors.Wrap(err, "Failed getting the pod")
}

As you can see in the code snippet, we create the pod only when it does not exist. Because we read from the cache, we know the Get call can return NotFound even when the pod exists. We then do the following in doCreateSparkSubmitterPod:

if err := r.Client.Create(ctx, pod); err != nil {
	if apierrors.IsAlreadyExists(err) {
		return nil
	}
	return errors.Wrap(err, "Failed creating the pod")
}

So we handle the AlreadyExists error. However, we still see a lot of errors like this in production:

"error":"Operation cannot be fulfilled on pods \"255f70af-5699-46c6-8002-a4df45af5209\": StorageError: invalid object, Code: 4, Key: /registry/pods/addatatest2/255f70af-5699-46c6-8002-a4df45af5209, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: f8104e60-1a9d-40ca-847e-52bc0f556844, UID in object meta: "

We have no clue how to proceed on this. I think it might happen when the pod already exists, but I'm not entirely sure. Perhaps someone can shed some light on this?

stijndehaes avatar Apr 28 '22 15:04 stijndehaes

I also had the same problem

anurnomeru avatar May 13 '22 00:05 anurnomeru

I also hit this problem while operating on a CRD

Icarus9913 avatar Jul 06 '22 07:07 Icarus9913

Looks like an error from the API server trying to store the object in etcd. Maybe you should create an issue in k/k for sig-api-machinery.

FillZpp avatar Jul 06 '22 09:07 FillZpp

This may be due to the ResourceVersion field

yhxjack avatar Sep 09 '22 06:09 yhxjack

Is there any update on this?

chenliu1993 avatar Sep 24 '22 09:09 chenliu1993

Same problem, any update?

vincent-pli avatar Sep 29 '22 01:09 vincent-pli

Same problem here, any update on this?

futurist avatar Nov 25 '22 02:11 futurist

I am not that familiar with the problem, but I can offer an opinion. Pods have a graceful termination period; during that time, the pod is updated with a DeletionTimestamp but still exists.

You might guard against that by checking pod.ObjectMeta.DeletionTimestamp.IsZero()
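
For illustration, a minimal sketch of that guard (the helper name and import wiring are illustrative, not from the operator's code):

import (
	corev1 "k8s.io/api/core/v1"
)

// podIsTerminating reports whether the pod still exists but is already in its
// graceful termination period: the DeletionTimestamp is set, yet the object
// has not been removed, so creating a replacement with the same name fails.
func podIsTerminating(pod *corev1.Pod) bool {
	return !pod.ObjectMeta.DeletionTimestamp.IsZero()
}

In createSparkSubmitterPod above, you could call this after a successful Get and requeue instead of treating the existing pod as healthy.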

I hope this helps :relaxed:

nikola-jokic avatar Dec 12 '22 20:12 nikola-jokic

This isn't a controller runtime bug or issue

/close

vincepri avatar Feb 02 '23 21:02 vincepri

@vincepri: Closing this issue.

In response to this:

This isn't a controller runtime bug or issue

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 02 '23 21:02 k8s-ci-robot

I'd suggest opening an issue in kubernetes/kubernetes instead if the problem persists

vincepri avatar Feb 02 '23 21:02 vincepri

Same problem, any update?

Akiqqqqqqq avatar Feb 10 '23 10:02 Akiqqqqqqq

Hey @Akiqqqqqqq, I'll try to help here

When you set up your controller, try using: WithEventFilter(predicate.ResourceVersionChangedPredicate{})
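
For context, that filter is applied when building the controller. A minimal sketch, assuming a reconciler r on pods and a manager mgr as in the original snippet:

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// SetupWithManager wires the reconciler and drops update events whose
// ResourceVersion is unchanged, e.g. periodic cache resyncs.
func (r *SparkReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		WithEventFilter(predicate.ResourceVersionChangedPredicate{}).
		Complete(r)
}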

nikola-jokic avatar Feb 10 '23 10:02 nikola-jokic

@nikola-jokic Can you elaborate on how that could help?

As per https://github.com/kubernetes-sigs/controller-runtime/issues/1740#issuecomment-992062994 this is meant to filter out the resync events (which come with objects with the same ResourceVersion as already observed).

pmalek avatar Mar 03 '23 10:03 pmalek

I'm running into this as well, and with FlakeAttempts(5) the test passes. I can't seem to figure out what the race condition is. I'm deleting and recreating my object in a JustBeforeEach for each test, and I think I'm being pretty thorough about it:

		JustAfterEach(func() {
			Expect(k8sClient.Delete(ctx, fooObj)).To(Succeed())
			Eventually(func() bool {
				return errors.IsNotFound(
					k8sClient.Get(ctx, client.ObjectKeyFromObject(fooObj), &wgcniv1alpha1.Foo{}),
				)
			}, timeout, interval).Should(BeTrue())
			Consistently(func() bool {
				return errors.IsNotFound(
					k8sClient.Get(ctx, client.ObjectKeyFromObject(fooObj), &foov1alpha1.Foo{}),
				)
			}, duration, interval).Should(BeTrue())

			By("Deleted the foo object")
		})

Thus, the object should be entirely deleted before the next test runs. However, if I run the test by itself, or with that flakiness flag, it works; if I don't, it fails with this error. This definitely feels like a bug somewhere deeper down, not necessarily in the controller.

While I agree it might not be a bug in controller-runtime itself, it definitely seems to be a flaw in the basic testing setup and documented best practices, so any help in debugging this would be hugely helpful.

wreed4 avatar Mar 31 '23 14:03 wreed4

Just kidding. Turns out wrapping my assertions in an Eventually fixed this. I admit it's still a little disturbing that I'm getting this error in these tests, but I think I've gotten past the main issue.
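
For anyone hitting the same thing in envtest, the shape of that fix, sketched (object and matcher details depend on your suite):

// Instead of asserting immediately after a write, poll until the
// API server and the test client agree, using Gomega's Eventually:
Eventually(func(g Gomega) {
	g.Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(fooObj), fooObj)).To(Succeed())
}, timeout, interval).Should(Succeed())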

wreed4 avatar Mar 31 '23 15:03 wreed4

@futurist @pmalek @wreed4 @vincepri @stijndehaes I also encountered a similar issue:

{"name":"image-describe","namespace":"heros-user"}, "namespace": "heros-user", "name": "image-describe", "reconcileID": "0dd05b6b-1e4d-42be-ab64-3b6bc7b44b37", "error": "Operation cannot be fulfilled on servers.sy.hedd.ai \"image-describe\": StorageError: invalid object, Code: 4, Key: /registry/sy.hedd.ai/servers/heros-user/image-describe, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: c8e7ef3b-cdf8-4e3c-a725-f42dc4a61568, UID in object meta: "}

It seems that this issue has not been resolved yet. Can someone tell me what's going on? I have no clue. Also, sometimes the same state is reconciled twice, but there are no error messages and requeue is false. It feels very strange.

helloxjade avatar Mar 26 '24 01:03 helloxjade

The closest upstream issue I could find after a quick search is https://github.com/kubernetes/kubernetes/issues/82130 which was fixed/closed a while back.

If this continues, please include the Kubernetes and controller-runtime versions, and more information on when this error occurs. To be clear, so far this issue seems to be outside of controller-runtime, and we haven't been able to reproduce it.

vincepri avatar Mar 26 '24 14:03 vincepri

One thing to note: it seems that all of these preconditions are failing with "UID in object meta: " left empty. The empty UID might suggest a stale api-server cache, or a race condition between client and server.

vincepri avatar Mar 26 '24 14:03 vincepri

If someone can provide a minimal repo that reproduces this issue, we have a chance to investigate further :)

sbueringer avatar Mar 26 '24 14:03 sbueringer

Just hit this issue this week, and I don't know if it's the same problem everyone else had, but mine seems to be a misleading error type propagating up. I'm unsure whether it's a controller-runtime problem or Kubernetes not propagating the proper error.

Unfortunately, when I created a test case to replicate it, the error changed to something more meaningful (NotFound). Let me try to explain what I did.

Given 2 CRDs:

  • Book
  • Chapter

When a Book custom resource is created, the controller attempts to do 2 things:

  1. Create a new namespace
  2. Create a chapter in the newly created namespace

If you try to set an OwnerReference on the Chapter that points at the Book (which shouldn't be allowed, since they are in different namespaces), you get the UID precondition failure.
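
For illustration, the problematic step amounts to something like this (field values are assumed; see the linked repo below for the real test case):

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// The Book lives in one namespace, the Chapter in the freshly created one.
// The API server accepts this cross-namespace owner reference, but it is
// invalid, and the UID precondition check then fails on the Chapter.
chapter.OwnerReferences = []metav1.OwnerReference{{
	APIVersion: "author.tutorial.kubebuilder.io/v1", // assumed group/version
	Kind:       "Book",
	Name:       book.Name,
	UID:        book.UID,
}}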

One thing to note, though: even if the Chapter isn't created in the test case, the controller for the Chapter custom resource is still triggered with a valid custom resource name.

manager 2024-04-17T11:48:53Z    ERROR    Reconciler error    {"controller": "chapter", "controllerGroup": "author.tutorial.kubebuilder.io", "controllerKind": "Chapter", "Chapter": {"name":"chapter-6hrnr","namespace":"break-frtjw"}, "namespace": "break-frtjw", "name": "chapter-6hrnr", "reconcileID": "a6f9c22a-7bd7-4685-8c2d-97d3bc759b61", "error": "Chapter.author.tutorial.kubebuilder.io \"chapter-6hrnr\" not found"}
manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler                                                                                                                                                                                                                                                                                             
manager     /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
manager     /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
manager     /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227

Here's the test case if someone is interested: https://github.com/pier-oliviert/uid-precondition-issue

I hope this helps someone.

pier-oliviert avatar Apr 17 '24 11:04 pier-oliviert

Folks, this is guaranteed not to be a controller-runtime issue but an issue in main Kubernetes. Please, open a bug report against the kubernetes/kubernetes repository.

alvaroaleman avatar Apr 17 '24 12:04 alvaroaleman

Perhaps, but my interface with this problem is controller-runtime. Also, there's a valid argument to be made that people land here looking for a resolution on this specific issue. Whether or not this is a Kubernetes issue isn't relevant to them (or me).

I'll make a test case for the main repo when I have time.

pier-oliviert avatar Apr 17 '24 12:04 pier-oliviert

/transfer kubernetes/kubernetes

alvaroaleman avatar Apr 17 '24 12:04 alvaroaleman

@alvaroaleman: Something went wrong or the destination repo kubernetes-sigs/kubernetes/kubernetes does not exist.

In response to this:

/transfer kubernetes/kubernetes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 17 '24 12:04 k8s-ci-robot

Opened https://github.com/kubernetes/kubernetes/issues/124347 for you guys. Please continue the conversation there.

alvaroaleman avatar Apr 17 '24 12:04 alvaroaleman