operator-sdk

r.Status().Update(ctx,object) is taking time

Open kavyamani596 opened this issue 3 years ago • 6 comments

Question

What did you do?

I have a custom resource whose status I update twice. I have made sure these updates do not trigger another reconcile by adding WithEventFilter(ignoreStatusUpdate()) in the manager setup.
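For reference, ignoreStatusUpdate() is not shown here; a typical generation-based filter for this purpose looks roughly like the simplified sketch below (not necessarily my exact code; controller-runtime also ships predicate.GenerationChangedPredicate, which does essentially the same thing):

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func ignoreStatusUpdate() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			// Status updates do not bump metadata.generation when the CRD has a
			// status subresource, so reconciling only on generation changes
			// skips reconciles triggered by our own status writes.
			return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
		},
	}
}

The relevant part of Reconcile, where the two status updates happen: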

connected := r.checkABCConnection(abcR, abcObj, ctx)
if connected {
	setupLog.Info("ABC Present, moving to next step")
	// I'm getting the updated object before going forward to the next step where I'll update the status object again
	r.getUpdatedABCObject(ctx, req)
	// get other info and patch status
	r.checkABCOtherInfoConnected(abcRc, abcObj, ctx)
}
r.getUpdatedABCObject(ctx, req)


Get Updated Object Code

func (r *Reconciler) getUpdatedABCObject(ctx context.Context, req ctrl.Request) {
	err = r.Get(ctx, req.NamespacedName, abcObj)
	if err != nil {
		setupLog.Error(err, "Error fetching updated Object of ABC")
	}
}

Updating status codes

func (r *Reconciler) updateStatus(abcObj *infraiov1.Abc, ctx context.Context, status string) {
	abcObj.Status.Status = status
	err := r.Status().Update(ctx, abcObj) // Update Func
	if err != nil {
		setupLog.Error(err, "Error: Updating Status")
	}
}

func (r *Reconciler) updateOtherInfoStatus(abcObj *infraiov1.Abc, ctx context.Context, otherInfo map[string]string) {
	abcObj.Status.OtherInfo = otherInfo // update printer column
	err = r.Status().Update(ctx, abcObj)
	if err != nil {
		setupLog.Error(err, "Error: Updating Other Info Status")
	}
}

Abc_types.go file (Status object)

// OdimStatus defines the observed state of Odim
//+kubebuilder:validation:XPreserveUnknownFields
type OdimStatus struct {
	Status    string            `json:"status,omitempty"`
	OtherInfo map[string]string `json:"other,omitempty"`
}

What did you expect to see?

The operator working properly, with status updates succeeding without throwing any errors.

What did you see instead? Under which circumstances?

Sporadically seeing errors like:

2022-07-14T08:51:06Z ERROR ABC Error: Updating OtherInfo Status {"error": "Operation cannot be fulfilled on abc.infra.io.my.domain \"abc\": the object has been modified; please apply your changes to the latest version and try again"} github.com/abc-operator/controllers.(*Reconciler).updateOtherInfoStatus

This error should not occur, since I am fetching the updated object before I go to the OtherInfo step.

Workaround I did:

I added some sleep in the update funcs like:

func (r *Reconciler) updateStatus(abcObj *infraiov1.Abc, ctx context.Context, status string) {
	abcObj.Status.Status = status
	err := r.Status().Update(ctx, abcObj) // Update Func
	if err != nil {
		setupLog.Error(err, "Error: Updating Status")
	}
	sleep(10) // 10sec
}

func (r *Reconciler) updateOtherInfoStatus(abcObj *infraiov1.Abc, ctx context.Context, otherInfo map[string]string) {
	abcObj.Status.OtherInfo = otherInfo // update printer column
	err = r.Status().Update(ctx, abcObj)
	if err != nil {
		setupLog.Error(err, "Error: Updating Other Info Status")
	}
	sleep(10)
}

And with this I see that the object is up to date before the next step is taken.

So is this the expected behaviour of Update()? Will it take some time? If so, how many seconds?

$ go version
go version go1.17.8 linux/amd64

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:46:05Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:04:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

kavyamani596 avatar Jul 14 '22 09:07 kavyamani596

@kavyamani596 This error from the k8s API server can usually be considered benign if it's not interfering with the reconciliation process. The correct sequence of an Update call is to first reach the API server, make the change, and then update the cache. But since this is appearing only sporadically, I would expect this to be throttling in the API server. Here is the reference: https://github.com/kubernetes/kubernetes/issues/28149
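If the conflict does need to be handled rather than ignored, a rough sketch (not a fixed recipe) is to detect it and simply requeue, so the next reconcile works from the latest version. updateStatusOrRequeue is a hypothetical helper using your infraiov1.Abc type:

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
)

// Sketch only: treat an apiserver conflict as "try again later" instead of a hard failure.
func (r *Reconciler) updateStatusOrRequeue(ctx context.Context, abcObj *infraiov1.Abc) (ctrl.Result, error) {
	if err := r.Status().Update(ctx, abcObj); err != nil {
		if apierrors.IsConflict(err) {
			// The object changed between our read and this write; requeue so the
			// next reconcile starts from the latest version.
			return ctrl.Result{Requeue: true}, nil
		}
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}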

varshaprasad96 avatar Jul 18 '22 17:07 varshaprasad96

Hi @varshaprasad96, this is not impacting the reconcile func. The update does take place after that error is thrown: when I do "kubectl get abc", I see the updated status there. But the problem is that, in the code, the updated object is not fetched when I encounter that error, and this causes a problem in the next step, since it depends on the previous update.

And the resource version that changes with each update is handled by Kubernetes itself, right? So from our side, we shouldn't be facing this problem, right?

kavyamani596 avatar Jul 19 '22 05:07 kavyamani596

Hi @kavyamani596,

When we use the client in the controller, we get the state of the resource on the cluster at that moment. After that, if you update the resource, or the k8s API does (some fields are updated automatically by the API), or another process/user does so, the state of the resource on the cluster is no longer the same as what you hold.

What I mean is: if we try to update a resource and the data/state that we have does not match the state on the cluster when we send another update (i.e. the generation/resourceVersion is no longer the same, since k8s updated them), the error `the object has been modified` will be raised.

Therefore, I'd recommend that before you do any new update you fetch the resource again, do the check and the changes, and only then update. If the update fails and you return the error, the whole reconcile will run again, which can take more time (and spend unnecessary resources).
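As a rough sketch of that read-modify-write pattern (wrapped in client-go's retry.RetryOnConflict so an occasional conflict is retried in place rather than failing the reconcile; setStatus is a hypothetical helper, not your existing code):

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
)

// Sketch: fetch the latest version right before the status write and retry the
// whole read-modify-write if a conflict still happens.
func (r *Reconciler) setStatus(ctx context.Context, key types.NamespacedName, status string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		latest := &infraiov1.Abc{} // your infraiov1 package
		if err := r.Get(ctx, key, latest); err != nil {
			return err
		}
		latest.Status.Status = status
		return r.Status().Update(ctx, latest)
	})
}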

I hope that makes sense and that the above info helps you out.

NIT (less important): your example/snippet also does not make it clear why you are updating the same resource twice. Could you not improve this code to have only one update?

With that, it seems sorted out. If not, please let us know what is missing here so we can help you out.

c/c @varshaprasad96

camilamacedo86 avatar Jul 19 '22 11:07 camilamacedo86

hi @camilamacedo86

Yes, I have done that step.

I think you may have missed that:

    setupLog.Info("ABC Present, moving to next step")
r.getUpdatedABCObject(ctx, req)  **// I'm getting the updated object before going forward to next step where I'll 
                                                                   update the status object again**
//get other info and patch status
r.checkABCOtherInfoConnected(abcRc, abcObj, ctx)

So we have a use case where the second update needs to happen after the first, hence we cannot change that sequence.

kavyamani596 avatar Jul 20 '22 02:07 kavyamani596

Hi @kavyamani596,

You say that you are "getting the updated object before going forward to the next step where I'll update the status object again" and are still facing "Error: Updating Other Info Status".

With that in mind:

  • a) Did you troubleshoot to confirm that when you fetch the resource you actually have the changes? You might be retrieving the data before the update has actually been performed on the cluster side. Calling client.Update does not mean it is already done.

  • b) In r.getUpdatedABCObject(ctx, req), are you checking whether an error occurred and whether the fetch succeeded (see the sketch after this list)? Did you try adding a sleep before retrieving the data and doing the second update, to see whether you then actually get the latest state on the cluster?

  • c) Did you check whether the cluster has other operators or processes that are watching this resource and making changes to it?
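For (b), something along these lines (just a sketch, not your existing code) would let the caller see whether the fetch actually succeeded before doing the second update:

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Sketch: return the freshly fetched object and the error so the caller can
// decide whether it is safe to proceed with the next status update.
func (r *Reconciler) getUpdatedABCObject(ctx context.Context, req ctrl.Request) (*infraiov1.Abc, error) {
	abcObj := &infraiov1.Abc{}
	if err := r.Get(ctx, req.NamespacedName, abcObj); err != nil {
		setupLog.Error(err, "Error fetching updated Object of ABC")
		return nil, err
	}
	return abcObj, nil
}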

camilamacedo86 avatar Jul 21 '22 09:07 camilamacedo86

Hello @camilamacedo86 ,

a) Did you troubleshoot to confirm that when you fetch the resource you actually have the changes? You might be retrieving the data before the update has actually been performed on the cluster side. Calling client.Update does not mean it is already done.

Yes, I did add a print log to check the object before proceeding, and there I was able to catch this update issue happening.

So when I added a print log before calling r.getUpdatedABCObject(ctx, req), the status would sometimes be updated and sometimes not.

b) In r.getUpdatedABCObject(ctx, req), are you checking whether an error occurred and whether the fetch succeeded? Did you try adding a sleep before retrieving the data and doing the second update, to see whether you then actually get the latest state on the cluster?

Yes, the error is being checked:

Get Updated Object Code

func (r *Reconciler) getUpdatedABCObject(ctx context.Context, req ctrl.Request) {
	err = r.Get(ctx, req.NamespacedName, abcObj)
	if err != nil {
		setupLog.Error(err, "Error fetching updated Object of ABC") // --> checking this here... I never got this log
	}
}

I added the sleep after I did the update. As soon as the update call returns, the next step is called, so you can think of the update as possibly still happening before the updated object is fetched:

func (r *Reconciler) updateStatus(abcObj *infraiov1.Abc, ctx context.Context, status string) {
	abcObj.Status.Status = status
	err := r.Status().Update(ctx, abcObj) // Update Func
	if err != nil {
		setupLog.Error(err, "Error: Updating Status")
	}
	sleep(10) // 10sec --> Here
}
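If a wait really is needed, instead of the fixed sleep I'm considering a short bounded poll of the cache until it reflects the status I just wrote, roughly like this sketch (waitForStatus is a hypothetical helper, not existing code):

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical sketch: poll the cached client until the status we just wrote is
// visible, or give up after a short timeout, instead of always sleeping 10s.
func (r *Reconciler) waitForStatus(ctx context.Context, req ctrl.Request, want string) error {
	return wait.PollImmediate(500*time.Millisecond, 10*time.Second, func() (bool, error) {
		latest := &infraiov1.Abc{}
		if err := r.Get(ctx, req.NamespacedName, latest); err != nil {
			return false, nil // transient read error: keep polling
		}
		return latest.Status.Status == want, nil
	})
}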

c) Did you check whether the cluster has other operators or processes that are watching this resource and making changes to it?

Yes, currently we have only one operator in our cluster, and this is it.

kavyamani596 avatar Jul 21 '22 12:07 kavyamani596

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Oct 20 '22 01:10 openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot avatar Nov 19 '22 08:11 openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot avatar Dec 20 '22 00:12 openshift-bot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] avatar Dec 20 '22 00:12 openshift-ci[bot]