test-infra icon indicating copy to clipboard operation
test-infra copied to clipboard

Sidecar retry upload results in empty started.json and finished.json

Open chaodaiG opened this issue 3 years ago • 10 comments

What happened:

Prow jobs finished but spyglass doesn't think so, the reason was the both started.json and finished.json were empty

What you expected to happen:

no empty started.json and finished.json

Please provide links to example occurrences, if any:

Example: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/post-test-infra-push-prow/1560393653917061120/

Anything else we need to know?:

This was discussed in https://kubernetes.slack.com/archives/C09QZ4DQB/p1660864464101909?thread_ts=1660813165.240589&cid=C09QZ4DQB, and my suspicion was:

it’s possible that at https://github.com/kubernetes/test-infra/blob/d59b80f8f50f0e48a44b8870dfbdf4c3f107791d/prow/sidecar/run.go#L282, the Marshalled json object was fed into bytes.NewBuffer(), and an inner retry of upload reused the same reader and the reader didn’t reset byte offset to 0 before retrying, so the bug is in retry.

This hypothesis was later on proved by a PR https://github.com/kubernetes/test-infra/pull/27192

chaodaiG avatar Aug 19 '22 04:08 chaodaiG

/sig testing the io.Reader interface has only a single method of Read, so there is no way to reset the bytes offset back to 0 before next retry. In another word, I don't have a good idea of fixing this bug unfortunately.

chaodaiG avatar Aug 19 '22 04:08 chaodaiG

I have a case where started.json is correct, but finished.json is empty. Is it the same bug? https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-csi-1-22-on-kubernetes-1-23/1559492392854228992

jsafrane avatar Aug 19 '22 09:08 jsafrane

I have a case where started.json is correct, but finished.json is empty. Is it the same bug? https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-csi-1-22-on-kubernetes-1-23/1559492392854228992

yes it is.

A little more context: this job uses a GCP service account key file to auth with GCS for uploading both json files, and when the service account key is expired, the symptom is flaky instead of failing right away. So it's unpredictable when the GCS operations would success or fail. In this job it very likely:

  • write started.json -> success
  • write finished.json -> failed with 403
  • retry write finished.json -> success (but with an empty byte stream)

chaodaiG avatar Aug 19 '22 14:08 chaodaiG

when the service account key is expired, the symptom is flaky instead of failing right away

I was thinking about this too yesterday and it was unclear to me why (from an authentication standpoint) the 2nd upload succeeds at all. That is, it would appear that GCS sometimes allows (and sometimes rejects) attempts to use an expired service account key, which seems really really wrong and broken. But maybe I'm misunderstanding something here...

listx avatar Aug 20 '22 00:08 listx

@listx worth bringing up with GCS folks i think!

dims avatar Aug 20 '22 17:08 dims

FWIW there is another documented case of where GCS permissions have been flaky: https://github.com/kubeflow/pipelines/issues/403.

listx avatar Aug 20 '22 18:08 listx

Quoted from @alvaroaleman

my understanding is that this is pretty much by design, i.E. reading from an io.Reader again is not going to give you the content again.

The issue here is the reason why for example hashicorps retyrable http doesn't pass around io.Readers but what they call a reader func, which is a constructor for an io.Reader: https://github.com/hashicorp/go-retryablehttp/blob/493aa4cf372e47ec00228558d9d91a9517b045a5/client.go#L104

Maybe we can do something similiar? Shouldn't be too difficult, basically changing the src arg in DataUpload from io.Reader to func() (io.ReadCloser, error) I think

chaodaiG avatar Aug 22 '22 22:08 chaodaiG

Agree with the approach mentioned above, anyone who is interested in working on this please feel free to grab it from me /help

chaodaiG avatar Aug 22 '22 23:08 chaodaiG

@chaodaiG: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

Agree with the approach mentioned above, anyone who is interested in working on this please feel free to grab it from me /help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 22 '22 23:08 k8s-ci-robot

@chaodaiG I tried #27355

danilo-gemoli avatar Sep 02 '22 12:09 danilo-gemoli

@amirrmonfared: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 08 '23 17:01 k8s-ci-robot

/close issue is resolved by https://github.com/kubernetes/test-infra/pull/27355

amirrmonfared avatar Jan 08 '23 17:01 amirrmonfared

@amirrmonfared: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 08 '23 18:01 k8s-ci-robot

/close

alvaroaleman avatar Jan 08 '23 18:01 alvaroaleman

@alvaroaleman: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 08 '23 18:01 k8s-ci-robot