test-infra
Sidecar retry upload results in empty started.json and finished.json
What happened:
Prow jobs finished, but Spyglass didn't think so; the reason was that both started.json and finished.json were empty.
What you expected to happen:
started.json and finished.json should not be empty.
Please provide links to example occurrences, if any:
Example: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/post-test-infra-push-prow/1560393653917061120/
Anything else we need to know?:
This was discussed in https://kubernetes.slack.com/archives/C09QZ4DQB/p1660864464101909?thread_ts=1660813165.240589&cid=C09QZ4DQB, and my suspicion was:
it's possible that at https://github.com/kubernetes/test-infra/blob/d59b80f8f50f0e48a44b8870dfbdf4c3f107791d/prow/sidecar/run.go#L282, the marshalled JSON object is fed into bytes.NewBuffer(), and an inner retry of the upload reuses the same reader, which doesn't reset its byte offset to 0 before retrying, so the bug is in the retry.
This hypothesis was later confirmed by PR https://github.com/kubernetes/test-infra/pull/27192.
/sig testing
The io.Reader interface has only a single method, Read, so there is no way to reset the byte offset back to 0 before the next retry. In other words, I unfortunately don't have a good idea for fixing this bug.
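A minimal sketch (not the actual sidecar/uploader code) of why a retry that reuses the same io.Reader writes nothing on the second attempt: once the buffer from bytes.NewBuffer() has been drained by the first attempt, every subsequent Read returns io.EOF, so the retried upload sees an empty body.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// uploadOnce is a stand-in for a single GCS write attempt: it drains the
// reader and pretends the first attempt fails (e.g. with a 403).
func uploadOnce(attempt int, src io.Reader) (int, error) {
	body, _ := io.ReadAll(src) // drains src regardless of the outcome
	if attempt == 0 {
		return len(body), errors.New("403 Forbidden")
	}
	return len(body), nil
}

func main() {
	// The marshalled finished.json bytes, wrapped exactly once in a buffer.
	src := bytes.NewBuffer([]byte(`{"result":"SUCCESS"}`))

	for attempt := 0; attempt < 2; attempt++ {
		n, err := uploadOnce(attempt, src)
		fmt.Printf("attempt %d: wrote %d bytes, err=%v\n", attempt, n, err)
	}
	// attempt 0: wrote 20 bytes, err=403 Forbidden
	// attempt 1: wrote 0 bytes, err=<nil>   <- the "successful" retry uploads nothing
}
```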
I have a case where started.json is correct, but finished.json is empty. Is it the same bug?
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-csi-1-22-on-kubernetes-1-23/1559492392854228992
I have a case where started.json is correct, but finished.json is empty. Is it the same bug? https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-csi-1-22-on-kubernetes-1-23/1559492392854228992
Yes, it is.
A little more context: this job uses a GCP service account key file to auth with GCS for uploading both JSON files, and when the service account key is expired, the symptom is flaky instead of failing right away. So it's unpredictable when the GCS operations will succeed or fail. In this job, very likely:
- write started.json -> success
- write finished.json -> failed with 403
- retry write finished.json -> success (but with an empty byte stream)
when the service account key is expired, the symptom is flaky instead of failing right away
I was thinking about this too yesterday and it was unclear to me why (from an authentication standpoint) the 2nd upload succeeds at all. That is, it would appear that GCS sometimes allows (and sometimes rejects) attempts to use an expired service account key, which seems really really wrong and broken. But maybe I'm misunderstanding something here...
@listx worth bringing up with GCS folks i think!
FWIW there is another documented case of where GCS permissions have been flaky: https://github.com/kubeflow/pipelines/issues/403.
Quoted from @alvaroaleman
My understanding is that this is pretty much by design, i.e. reading from an io.Reader again is not going to give you the content again.
This issue is the reason why, for example, HashiCorp's retryablehttp doesn't pass around io.Readers but what they call a reader func, which is a constructor for an io.Reader: https://github.com/hashicorp/go-retryablehttp/blob/493aa4cf372e47ec00228558d9d91a9517b045a5/client.go#L104
Maybe we can do something similar? It shouldn't be too difficult; basically changing the src arg in DataUpload from io.Reader to func() (io.ReadCloser, error), I think.
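A rough sketch of what that change could look like (hypothetical names; uploadWithRetry and readerFunc below are illustrative stand-ins, not the real DataUpload signature): instead of handing the retry loop an io.Reader, hand it a constructor so each attempt gets a fresh reader over the same underlying bytes.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// readerFunc mirrors the go-retryablehttp idea: a constructor that returns a
// fresh reader for every retry attempt.
type readerFunc func() (io.ReadCloser, error)

// uploadWithRetry is a hypothetical retry wrapper; the proposed fix would pass
// something like readerFunc instead of a plain io.Reader into the upload path.
func uploadWithRetry(attempts int, newReader readerFunc, do func(io.Reader) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		var src io.ReadCloser
		if src, err = newReader(); err != nil {
			return err
		}
		err = do(src)
		src.Close()
		if err == nil {
			return nil
		}
	}
	return err
}

func main() {
	content := []byte(`{"result":"SUCCESS"}`)

	// Each attempt gets its own reader over the same marshalled JSON,
	// so a retry never starts from a drained buffer.
	newReader := func() (io.ReadCloser, error) {
		return io.NopCloser(bytes.NewReader(content)), nil
	}

	attempt := 0
	err := uploadWithRetry(3, newReader, func(r io.Reader) error {
		body, _ := io.ReadAll(r)
		attempt++
		fmt.Printf("attempt %d uploads %d bytes\n", attempt, len(body))
		if attempt == 1 {
			return errors.New("403 Forbidden") // simulate the flaky first write
		}
		return nil
	})
	fmt.Println("final error:", err)
	// attempt 1 uploads 20 bytes
	// attempt 2 uploads 20 bytes
	// final error: <nil>
}
```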
I agree with the approach mentioned above. Anyone who is interested in working on this, please feel free to grab it from me.
/help
@chaodaiG: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
Agree with the approach mentioned above, anyone who is interested in working on this please feel free to grab it from me /help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@chaodaiG I tried #27355
@amirrmonfared: You can't close an active issue/PR unless you authored it or you are a collaborator.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/close
Issue is resolved by https://github.com/kubernetes/test-infra/pull/27355
@amirrmonfared: You can't close an active issue/PR unless you authored it or you are a collaborator.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/close
@alvaroaleman: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.