Add logs reporting to submit-upgrade-test-cloud-build for better visibility
What type of PR is this?
/kind cleanup
What this PR does / Why we need it:
Which issue(s) this PR fixes:
Closes #4163
Special notes for your reviewer:
Build Failed :sob:
Build Id: 944b0a49-63f1-4ec5-841b-2586ca33a489
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed :sob:
Build Id: b5ecf3aa-2b70-4515-beb7-830a0fbd96df
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed :sob:
Build Id: c605097d-0c93-4011-8873-71f2b8cbb039
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed :sob:
Build Id: 58b3ccdb-181b-49d1-a136-c35cb6fd8ffe
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Regarding the jq build issue: it seems that the submit-upgrade-test-cloud-build step in cloudbuild.yaml uses the image gcr.io/google.com/cloudsdktool/cloud-sdk. It doesn't use the Dockerfile at test/upgrade/Dockerfile but applies the files directly with kubectl. If we want to use jq, maybe we can install it in the args of submit-upgrade-test-cloud-build?
I'm not fully sure about it, but from what I see, jq is properly installed in the Docker image for push-upgrade-test, but it's only used by the test itself: https://github.com/googleforgames/agones/blob/97b07cc3444dfc20bddbea5d9d7061b880292891/test/upgrade/upgradeTest.yaml#L29
Got it! Installing jq in the args section.
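For reference, a minimal sketch of what a guarded jq install in the step's args could look like. It assumes the gcr.io/google.com/cloudsdktool/cloud-sdk image is Debian-based with apt-get available, which has not been verified here for every tag. The ensure_jq and install_jq names are hypothetical and do not come from cloudbuild.yaml:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: install jq only when it is missing from the image.
# Assumption: the image is Debian-based, so apt-get is available.
install_jq() {
  apt-get update -qq && apt-get install -y -qq jq
}

ensure_jq() {
  # command -v succeeds when jq is already resolvable in this shell.
  if command -v jq >/dev/null 2>&1; then
    echo "jq already installed"
    return 0
  fi
  echo "jq not found, installing"
  install_jq
}
```

Calling ensure_jq once at the top of the step's script would skip the install whenever the image already ships jq, so the step keeps working if the base image changes.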
Build Failed :sob:
Build Id: f25955e4-5711-4d6a-8dae-bccf6032af92
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed :sob:
Build Id: b404b589-c45b-41bb-b09c-0ec7a655aabe
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed :sob:
Build Id: 3faa7b62-2ab3-48e0-a03e-1e5961ebd386
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Succeeded :partying_face:
Build Id: 8d22703b-c2d3-4183-adfd-9cdb11f3eccc
The following development artifacts have been built, and will exist for the next 30 days:
- image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.49.0-dev-9ce27fd
- image: us-docker.pkg.dev/agones-images/ci/agones-extensions:1.49.0-dev-9ce27fd
- image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.49.0-dev-9ce27fd-linux
- image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.49.0-dev-9ce27fd
- image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.49.0-dev-9ce27fd
- Linux C++ SDK (build): agonessdk-1.49.0-dev-9ce27fd-linux-arch_64.tar.gz
- SDK Server: agonessdk-server-1.49.0-dev-9ce27fd.zip
A preview of the website (the last 30 builds are retained):
- https://9ce27fd-dot-preview-dot-agones-images.appspot.com/
To install this version:
git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.49.0-dev-9ce27fd
Build Failed :sob:
Build Id: 54d61520-601e-400e-9fc9-a4f622905000
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Succeeded :partying_face:
Build Id: 16eec698-e005-4b41-a70b-29791ac18e51
The following development artifacts have been built, and will exist for the next 30 days:
- image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.49.0-dev-1ec611f
- image: us-docker.pkg.dev/agones-images/ci/agones-extensions:1.49.0-dev-1ec611f
- image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.49.0-dev-1ec611f-linux
- image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.49.0-dev-1ec611f
- image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.49.0-dev-1ec611f
- Linux C++ SDK (build): agonessdk-1.49.0-dev-1ec611f-linux-arch_64.tar.gz
- SDK Server: agonessdk-server-1.49.0-dev-1ec611f.zip
A preview of the website (the last 30 builds are retained):
- https://1ec611f-dot-preview-dot-agones-images.appspot.com/
To install this version:
git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.49.0-dev-1ec611f
These lines are also not capturing the logs: https://github.com/0xaravindh/agones/blob/1ec611fb8ff775ca5bceff60a97fd0b487eff285/cloudbuild.yaml#L388-L394
===== Job status for upgrade-test-runner on cluster standard-upgrade-test-cluster-1-32 =====
{
"active": 1,
"ready": 1,
"startTime": "2025-04-29T06:44:42Z",
"terminating": 0,
"uncountedTerminatedPods": {}
}
===== Extracting job conditions (if available) =====
So I added a script that keeps checking the job in a loop and waits until the job finishes; then we can capture the logs. What do you think about it, @lacroixthomas?
# Monitor the job status continuously
while true; do
  # Fetch job conditions array length
  condition_count=$(kubectl get job/upgrade-test-runner -o json | jq '.status.conditions | length // 0')
  if (( condition_count > 1 )); then
    echo "===== Extracted job conditions ====="
    job_conditions=$(kubectl get job/upgrade-test-runner -o json | jq -c '.status.conditions')
    echo "Job conditions output: $job_conditions"
    break
  fi
  sleep 10s
done
echo "===== Extracting job conditions (if available) ====="
job_conditions=$(kubectl get job/upgrade-test-runner -o json | jq -c 'if .status.conditions then .status.conditions[] | select(.status=="True") | {type, reason, message} else empty end')
echo "Job conditions output: $job_conditions"
Hello @0xaravindh, it sounds like a good solution. I would probably try to avoid an infinite loop with a break though: if for some reason condition_count never goes up, it would be stuck.
What do you think about something along these lines? It would not be blocking, the job would still be available and not yet deleted, we'd keep the design around tracking the pid, and we could add anything related to printing logs.
function logJobStatus() {
  echo "===== Job status for upgrade-test-runner on cluster ${testCluster} =====" >&2
  kubectl get job/upgrade-test-runner -o json | jq '.status' > "${tmpdir}/${testCluster}-job-status.json"
  cat "${tmpdir}/${testCluster}-job-status.json" >&2
  echo "===== Extracting job conditions (if available) =====" >&2
  job_conditions=$(kubectl get job/upgrade-test-runner -o json | jq -c 'if .status.conditions then .status.conditions[] | select(.status=="True") | {type, reason, message} else empty end')
  echo "Job conditions output: $job_conditions" >&2
}
...
kubectl wait job/upgrade-test-runner ... | tee "${tmpdir}"/"${testCluster}".log && logJobStatus &
waitPid=$!
pids+=( "$waitPid" )
waitPids[$waitPid]="${tmpdir}"/"${testCluster}".log
...
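On the infinite-loop concern, if polling is still wanted, one option is to bound it with a deadline. This is only a sketch: wait_until, its arguments, and the values in the usage comment are hypothetical and not taken from cloudbuild.yaml.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a probe command until it succeeds or a deadline passes.
# The elapsed time is approximate: it counts sleep time only, not probe time.
wait_until() {
  local timeout_s=$1   # give up once this many seconds of sleeping have passed
  local interval_s=$2  # pause between attempts, in seconds
  shift 2
  local elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "timed out after ${elapsed}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Illustrative usage (assumed values, not from the PR):
#   job_has_conditions() {
#     [ "$(kubectl get job/upgrade-test-runner -o json | jq '.status.conditions | length')" -gt 1 ]
#   }
#   wait_until 1800 10 job_has_conditions || echo "job never reported conditions" >&2
```

Because the helper returns non-zero on timeout, the caller can still print whatever status is available instead of hanging the build.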
Build Failed :sob:
Build Id: 273a7c8f-2d14-45b6-9477-66e9436828c8
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed :sob:
Build Id: 24638fd0-bbd9-4a31-a830-8e13be4195d4
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Succeeded :partying_face:
Build Id: bf75e1dd-a5d9-432f-8de8-237b5e3143c9
The following development artifacts have been built, and will exist for the next 30 days:
- image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.49.0-dev-078cf50
- image: us-docker.pkg.dev/agones-images/ci/agones-extensions:1.49.0-dev-078cf50
- image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.49.0-dev-078cf50-linux
- image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.49.0-dev-078cf50
- image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.49.0-dev-078cf50
- Linux C++ SDK (build): agonessdk-1.49.0-dev-078cf50-linux-arch_64.tar.gz
- SDK Server: agonessdk-server-1.49.0-dev-078cf50.zip
A preview of the website (the last 30 builds are retained):
- https://078cf50-dot-preview-dot-agones-images.appspot.com/
To install this version:
git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.49.0-dev-078cf50
Build Failed :sob:
Build Id: 6d5d02b8-1989-4429-8ab2-00d7b6ad5094
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Succeeded :partying_face:
Build Id: 9cb95696-4705-49d0-bfa0-d5a39cab7557
The following development artifacts have been built, and will exist for the next 30 days:
- image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.50.0-dev-29cc00c
- image: us-docker.pkg.dev/agones-images/ci/agones-extensions:1.50.0-dev-29cc00c
- image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.50.0-dev-29cc00c-linux
- image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.50.0-dev-29cc00c
- image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.50.0-dev-29cc00c
- Linux C++ SDK (build): agonessdk-1.50.0-dev-29cc00c-linux-arch_64.tar.gz
- SDK Server: agonessdk-server-1.50.0-dev-29cc00c.zip
A preview of the website (the last 30 builds are retained):
- https://29cc00c-dot-preview-dot-agones-images.appspot.com/
To install this version:
git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.50.0-dev-29cc00c
@igooch All upgrade test jobs completed successfully across all clusters, as shown in the logs:
Reading output from log file: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-31.log:
{
"lastProbeTime": "2025-05-12T11:47:03Z",
"lastTransitionTime": "2025-05-12T11:47:03Z",
"message": "Reached expected number of succeeded pods",
"reason": "CompletionsReached",
"status": "True",
"type": "SuccessCriteriaMet"
}
Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-31.log
{"lastProbeTime":"2025-05-12T11:49:37Z","lastTransitionTime":"2025-05-12T11:49:37Z","status":"True","type":"Complete"}Reading output from log file: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-30.log:
{
"lastProbeTime": "2025-05-12T11:49:37Z",
"lastTransitionTime": "2025-05-12T11:49:37Z",
"status": "True",
"type": "Complete"
}
Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-30.log
Reading output from log file: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-32.log:
{
"lastProbeTime": "2025-05-12T11:39:46Z",
"lastTransitionTime": "2025-05-12T11:39:46Z",
"message": "Reached expected number of succeeded pods",
"reason": "CompletionsReached",
"status": "True",
"type": "SuccessCriteriaMet"
}
{"lastProbeTime":"2025-05-12T11:49:37Z","lastTransitionTime":"2025-05-12T11:49:37Z","message":"Reached expected number of succeeded pods","reason":"CompletionsReached","status":"True","type":"SuccessCriteriaMet"}Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-32.log
Reading output from log file: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-31.log:
{
"lastProbeTime": "2025-05-12T11:49:37Z",
"lastTransitionTime": "2025-05-12T11:49:37Z",
"message": "Reached expected number of succeeded pods",
"reason": "CompletionsReached",
"status": "True",
"type": "SuccessCriteriaMet"
}
Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-31.log
{"lastProbeTime":"2025-05-12T11:51:01Z","lastTransitionTime":"2025-05-12T11:51:01Z","status":"True","type":"Complete"}Reading output from log file: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-30.log:
{
"lastProbeTime": "2025-05-12T11:51:01Z",
"lastTransitionTime": "2025-05-12T11:51:01Z",
"status": "True",
"type": "Complete"
}
Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-30.log
Reading output from log file: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-32.log:
{
"lastProbeTime": "2025-05-12T11:41:29Z",
"lastTransitionTime": "2025-05-12T11:41:29Z",
"message": "Reached expected number of succeeded pods",
"reason": "CompletionsReached",
"status": "True",
"type": "SuccessCriteriaMet"
}
Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-32.log
End of Upgrade Tests
Build Failed :sob:
Build Id: eb5a5482-12b5-420e-88b2-57d7173cf485
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
/gcbrun
Build Failed :sob:
Build Id: 0f1dfda2-1ab9-4b52-91c4-febd996e53eb
Status: FAILURE
To get permission to view the Cloud Build view, join the agones-discuss Google Group.
/gcbrun
Build Succeeded :partying_face:
Build Id: e86f1a04-476b-488f-b03c-e2692aa4e296
The following development artifacts have been built, and will exist for the next 30 days:
- image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.50.0-dev-0afd074
- image: us-docker.pkg.dev/agones-images/ci/agones-extensions:1.50.0-dev-0afd074
- image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.50.0-dev-0afd074-linux
- image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.50.0-dev-0afd074
- image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.50.0-dev-0afd074
- Linux C++ SDK (build): agonessdk-1.50.0-dev-0afd074-linux-arch_64.tar.gz
- SDK Server: agonessdk-server-1.50.0-dev-0afd074.zip
A preview of the website (the last 30 builds are retained):
- https://0afd074-dot-preview-dot-agones-images.appspot.com/
To install this version:
git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.50.0-dev-0afd074
Testing locally with a forced failure gives the output below, which does not provide much additional information for debugging:
Unexpected job status: 'FailureTarget' with message: 'Job has reached the specified backoff limit' in log /tmp/tmp.KkKQMyzD6o/std-agones.log
}
"type": "FailureTarget"
"status": "True",
"reason": "BackoffLimitExceeded",
"message": "Job has reached the specified backoff limit",
"lastTransitionTime": "2025-05-19T22:46:12Z",
"lastProbeTime": "2025-05-19T22:46:12Z",
{
{"lastProbeTime":"2025-05-19T22:46:12Z","lastTransitionTime":"2025-05-19T22:46:12Z","message":"Job has reached the specified backoff limit","reason":"BackoffLimitExceeded","status":"True","type":"FailureTarget"}Reading output from log file: /tmp/tmp.KkKQMyzD6o/std-agones.log:
Wait for job upgrade-test-runner to complete or fail on cluster std-agones
The logs that will be helpful for debugging are the container logs for the sdk-client-test and upgrade-test-controller test containers. Could you try updating this to see if we can get the container logs output instead of, or in addition to, the job output?
If that doesn't work, getting the container logs from Logs Explorer into Cloud Build might require a logs sink (https://cloud.google.com/logging/docs/export/configure_export_v2) with an inclusion filter like resource.labels.container_name="sdk-client-test" OR resource.labels.container_name="upgrade-test-controller", which may require setting up a new bucket. Managing another bucket, linking to it, making sure each log only contains the container logs for a single test, and probably other considerations would be quite a bit more complex, so that route is not preferred.
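As a starting point for the first option, a rough sketch of printing the container logs with kubectl. It assumes both containers run in a pod owned by job/upgrade-test-runner; if sdk-client-test lives in a different pod, a label selector would be needed instead. The dump_container_logs name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the full logs of the named containers for a job.
# Assumption: the containers belong to a pod owned by the given job.
dump_container_logs() {
  local job=$1
  shift
  local container
  for container in "$@"; do
    echo "===== Logs for container ${container} in ${job} ====="
    # --tail=-1 asks for the whole log; a failure is reported but does not abort.
    kubectl logs "${job}" -c "${container}" --tail=-1 \
      || echo "could not fetch logs for ${container}" >&2
  done
}
```

It could be called as dump_container_logs job/upgrade-test-runner sdk-client-test upgrade-test-controller before the job is deleted, so the output lands directly in the Cloud Build log.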
Just poking on this 😄 would love to get some movement on this flaky test.
Not sure if it's the K8s upgrade or the Go upgrade, but this seems way worse now 😬
I'm working on this feature about Route logs and started adding some logs to help figure out the issue. I was busy the past few weeks, but now I have more time to focus on it and will try to fix it soon.
@markmandel Let me know if you have any suggestions, or if you have any other ideas that could help resolve this issue!
Honestly.. I've barely an idea of what this code does 😁