Add logs reporting to submit-upgrade-test-cloud-build for better visibility

Open 0xaravindh opened this issue 8 months ago • 27 comments

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking /kind bug

/kind cleanup

/kind documentation /kind feature /kind hotfix /kind release

What this PR does / Why we need it:

Which issue(s) this PR fixes:

Closes #4163

Special notes for your reviewer:

0xaravindh avatar Apr 28 '25 10:04 0xaravindh

Build Failed :sob:

Build Id: 944b0a49-63f1-4ec5-841b-2586ca33a489

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar Apr 28 '25 11:04 agones-bot

Build Failed :sob:

Build Id: b5ecf3aa-2b70-4515-beb7-830a0fbd96df

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar Apr 28 '25 11:04 agones-bot

Build Failed :sob:

Build Id: c605097d-0c93-4011-8873-71f2b8cbb039

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar Apr 28 '25 12:04 agones-bot

Build Failed :sob:

Build Id: 58b3ccdb-181b-49d1-a136-c35cb6fd8ffe

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar Apr 28 '25 13:04 agones-bot

Regarding the jq build failure: it seems that the submit-upgrade-test-cloud-build job in cloudbuild.yaml uses the image gcr.io/google.com/cloudsdktool/cloud-sdk and doesn't use the Dockerfile at test/upgrade/Dockerfile; it applies the files directly with kubectl. If we want to use jq, maybe we can install it in the args of submit-upgrade-test-cloud-build?

I'm not fully sure about it, but from what I see, jq is properly installed in the Docker image for push-upgrade-test, where it's only used by the test itself: https://github.com/googleforgames/agones/blob/97b07cc3444dfc20bddbea5d9d7061b880292891/test/upgrade/upgradeTest.yaml#L29
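A minimal sketch of what installing jq from the step's args could look like. It assumes the cloud-sdk image is Debian-based and has apt-get available; `ensure_tool` is a hypothetical helper name, not something from cloudbuild.yaml.

```shell
# Hypothetical helper: install a package only when its command is missing.
# Assumes a Debian-based image where apt-get is available and runs as root.
ensure_tool() {
  local tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    # Already on PATH, nothing to do.
    return 0
  fi
  apt-get update -qq && apt-get install -y -qq "$tool"
}

# Example: ensure_tool jq
```

Checking with `command -v` first keeps the step cheap if a later base image already ships jq.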

lacroixthomas avatar Apr 28 '25 13:04 lacroixthomas

Got it! I'm installing jq in the args section.

0xaravindh avatar Apr 28 '25 14:04 0xaravindh

Build Failed :sob:

Build Id: f25955e4-5711-4d6a-8dae-bccf6032af92

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar Apr 28 '25 14:04 agones-bot

Build Failed :sob:

Build Id: b404b589-c45b-41bb-b09c-0ec7a655aabe

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar Apr 28 '25 14:04 agones-bot

Build Failed :sob:

Build Id: 3faa7b62-2ab3-48e0-a03e-1e5961ebd386

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar Apr 28 '25 14:04 agones-bot

Build Succeeded :partying_face:

Build Id: 8d22703b-c2d3-4183-adfd-9cdb11f3eccc

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

  • https://9ce27fd-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.49.0-dev-9ce27fd

agones-bot avatar Apr 28 '25 16:04 agones-bot

Build Failed :sob:

Build Id: 54d61520-601e-400e-9fc9-a4f622905000

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar Apr 28 '25 17:04 agones-bot

Build Succeeded :partying_face:

Build Id: 16eec698-e005-4b41-a70b-29791ac18e51

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

  • https://1ec611f-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.49.0-dev-1ec611f

agones-bot avatar Apr 29 '25 07:04 agones-bot

These lines are also not capturing the logs: https://github.com/0xaravindh/agones/blob/1ec611fb8ff775ca5bceff60a97fd0b487eff285/cloudbuild.yaml#L388-L394

Build link

===== Job status for upgrade-test-runner on cluster standard-upgrade-test-cluster-1-32 =====
{
  "active": 1,
  "ready": 1,
  "startTime": "2025-04-29T06:44:42Z",
  "terminating": 0,
  "uncountedTerminatedPods": {}
}
===== Extracting job conditions (if available) =====

So I added a script that keeps checking the job in a loop and waits until the job finishes; then we can capture the logs. What do you think, @lacroixthomas?

# Monitor the job status continuously
while true; do
  # Fetch job conditions array length
  condition_count=$(kubectl get job/upgrade-test-runner -o json | jq '.status.conditions | length // 0')

  if (( condition_count > 1 )); then
      echo "===== Extracted job conditions ====="
      job_conditions=$(kubectl get job/upgrade-test-runner -o json | jq -c '.status.conditions')
      echo "Job conditions output: $job_conditions"
      break
  fi

  sleep 10s
done

echo "===== Extracting job conditions (if available) ====="
job_conditions=$(kubectl get job/upgrade-test-runner -o json | jq -c 'if .status.conditions then .status.conditions[] | select(.status=="True") | {type, reason, message} else empty end')
echo "Job conditions output: $job_conditions"

0xaravindh avatar May 02 '25 17:05 0xaravindh

Hello @0xaravindh, it sounds like a good solution. I would probably try to avoid an infinite loop with a break though: if for any reason condition_count never goes up, the build would be stuck.

What do you think about something along these lines? It would not be blocking: the job would still be available and not yet deleted. We'd keep the existing design of tracking the PID, and we could add anything related to printing logs there.

function logJobStatus() {
    echo "===== Job status for upgrade-test-runner on cluster ${testCluster} ====="  >&2
    kubectl get job/upgrade-test-runner -o json | jq '.status' > "${tmpdir}/${testCluster}-job-status.json"
    cat "${tmpdir}/${testCluster}-job-status.json" >&2

    echo "===== Extracting job conditions (if available) ====="  >&2
    job_conditions=$(kubectl get job/upgrade-test-runner -o json | jq -c 'if .status.conditions then .status.conditions[] | select(.status=="True") | {type, reason, message} else empty end')
    echo "Job conditions output: $job_conditions"  >&2
}

...
kubectl wait job/upgrade-test-runner ... | tee "${tmpdir}"/"${testCluster}".log && logJobStatus &
waitPid=$!
pids+=( "$waitPid" )
waitPids[$waitPid]="${tmpdir}"/"${testCluster}".log
...
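If polling were kept instead, the loop could be bounded so it cannot hang. This is only a sketch: `poll_until` and `job_has_conditions` are hypothetical names, and the "more than one condition" check simply mirrors the loop posted above.

```shell
# poll_until MAX_ATTEMPTS INTERVAL_SECONDS COMMAND...
# Runs COMMAND until it succeeds, giving up after MAX_ATTEMPTS tries.
poll_until() {
  local max_attempts="$1" interval="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "Timed out after ${max_attempts} attempts" >&2
  return 1
}

# Succeeds once the Job reports more than one condition.
# jq's length returns 0 for null, so a missing conditions array counts as 0.
job_has_conditions() {
  local count
  count=$(kubectl get job/upgrade-test-runner -o json | jq '.status.conditions | length')
  [ "${count:-0}" -gt 1 ]
}

# Example: check every 10 seconds, give up after 180 attempts (about 30 minutes).
# poll_until 180 10 job_has_conditions
```

The attempt limit and interval are placeholders; they would need to match however long the upgrade test is allowed to run.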

lacroixthomas avatar May 02 '25 18:05 lacroixthomas

Build Failed :sob:

Build Id: 273a7c8f-2d14-45b6-9477-66e9436828c8

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar May 05 '25 06:05 agones-bot

Build Failed :sob:

Build Id: 24638fd0-bbd9-4a31-a830-8e13be4195d4

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar May 05 '25 09:05 agones-bot

Build Succeeded :partying_face:

Build Id: bf75e1dd-a5d9-432f-8de8-237b5e3143c9

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

  • https://078cf50-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.49.0-dev-078cf50

agones-bot avatar May 05 '25 11:05 agones-bot

Build Failed :sob:

Build Id: 6d5d02b8-1989-4429-8ab2-00d7b6ad5094

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar May 12 '25 10:05 agones-bot

Build Succeeded :partying_face:

Build Id: 9cb95696-4705-49d0-bfa0-d5a39cab7557

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

  • https://29cc00c-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.50.0-dev-29cc00c

agones-bot avatar May 12 '25 12:05 agones-bot

@igooch All upgrade test jobs completed successfully across all clusters, as shown in the logs:

Reading output from log file: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-31.log:
{
  "lastProbeTime": "2025-05-12T11:47:03Z",
  "lastTransitionTime": "2025-05-12T11:47:03Z",
  "message": "Reached expected number of succeeded pods",
  "reason": "CompletionsReached",
  "status": "True",
  "type": "SuccessCriteriaMet"
}
Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-31.log
{"lastProbeTime":"2025-05-12T11:49:37Z","lastTransitionTime":"2025-05-12T11:49:37Z","status":"True","type":"Complete"}Reading output from log file: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-30.log:
{
  "lastProbeTime": "2025-05-12T11:49:37Z",
  "lastTransitionTime": "2025-05-12T11:49:37Z",
  "status": "True",
  "type": "Complete"
}
Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-30.log
Reading output from log file: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-32.log:
{
  "lastProbeTime": "2025-05-12T11:39:46Z",
  "lastTransitionTime": "2025-05-12T11:39:46Z",
  "message": "Reached expected number of succeeded pods",
  "reason": "CompletionsReached",
  "status": "True",
  "type": "SuccessCriteriaMet"
}
{"lastProbeTime":"2025-05-12T11:49:37Z","lastTransitionTime":"2025-05-12T11:49:37Z","message":"Reached expected number of succeeded pods","reason":"CompletionsReached","status":"True","type":"SuccessCriteriaMet"}Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/standard-upgrade-test-cluster-1-32.log
Reading output from log file: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-31.log:
{
  "lastProbeTime": "2025-05-12T11:49:37Z",
  "lastTransitionTime": "2025-05-12T11:49:37Z",
  "message": "Reached expected number of succeeded pods",
  "reason": "CompletionsReached",
  "status": "True",
  "type": "SuccessCriteriaMet"
}
Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-31.log
{"lastProbeTime":"2025-05-12T11:51:01Z","lastTransitionTime":"2025-05-12T11:51:01Z","status":"True","type":"Complete"}Reading output from log file: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-30.log:
{
  "lastProbeTime": "2025-05-12T11:51:01Z",
  "lastTransitionTime": "2025-05-12T11:51:01Z",
  "status": "True",
  "type": "Complete"
}
Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-30.log
Reading output from log file: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-32.log:
{
  "lastProbeTime": "2025-05-12T11:41:29Z",
  "lastTransitionTime": "2025-05-12T11:41:29Z",
  "message": "Reached expected number of succeeded pods",
  "reason": "CompletionsReached",
  "status": "True",
  "type": "SuccessCriteriaMet"
}
Job completed successfully on cluster associated with log: /tmp/tmp.urDwAbEcBH/gke-autopilot-upgrade-test-cluster-1-32.log
End of Upgrade Tests

0xaravindh avatar May 12 '25 13:05 0xaravindh

Build Failed :sob:

Build Id: eb5a5482-12b5-420e-88b2-57d7173cf485

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar May 19 '25 06:05 agones-bot

/gcbrun

0xaravindh avatar May 19 '25 07:05 0xaravindh

Build Failed :sob:

Build Id: 0f1dfda2-1ab9-4b52-91c4-febd996e53eb

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot avatar May 19 '25 07:05 agones-bot

/gcbrun

0xaravindh avatar May 19 '25 08:05 0xaravindh

Build Succeeded :partying_face:

Build Id: e86f1a04-476b-488f-b03c-e2692aa4e296

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

  • https://0afd074-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.50.0-dev-0afd074

agones-bot avatar May 19 '25 09:05 agones-bot

Testing locally with a forced failure gives the output below, which does not provide much additional information for debugging:

Unexpected job status: 'FailureTarget' with message: 'Job has reached the specified backoff limit' in log /tmp/tmp.KkKQMyzD6o/std-agones.log
}
  "type": "FailureTarget"
  "status": "True",
  "reason": "BackoffLimitExceeded",
  "message": "Job has reached the specified backoff limit",
  "lastTransitionTime": "2025-05-19T22:46:12Z",
  "lastProbeTime": "2025-05-19T22:46:12Z",
{
{"lastProbeTime":"2025-05-19T22:46:12Z","lastTransitionTime":"2025-05-19T22:46:12Z","message":"Job has reached the specified backoff limit","reason":"BackoffLimitExceeded","status":"True","type":"FailureTarget"}Reading output from log file: /tmp/tmp.KkKQMyzD6o/std-agones.log:
Wait for job upgrade-test-runner to complete or fail on cluster std-agones

The logs that will be helpful for debugging are the container logs for the sdk-client-test and upgrade-test-controller test containers. Could you try updating to see if we can get the container logs output instead of or in addition to the job output?
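One possible shape for this, as a sketch only: `dump_container_logs` is a hypothetical helper, and it assumes both containers run in pods that carry the `job-name=upgrade-test-runner` label the Job controller adds. If sdk-client-test runs in a different pod, the selector would need to change.

```shell
# dump_container_logs SELECTOR CONTAINER...
# Prints the full logs of each named container from pods matching SELECTOR.
# Output goes to stderr, following the logJobStatus function suggested earlier.
dump_container_logs() {
  local selector="$1" container
  shift
  for container in "$@"; do
    echo "===== Logs for container ${container} (selector ${selector}) =====" >&2
    # --tail=-1 requests all lines; with a selector kubectl defaults to the last 10.
    kubectl logs --selector="$selector" --container="$container" --tail=-1 >&2 ||
      echo "Could not fetch logs for ${container}" >&2
  done
}

# Example:
# dump_container_logs job-name=upgrade-test-runner sdk-client-test upgrade-test-controller
```

This has to run before the job and its pods are deleted, otherwise there is nothing left to read.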

If that doesn't work, getting the container logs from Logs Explorer into Cloud Build might require a log sink (https://cloud.google.com/logging/docs/export/configure_export_v2) with an inclusion filter like resource.labels.container_name="sdk-client-test" OR resource.labels.container_name="upgrade-test-controller", which may in turn require setting up a new bucket. Managing another bucket, linking to it, making sure each log only contains the container logs for a single test, and probably other considerations would make this quite a bit more complex, so it is not preferred.
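For reference, a rough sketch of the sink option. The sink and bucket names are hypothetical arguments, and `build_container_filter` is an illustrative helper that only assembles the filter string quoted above.

```shell
# Build a Cloud Logging inclusion filter matching any of the given container names.
build_container_filter() {
  local filter="" name
  for name in "$@"; do
    # Join clauses with OR, skipping the separator before the first clause.
    if [ -n "$filter" ]; then
      filter="$filter OR "
    fi
    filter="${filter}resource.labels.container_name=\"${name}\""
  done
  printf '%s\n' "$filter"
}

# Hypothetical sink and bucket names; substitute real ones before running.
create_upgrade_test_sink() {
  local sink_name="$1" bucket="$2"
  gcloud logging sinks create "$sink_name" "storage.googleapis.com/${bucket}" \
    --log-filter="$(build_container_filter sdk-client-test upgrade-test-controller)"
}
```

This covers only creating the sink; the bucket itself, its permissions, and separating logs per test run would still need to be handled.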

igooch avatar May 19 '25 23:05 igooch

Just poking on this 😄 would love to get some movement on this flaky test.

markmandel avatar Jun 14 '25 02:06 markmandel

Not sure if it's the K8s upgrade or the Go upgrade, but this seems way worse now 😬

markmandel avatar Jun 16 '25 09:06 markmandel

I'm working on this log-routing feature and have started adding some logs to help figure out the issue. I was busy the past few weeks, but now I have more time to focus on it and will try to fix it soon.

@markmandel Let me know if you have any suggestions or ideas that could help resolve this issue!

0xaravindh avatar Jun 16 '25 10:06 0xaravindh

Honestly.. I've barely an idea of what this code does 😁

markmandel avatar Jun 16 '25 10:06 markmandel