
feat: query data update

eggfoobar opened this pull request 2 years ago

/hold

alerts Information

There were 165 jobs added and 30 jobs removed.

Comparisons were above the allowed leeway of 1m0s

Note: alerts had 4 jobs increased and 86 jobs decreased.

Name Release From Arch Network Platform Topology Time Increase
KubePersistentVolumeErrors 4.12 amd64 sdn azure ha 29m20.76s
etcdGRPCRequestsSlow 4.12 ppc64le sdn libvirt ha 10m57.3s
etcdGRPCRequestsSlow 4.12 s390x sdn libvirt ha 10m57.3s
etcdGRPCRequestsSlow 4.12 amd64 ovn vsphere ha 3m16.96s

Missing Data

Note: These jobs were missing from the new data set but were present in the previous data set.

Name Release From Arch Network Platform Topology Time Increase
PrometheusOperatorWatchErrors 4.12 4.12 arm64 sdn aws ha 0s
etcdInsufficientMembers 4.12 arm64 sdn aws single 0s
etcdHighNumberOfLeaderChanges 4.12 arm64 sdn aws single 0s
etcdMembersDown 4.12 4.12 arm64 sdn aws ha 0s
etcdGRPCRequestsSlow 4.12 4.12 arm64 sdn aws ha 0s
KubeClientErrors 4.12 4.12 arm64 sdn aws ha 0s
KubeAPIErrorBudgetBurn 4.12 arm64 sdn aws single 0s
etcdNoLeader 4.12 4.12 arm64 sdn aws ha 0s
etcdMemberCommunicationSlow 4.12 arm64 sdn aws single 0s
KubeAPIErrorBudgetBurn 4.12 4.12 arm64 sdn aws ha 0s
etcdHighNumberOfFailedGRPCRequests 4.12 4.12 arm64 sdn aws ha 0s
etcdHighNumberOfLeaderChanges 4.12 4.12 arm64 sdn aws ha 0s
etcdInsufficientMembers 4.12 4.12 arm64 sdn aws ha 0s
MCDDrainError 4.12 arm64 sdn aws single 0s
etcdHighFsyncDurations 4.12 4.12 arm64 sdn aws ha 0s
KubeClientErrors 4.12 arm64 sdn aws single 0s
KubePersistentVolumeErrors 4.12 4.12 arm64 sdn aws ha 0s
KubePersistentVolumeErrors 4.12 arm64 sdn aws single 0s
VSphereOpenshiftNodeHealthFail 4.12 4.12 arm64 sdn aws ha 0s
etcdHighFsyncDurations 4.12 arm64 sdn aws single 0s
MCDDrainError 4.12 4.12 arm64 sdn aws ha 0s
PrometheusOperatorWatchErrors 4.12 arm64 sdn aws single 0s
etcdNoLeader 4.12 arm64 sdn aws single 0s
etcdHighCommitDurations 4.12 4.12 arm64 sdn aws ha 0s
etcdMembersDown 4.12 arm64 sdn aws single 0s
etcdGRPCRequestsSlow 4.12 arm64 sdn aws single 0s
VSphereOpenshiftNodeHealthFail 4.12 arm64 sdn aws single 0s
etcdHighCommitDurations 4.12 arm64 sdn aws single 0s
etcdMemberCommunicationSlow 4.12 4.12 arm64 sdn aws ha 0s
etcdHighNumberOfFailedGRPCRequests 4.12 arm64 sdn aws single 0s

disruptions Information

There were 215 jobs added and 79 jobs removed.

Comparisons were above the allowed leeway of 1m0s

Note: disruptions had 67 jobs increased and 305 jobs decreased.

Name Release From Arch Network Platform Topology Time Increase
ingress-to-console-reused-connections 4.12 amd64 sdn aws ha 45m46.12s
ingress-to-console-new-connections 4.12 arm64 sdn aws ha 44m42.75s
ingress-to-console-new-connections 4.12 amd64 sdn aws ha 44m42.75s
openshift-api-reused-connections 4.12 amd64 sdn aws ha 5m37.68s
openshift-api-reused-connections 4.12 arm64 sdn aws ha 5m37.68s
oauth-api-new-connections 4.12 amd64 ovn metal single 5m23.9s
oauth-api-reused-connections 4.12 amd64 ovn metal single 5m23.78s
cache-oauth-api-new-connections 4.12 amd64 ovn metal single 5m16.7s
cache-oauth-api-reused-connections 4.12 amd64 ovn metal single 5m16.58s
cache-openshift-api-reused-connections 4.12 amd64 sdn aws ha 5m15.66s
oauth-api-reused-connections 4.12 amd64 sdn aws ha 5m4.68s
oauth-api-reused-connections 4.12 arm64 sdn aws ha 5m4.68s
openshift-api-new-connections 4.12 arm64 sdn aws ha 5m4.26s
openshift-api-new-connections 4.12 amd64 sdn aws ha 5m4.26s
cache-oauth-api-reused-connections 4.12 amd64 sdn aws ha 5m2.61s
cache-oauth-api-reused-connections 4.12 arm64 sdn aws ha 5m2.61s
cache-openshift-api-new-connections 4.12 arm64 sdn aws ha 5m2.08s
cache-openshift-api-new-connections 4.12 amd64 sdn aws ha 5m2.08s
oauth-api-new-connections 4.12 amd64 sdn aws ha 4m27.25s
oauth-api-new-connections 4.12 arm64 sdn aws ha 4m27.25s
cache-openshift-api-reused-connections 4.12 amd64 ovn aws single 4m21.12s
cache-oauth-api-new-connections 4.12 arm64 sdn aws ha 4m16.8s
cache-oauth-api-new-connections 4.12 amd64 sdn aws ha 4m16.8s
kube-api-new-connections 4.12 arm64 sdn aws ha 4m7.8s
kube-api-new-connections 4.12 amd64 sdn aws ha 4m7.8s
cache-kube-api-new-connections 4.12 amd64 sdn aws ha 4m7.46s
cache-kube-api-new-connections 4.12 arm64 sdn aws ha 4m7.46s
openshift-api-reused-connections 4.12 amd64 ovn aws single 4m1.92s
cache-openshift-api-new-connections 4.12 arm64 ovn aws single 3m46.68s
cache-openshift-api-new-connections 4.12 amd64 ovn aws single 3m46.68s
kube-api-reused-connections 4.12 arm64 sdn aws ha 3m38.97s
kube-api-reused-connections 4.12 amd64 sdn aws ha 3m38.97s
cache-kube-api-reused-connections 4.12 amd64 sdn aws ha 3m35.54s
openshift-api-new-connections 4.12 amd64 ovn aws single 3m35.16s
openshift-api-new-connections 4.12 arm64 ovn aws single 3m35.16s
cache-kube-api-reused-connections 4.12 4.11 amd64 sdn azure ha 3m28.42s
cache-openshift-api-reused-connections 4.12 4.11 amd64 sdn azure ha 3m21.98s
ingress-to-oauth-server-new-connections 4.12 4.11 amd64 sdn azure ha 3m17.21s
cache-openshift-api-new-connections 4.12 4.11 amd64 sdn azure ha 3m16.1s
cache-oauth-api-new-connections 4.12 4.11 amd64 sdn azure ha 3m7.16s
cache-kube-api-new-connections 4.12 4.11 amd64 sdn azure ha 3m6.72s
ingress-to-console-new-connections 4.12 4.11 amd64 sdn azure ha 2m49.16s
openshift-api-new-connections 4.12 amd64 ovn metal single 2m35.34s
openshift-api-reused-connections 4.12 amd64 ovn metal single 2m35.24s
ingress-to-console-reused-connections 4.12 amd64 ovn metal single 2m27.14s
ingress-to-console-new-connections 4.12 amd64 ovn metal single 2m27.06s
ingress-to-oauth-server-reused-connections 4.12 4.11 amd64 sdn azure ha 2m22.58s
ingress-to-oauth-server-new-connections 4.12 amd64 ovn metal single 2m19.56s
ingress-to-console-reused-connections 4.12 4.11 amd64 sdn azure ha 2m16.41s
ingress-to-oauth-server-reused-connections 4.12 amd64 ovn metal single 2m13.46s
kube-api-reused-connections 4.12 amd64 ovn metal single 2m7.9s
kube-api-new-connections 4.12 amd64 ovn metal single 2m7.76s
openshift-api-new-connections 4.12 4.11 amd64 sdn azure ha 2m0.48s
oauth-api-reused-connections 4.12 4.11 amd64 sdn azure ha 1m59.61s
kube-api-new-connections 4.12 4.11 amd64 sdn azure ha 1m54.97s
ingress-to-oauth-server-new-connections 4.12 4.11 amd64 sdn aws ha 1m51.54s
ingress-to-oauth-server-new-connections 4.12 4.11 arm64 sdn aws ha 1m51.54s
ingress-to-console-new-connections 4.12 4.11 amd64 sdn aws ha 1m45.43s
ingress-to-console-new-connections 4.12 4.11 arm64 sdn aws ha 1m45.43s
oauth-api-new-connections 4.12 4.11 amd64 sdn azure ha 1m32.53s
cache-oauth-api-reused-connections 4.12 4.11 amd64 sdn azure ha 1m30.48s
kube-api-reused-connections 4.12 4.11 amd64 sdn azure ha 1m22.27s
ingress-to-console-reused-connections 4.12 4.11 arm64 sdn aws ha 1m21.04s
ingress-to-console-reused-connections 4.12 4.11 amd64 sdn aws ha 1m21.04s
ingress-to-oauth-server-new-connections 4.12 4.12 amd64 sdn aws ha 1m9.41s
cache-openshift-api-new-connections 4.12 amd64 ovn metal single 1m5.54s
cache-openshift-api-reused-connections 4.12 amd64 ovn metal single 1m3.32s

Missing Data

Note: These jobs were missing from the new data set but were present in the previous data set.

Name Release From Arch Network Platform Topology Time Increase
cache-openshift-api-reused-connections 4.12 arm64 ovn aws single 0s
kube-api-new-connections 4.12 arm64 sdn aws single 0s
kube-api-reused-connections 4.12 amd64 ovn metal ha 0s
oauth-api-new-connections 4.12 amd64 sdn vsphere ha 0s
oauth-api-new-connections 4.12 arm64 sdn aws single 0s
openshift-api-new-connections 4.12 amd64 sdn vsphere ha 0s
oauth-api-new-connections 4.12 4.12 arm64 sdn aws ha 0s
openshift-api-new-connections 4.12 arm64 sdn aws single 0s
oauth-api-reused-connections 4.12 amd64 ovn metal ha 0s
ingress-to-oauth-server-reused-connections 4.12 arm64 ovn aws ha 0s
ingress-to-console-reused-connections 4.12 4.12 arm64 sdn aws ha 0s
openshift-api-reused-connections 4.12 amd64 sdn ovirt ha 0s
image-registry-reused-connections 4.12 4.12 arm64 sdn aws ha 0s
openshift-api-reused-connections 4.12 arm64 sdn aws single 0s
ingress-to-oauth-server-new-connections 4.12 4.12 arm64 sdn aws ha 0s
ingress-to-console-reused-connections 4.12 arm64 ovn aws single 0s
kube-api-new-connections 4.12 4.12 arm64 sdn aws ha 0s
cache-oauth-api-reused-connections 4.12 amd64 ovn metal ha 0s
service-load-balancer-with-pdb-reused-connections 4.12 4.12 arm64 sdn aws ha 0s
cache-openshift-api-reused-connections 4.12 arm64 sdn aws ha 0s
ingress-to-oauth-server-reused-connections 4.12 4.10 amd64 sdn aws ha 0s
cache-kube-api-reused-connections 4.12 arm64 sdn aws single 0s
cache-oauth-api-reused-connections 4.12 arm64 sdn aws single 0s
ingress-to-oauth-server-reused-connections 4.12 4.12 arm64 sdn aws ha 0s
image-registry-new-connections 4.12 4.12 arm64 sdn aws ha 0s
cache-oauth-api-new-connections 4.12 amd64 sdn ha 0s
ingress-to-oauth-server-new-connections 4.12 arm64 sdn aws single 0s
oauth-api-reused-connections 4.12 amd64 sdn vsphere ha 0s
kube-api-new-connections 4.12 4.10 amd64 sdn aws ha 0s
cache-openshift-api-reused-connections 4.12 amd64 sdn ovirt ha 0s
openshift-api-reused-connections 4.12 amd64 ovn metal ha 0s
cache-oauth-api-reused-connections 4.12 amd64 sdn ovirt ha 0s
ingress-to-console-new-connections 4.12 arm64 sdn aws single 0s
openshift-api-reused-connections 4.12 4.10 amd64 sdn aws ha 0s
ingress-to-console-new-connections 4.12 4.12 arm64 sdn aws ha 0s
cache-kube-api-new-connections 4.12 arm64 sdn aws single 0s
cache-openshift-api-new-connections 4.12 amd64 sdn vsphere ha 0s
cache-openshift-api-new-connections 4.12 amd64 ovn ha 0s
ci-cluster-network-liveness-reused-connections 4.12 4.12 amd64 ovn azure single 0s
cache-oauth-api-new-connections 4.12 4.10 amd64 sdn aws ha 0s
cache-openshift-api-reused-connections 4.12 arm64 sdn aws single 0s
kube-api-reused-connections 4.12 4.12 arm64 sdn aws ha 0s
cache-kube-api-reused-connections 4.12 amd64 sdn vsphere ha 0s
ci-cluster-network-liveness-new-connections 4.12 4.12 arm64 sdn aws ha 0s
cache-kube-api-reused-connections 4.12 arm64 sdn aws ha 0s
openshift-api-reused-connections 4.12 arm64 ovn aws single 0s
openshift-api-reused-connections 4.12 amd64 sdn vsphere ha 0s
ingress-to-console-reused-connections 4.12 amd64 ovn ha 0s
cache-openshift-api-new-connections 4.12 4.10 amd64 sdn aws ha 0s
ingress-to-console-reused-connections 4.12 arm64 sdn aws single 0s
openshift-api-new-connections 4.12 4.10 amd64 sdn aws ha 0s
ingress-to-oauth-server-reused-connections 4.12 amd64 ovn ha 0s
cache-oauth-api-new-connections 4.12 arm64 sdn aws single 0s
cache-oauth-api-reused-connections 4.12 amd64 sdn vsphere ha 0s
openshift-api-new-connections 4.12 amd64 ovn ha 0s
ingress-to-console-reused-connections 4.12 amd64 sdn ovirt ha 0s
kube-api-reused-connections 4.12 amd64 sdn ovirt ha 0s
image-registry-reused-connections 4.12 4.10 amd64 sdn aws ha 0s
cache-openshift-api-new-connections 4.12 arm64 sdn aws single 0s
oauth-api-reused-connections 4.12 4.12 arm64 sdn aws ha 0s
kube-api-new-connections 4.12 amd64 ovn ha 0s
ci-cluster-network-liveness-reused-connections 4.12 4.12 arm64 ovn aws ha 0s
cache-openshift-api-reused-connections 4.12 amd64 sdn vsphere ha 0s
service-load-balancer-with-pdb-new-connections 4.12 4.12 arm64 sdn aws ha 0s
cache-kube-api-new-connections 4.12 amd64 ovn ha 0s
cache-kube-api-reused-connections 4.12 amd64 ovn metal ha 0s
ingress-to-console-reused-connections 4.12 arm64 sdn aws ha 0s
ingress-to-oauth-server-reused-connections 4.12 arm64 sdn aws single 0s
oauth-api-new-connections 4.12 amd64 ovn ha 0s
cache-openshift-api-reused-connections 4.12 amd64 ovn metal ha 0s
cache-openshift-api-reused-connections 4.12 amd64 ovn gcp ha 0s
oauth-api-reused-connections 4.12 arm64 sdn aws single 0s
service-load-balancer-with-pdb-reused-connections 4.12 4.10 amd64 sdn aws ha 0s
kube-api-reused-connections 4.12 amd64 sdn vsphere ha 0s
ingress-to-oauth-server-new-connections 4.12 amd64 sdn ha 0s
ingress-to-console-reused-connections 4.12 amd64 ovn azure ha 0s
oauth-api-reused-connections 4.12 amd64 sdn ovirt ha 0s
kube-api-reused-connections 4.12 arm64 sdn aws single 0s
kube-api-new-connections 4.12 amd64 ovn azure ha 0s

Signed-off-by: ehila [email protected]

eggfoobar commented Sep 21 '22 06:09

@eggfoobar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node-upgrade c03aae14903ed366335821aef86ac7ac413efbbd link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-aws-ovn-cgroupsv2 c03aae14903ed366335821aef86ac7ac413efbbd link false /test e2e-aws-ovn-cgroupsv2
ci/prow/e2e-gcp-ovn-rt-upgrade c03aae14903ed366335821aef86ac7ac413efbbd link false /test e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn-serial c03aae14903ed366335821aef86ac7ac413efbbd link true /test e2e-aws-ovn-serial

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci[bot] commented Sep 21 '22 09:09

/approve

deads2k commented Sep 21 '22 13:09

/cc @dgoodwin

Hey Devan, this was generated yesterday. With the changes you've merged to remove the double count, should we wait until the end of the week to update this data set, or is this good enough to go in now?

I can also re-run it now with today's data to get the latest.

eggfoobar commented Sep 21 '22 16:09

/lgtm

xueqzhan commented Sep 21 '22 18:09

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, eggfoobar, xueqzhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] commented Sep 21 '22 18:09

Some things look really off here.

Some interesting things: a lot of arm64 alert entries disappeared. @deepsm007, do you have any idea why this might be? Did we somehow stop running alert data on arm jobs? Or did we somehow stop collecting it, or did some jobs disappear?

We also have a 45 minute increase for 3 ingress-to-console backends on HA? And a pile of 1-5 minute increases? These are absolute eternities in disruption; something seems very wrong.

Looking at ingress-to-console-new-connections | 4.12 |   | amd64 | sdn | aws | ha | 44m42.75s (link: https://datastudio.google.com/s/v2J04D7Pztc), I see the 7-day rolling average as 1-2 seconds, not 45 minutes...

Investigating more, this comment may get more updates, bear with me.

OK, I wasn't using the right FromRelease; this is for "", and I don't even know what that means. Not an upgrade, i.e. a straight e2e job? I also need to remember these are P95s and P99s, not averages.

$ cat incoming-query_results.json | jq '.[] | select(.Platform == "aws" and .Topology == "ha" and .BackendName == "ingress-to-console-new-connections" and .Architecture == "amd64" and .Network == "sdn")'

{
  "BackendName": "ingress-to-console-new-connections",
  "Release": "4.12",
  "FromRelease": "4.10",
  "Platform": "aws",
  "Architecture": "amd64",
  "Network": "sdn",
  "Topology": "ha",
  "P95": "7.3999999999999995",
  "P99": "7.88"
}
{
  "BackendName": "ingress-to-console-new-connections",
  "Release": "4.12",
  "FromRelease": "4.11",
  "Platform": "aws",
  "Architecture": "amd64",
  "Network": "sdn",
  "Topology": "ha",
  "P95": "87.299999999999926",
  "P99": "126.47999999999999"
}
{
  "BackendName": "ingress-to-console-new-connections",
  "Release": "4.12",
  "FromRelease": "4.12",
  "Platform": "aws",
  "Architecture": "amd64",
  "Network": "sdn",
  "Topology": "ha",
  "P95": "9.0",
  "P99": "26.95999999999998"
}
{
  "BackendName": "ingress-to-console-new-connections",
  "Release": "4.12",
  "FromRelease": "",
  "Platform": "aws",
  "Architecture": "amd64",
  "Network": "sdn",
  "Topology": "ha",
  "P95": "659.74999999998965",
  "P99": "2708.65"
}

Indeed the P95s and P99s are pretty wildly off the charts, but what is a FromRelease of ""?

dgoodwin commented Sep 22 '22 11:09

Some interesting things: a lot of arm64 alert entries disappeared. @deepsm007, do you have any idea why this might be? Did we somehow stop running alert data on arm jobs? Or did we somehow stop collecting it, or did some jobs disappear?

We didn't change any alerts on arm jobs, nor did we stop any previous jobs. There were some network issues with all the arm instances across the board; we're still investigating the root cause.

deepsm007 commented Sep 22 '22 12:09

I uncovered why ingress-to-console-new-connections is so high: it's one particular job reporting over 2500s every time, periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-no-capabilities. It looks like it may be new as of Sep 10, according to the results in BigQuery.

SELECT JobName,JobRunName,JobRunStartTime,DisruptionSeconds FROM `openshift-ci-data-analysis.ci_data.UnifiedBackendDisruption` 
WHERE BackendName = 'ingress-to-console-new-connections' AND Release="4.12" and FromRelease="" AND Platform="aws" and Network="sdn" AND Topology="ha" and Architecture="amd64" AND DisruptionSeconds > 60
ORDER BY JobRunStartTime DESC
LIMIT 1000

Now what to do about it...
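One possible shape of a fix (just a sketch, not necessarily how the query generator actually filters; the NOT LIKE pattern is an assumption based on the job name above) would be to drop that job from the aggregation and see how far the percentiles come down:

-- Recompute P95/P99 for this backend while excluding the no-capabilities job identified above
SELECT BackendName,
  APPROX_QUANTILES(DisruptionSeconds, 100)[OFFSET(95)] AS P95,
  APPROX_QUANTILES(DisruptionSeconds, 100)[OFFSET(99)] AS P99
FROM `openshift-ci-data-analysis.ci_data.UnifiedBackendDisruption`
WHERE BackendName = 'ingress-to-console-new-connections' AND Release="4.12" AND FromRelease="" AND Platform="aws" AND Network="sdn" AND Topology="ha" AND Architecture="amd64"
  AND JobName NOT LIKE '%no-capabilities%'
GROUP BY BackendName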

dgoodwin commented Sep 22 '22 14:09

Oh! That makes sense; that job is turning off console altogether and will be turning off other operators as other capabilities get added to that list. @wking would have more insight into that.

eggfoobar commented Sep 22 '22 15:09

https://issues.redhat.com/browse/TRT-573 is filed to fix the massive disruption from no-capabilities jobs.

Someone still needs to look into the 3-5 minute jumps on a bunch of critical backends; we need to know why that happened. These are not levels of disruption we can really accept as passing.
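For what it's worth, a query along the same lines as the ones in this thread could narrow down which job runs are behind those jumps (a sketch only; the backend names come from the table in the description and the 180 second cutoff is just illustrative):

-- Find job runs with disruption over 3 minutes for a few of the affected backends
SELECT JobName, BackendName, COUNT(*) AS RunsOverCutoff, MAX(DisruptionSeconds) AS WorstRun
FROM `openshift-ci-data-analysis.ci_data.UnifiedBackendDisruption`
WHERE BackendName IN ('openshift-api-reused-connections', 'oauth-api-reused-connections', 'kube-api-new-connections')
  AND Release="4.12" AND Platform="aws" AND Network="sdn" AND Topology="ha" AND Architecture="amd64"
  AND DisruptionSeconds > 180
GROUP BY JobName, BackendName
ORDER BY WorstRun DESC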

dgoodwin commented Sep 23 '22 11:09

SELECT JobName,JobRunName,JobRunStartTime,DisruptionSeconds FROM `openshift-ci-data-analysis.ci_data.UnifiedBackendDisruption` 
WHERE BackendName = 'openshift-api-reused-connections' AND Release="4.12" AND Platform="aws" and Network="sdn" AND Topology="ha" and Architecture="amd64" AND FromRelease=""
ORDER BY JobRunStartTime DESC
LIMIT 1000

indicates that release-openshift-origin-installer-e2e-aws-disruptive-4.12 must also get filtered; it shows values over 300 seconds every time, where 1-2 seconds is normal.

dgoodwin commented Sep 23 '22 12:09

I think the 3 minute azure sdn jump was an incident that has since resolved; I suspect these will go away on regen. I don't see anything over 120s since Sep 6.
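For reference, a check along the lines of the earlier queries can confirm that (a sketch; the date cutoff and 120s threshold match the observation above, and it assumes JobRunStartTime is a TIMESTAMP column):

-- Look for any 4.11-to-4.12 azure sdn runs over 120s since Sep 6
SELECT JobName, JobRunName, JobRunStartTime, DisruptionSeconds
FROM `openshift-ci-data-analysis.ci_data.UnifiedBackendDisruption`
WHERE Release="4.12" AND FromRelease="4.11" AND Platform="azure" AND Network="sdn" AND Topology="ha" AND Architecture="amd64"
  AND DisruptionSeconds > 120 AND JobRunStartTime >= TIMESTAMP("2022-09-06")
ORDER BY JobRunStartTime DESC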

dgoodwin commented Sep 23 '22 13:09

I think that's it for now; we'll look again on the next attempt, but we can't merge this one as the data is too wild. TRT-573 is filed to correct the issues found so far.

/close

dgoodwin commented Sep 23 '22 13:09

@dgoodwin: Closed this PR.

In response to this:

I think that's it for now; we'll look again on the next attempt, but we can't merge this one as the data is too wild. TRT-573 is filed to correct the issues found so far.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] commented Sep 23 '22 13:09