[flaky] TestMetricsExport is super flaky
metrics/resource_view_test.go - TestMetricsExport flakes most of the time.
/assign @jjzeng-seattle @evankanderson
here's a good example: https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/1666/pull-knative-pkg-unit-tests/1300475037081407489
cc @yanweiguo (community oncall)
I ran the tests hundreds of times with my own cluster and could only reproduce the timeout error once. I guess I'll have to send out a PR and use CI to debug.
/reopen
I still see issues, though with a different signature. Collecting signatures here.
=== CONT TestMetricsExport
resource_view_test.go:334: Created exporter at localhost:12345
logger.go:130: 2020-09-27T02:15:16.884Z INFO metrics/exporter.go:155 Flushing the existing exporter before setting up the new exporter.
logger.go:130: 2020-09-27T02:15:16.940Z ERROR websocket/connection.go:138 Websocket connection could not be established {"error": "dial tcp: lookup somewhere.not.exist on 10.7.240.10:53: no such host"}
logger.go:130: 2020-09-27T02:15:16.975Z INFO metrics/opencensus_exporter.go:56 Created OpenCensus exporter with config: {"config": {}}
logger.go:130: 2020-09-27T02:15:16.975Z INFO metrics/exporter.go:168 Successfully updated the metrics exporter; old config: &{knative.dev/serving testComponent prometheus 1000000000 <nil> <nil> false 19090 false { false}}; new config &{knative.dev/serving testComponent opencensus 1000000000 <nil> <nil> localhost:12345 false 0 false { false}}
and then:
resource_view_test.go:370: Timeout reading input
resource_view_test.go:376: Unexpected OpenCensus exports (-want +got):
[]metrics.metricExtract(Inverse(Sort, []string{
"knative.dev/serving/testComponent/global_export_counts<>:2",
"knative.dev/serving/testComponent/resource_global_export_count<>:2",
`knative.dev/serving/testComponent/testing/value<project="p1",rev`...,
- `knative.dev/serving/testComponent/testing/value<project="p1",revision="r2">:1`,
}))
@evankanderson: Reopened this issue.
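For context on the diff above: the Inverse(Sort, ...) wrapper is the output go-cmp produces when a transformer sorts the metric snapshots into comparable strings before diffing. Below is a hypothetical reconstruction of that kind of comparison; the type and helper names are illustrative and not copied from resource_view_test.go.

```go
package main

import (
	"fmt"
	"sort"

	"github.com/google/go-cmp/cmp"
)

// metricExtract is an illustrative stand-in for the test's snapshot type:
// a metric name, its tags, and the last recorded value.
type metricExtract struct {
	Name  string
	Tags  string
	Value int64
}

func (m metricExtract) key() string {
	return fmt.Sprintf("%s<%s>:%d", m.Name, m.Tags, m.Value)
}

// sortMetrics renders each extract as a "name<tags>:value" string and sorts
// the result, so the diff is independent of export order.
var sortMetrics = cmp.Transformer("Sort", func(in []metricExtract) []string {
	out := make([]string, 0, len(in))
	for _, m := range in {
		out = append(out, m.key())
	}
	sort.Strings(out)
	return out
})

func main() {
	want := []metricExtract{
		{Name: "testing/value", Tags: `project="p1",revision="r1"`, Value: 1},
		{Name: "testing/value", Tags: `project="p1",revision="r2"`, Value: 1},
	}
	// A timed-out read leaves one data point missing, reproducing the
	// lone "-want" line in the diff above.
	got := []metricExtract{
		{Name: "testing/value", Tags: `project="p1",revision="r1"`, Value: 1},
	}
	if diff := cmp.Diff(want, got, sortMetrics); diff != "" {
		fmt.Printf("Unexpected OpenCensus exports (-want +got):\n%s", diff)
	}
}
```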
Seeing interleaved logging from different tests, I'm slightly suspicious that we're seeing side effects of the global monitoring singleton (see the sketch below).
Unfortunately, it seems hard to adjust our current Prow test infrastructure to run these tests separately; let me look into doing it via GitHub Actions.
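To illustrate that suspicion, here is a minimal, hypothetical Go sketch (not knative/pkg code) of how a package-level singleton lets parallel tests interfere with each other:

```go
package metricsflake

import "testing"

// globalExporter stands in for a package-level monitoring singleton.
// Any test that swaps it affects every other test running concurrently.
var globalExporter = "prometheus"

func setExporter(name string) { globalExporter = name }

func TestOpenCensusExport(t *testing.T) {
	t.Parallel()
	setExporter("opencensus")
	if globalExporter != "opencensus" {
		// Another parallel test may have swapped the singleton between
		// the write above and this read, producing a flaky failure.
		t.Fatalf("expected opencensus exporter, got %q", globalExporter)
	}
}

func TestPrometheusExport(t *testing.T) {
	t.Parallel()
	setExporter("prometheus") // races with TestOpenCensusExport via the shared global
	if globalExporter != "prometheus" {
		t.Fatalf("expected prometheus exporter, got %q", globalExporter)
	}
}
```

Running tests like these with the race detector enabled typically flags the shared write, which is one way to confirm whether the singleton is the culprit.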
Update: I've managed to reproduce this in roughly 1 of 50 runs when running all the tests under the e2e script.
It looks like the default exporter is somehow sometimes trying to export to the default localhost:55678 address rather than the address in the config. I'm still trying to figure out why this happens.
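For reference, assuming the OpenCensus exporter here is built on contrib.go.opencensus.io/exporter/ocagent (an assumption; the actual constructor in metrics/opencensus_exporter.go may differ), that library dials its default agent endpoint on port 55678 whenever no address option is supplied, which matches the symptom. A minimal sketch of passing the configured address through explicitly:

```go
package main

import (
	"log"

	"contrib.go.opencensus.io/exporter/ocagent"
	"go.opencensus.io/stats/view"
)

func main() {
	// Without WithAddress, ocagent falls back to its default agent
	// endpoint (port 55678), the behavior observed in the flake.
	// Passing the address from the metrics config avoids the fallback.
	exporter, err := ocagent.NewExporter(
		ocagent.WithAddress("localhost:12345"), // address from the test config
		ocagent.WithInsecure(),
		ocagent.WithServiceName("testComponent"),
	)
	if err != nil {
		log.Fatalf("failed to create OpenCensus exporter: %v", err)
	}
	defer exporter.Stop()

	// Register so recorded view data is sent to the configured endpoint.
	view.RegisterExporter(exporter)
}
```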
I've also found a small bug in the Prometheus exporter: it won't necessarily re-create the exporter if the port changes. Since that seems to happen rarely in the current scenarios, I'm going to roll that fix in with the other change.
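The missing check is roughly equivalent to the following hypothetical helper (names are illustrative, not the actual knative/pkg functions):

```go
// metricsConfig is a pared-down stand-in for the real exporter config.
type metricsConfig struct {
	backend        string
	prometheusPort int
}

// needsNewExporter reports whether the running exporter must be torn down
// and rebuilt. The bug described above is the equivalent of omitting the
// port comparison, so a port-only change silently reuses the old exporter.
func needsNewExporter(old, cur *metricsConfig) bool {
	if old == nil {
		return true
	}
	return old.backend != cur.backend || old.prometheusPort != cur.prometheusPort
}
```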
I've fixed a few bugs, but I'm still seeing one case where the default meter seems to have lost track of all of the metrics associated with it.
@evankanderson any updates on this?
@evankanderson I still see this issue here: https://github.com/knative/pkg/pull/2005#issuecomment-776174080
/reopen
TestMetricsExport/OpenCensus
https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2189/pull-knative-pkg-unit-tests/1415384970213462016
@dprotaso: Reopened this issue.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
still messed up https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2328/pull-knative-pkg-unit-tests/1453084081465069568
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/lifecycle frozen