TestGRPCConnection unit test fails
Running go test ./... on a checkout of v0.22.0, the TestGRPCConnection test consistently fails as follows:
--- FAIL: TestGRPCConnection (0.00s)
grpc_test.go:73: GRPC probe failed
panic: Fail in goroutine after TestGRPCConnection has completed
goroutine 579 [running]:
testing.(*common).Fail(0xc000318820)
/usr/lib/go-1.19/src/testing/testing.go:824 +0xe5
testing.(*common).Errorf(0xc000318820, {0xc2f5b0?, 0xc0002df8f0?}, {0xc000333fc0?, 0x0?, 0xc000601820?})
/usr/lib/go-1.19/src/testing/testing.go:941 +0x65
github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection.func1()
/home/daniel/Work/Debian/prometheus-blackbox-exporter/prober/grpc_test.go:56 +0x6e
created by github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection
/home/daniel/Work/Debian/prometheus-blackbox-exporter/prober/grpc_test.go:54 +0x325
FAIL github.com/prometheus/blackbox_exporter/prober 0.013s
FAIL
Tested with Go 1.19.1 on linux/amd64.
This still fails identically with v0.23.0 (and master branch, as of writing).
What's more, isolating the run to just the gRPC tests fails in a completely different way:
$ go test ./...
ok github.com/prometheus/blackbox_exporter (cached)
ok github.com/prometheus/blackbox_exporter/config (cached)
--- FAIL: TestGRPCConnection (0.00s)
grpc_test.go:73: GRPC probe failed
panic: Fail in goroutine after TestGRPCConnection has completed
goroutine 596 [running]:
testing.(*common).Fail(0xc000301d40)
/usr/lib/go-1.19/src/testing/testing.go:824 +0xe5
testing.(*common).Errorf(0xc000301d40, {0xc68a18?, 0xc000391218?}, {0xc000521fc0?, 0x12b4c70?, 0xc000529a90?})
/usr/lib/go-1.19/src/testing/testing.go:941 +0x65
github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection.func1()
/tmp/blackbox_exporter/prober/grpc_test.go:56 +0x6e
created by github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection
/tmp/blackbox_exporter/prober/grpc_test.go:54 +0x325
FAIL github.com/prometheus/blackbox_exporter/prober 0.012s
FAIL
vs.
$ go test -run TestGRPC ./...
ok github.com/prometheus/blackbox_exporter (cached) [no tests to run]
ok github.com/prometheus/blackbox_exporter/config (cached) [no tests to run]
--- FAIL: TestGRPCConnection (0.00s)
grpc_test.go:73: GRPC probe failed
--- FAIL: TestGRPCTLSConnection (0.14s)
grpc_test.go:237: GRPC probe failed
--- FAIL: TestGRPCServiceNotFound (0.00s)
utils_test.go:48: Expected: probe_grpc_status_code: 5, got: probe_grpc_status_code: 0
--- FAIL: TestGRPCHealthCheckUnimplemented (0.00s)
utils_test.go:48: Expected: probe_grpc_status_code: 12, got: probe_grpc_status_code: 0
FAIL
FAIL github.com/prometheus/blackbox_exporter/prober 0.148s
FAIL
In fact, the latter is inconsistent. Sometimes the gRPC server error message is reported:
--- FAIL: TestGRPCTLSConnection (0.06s)
grpc_test.go:237: GRPC probe failed
grpc_test.go:218: failed to serve: grpc: the server has been stopped
Other times, the test panics as it does when running the full suite of tests:
$ go test -run TestGRPC ./...
ok github.com/prometheus/blackbox_exporter (cached) [no tests to run]
ok github.com/prometheus/blackbox_exporter/config (cached) [no tests to run]
--- FAIL: TestGRPCConnection (0.00s)
grpc_test.go:73: GRPC probe failed
--- FAIL: TestGRPCTLSConnection (0.05s)
grpc_test.go:237: GRPC probe failed
panic: Fail in goroutine after TestGRPCTLSConnection has completed
goroutine 53 [running]:
testing.(*common).Fail(0xc0002b6000)
/usr/lib/go-1.19/src/testing/testing.go:824 +0xe5
testing.(*common).Errorf(0xc0002b6000, {0xc68a18?, 0xc000281098?}, {0xc0004087c0?, 0x0?, 0x0?})
/usr/lib/go-1.19/src/testing/testing.go:941 +0x65
github.com/prometheus/blackbox_exporter/prober.TestGRPCTLSConnection.func1()
/tmp/blackbox_exporter/prober/grpc_test.go:218 +0x6e
created by github.com/prometheus/blackbox_exporter/prober.TestGRPCTLSConnection
/tmp/blackbox_exporter/prober/grpc_test.go:216 +0x805
FAIL github.com/prometheus/blackbox_exporter/prober 0.055s
FAIL
Something smells racy here.
@averzicco As the author of the GRPC tests, are you able to reproduce the failures that I see here? I'm surprised this made it past CI when being merged.
hi @dswarbrick,
I've quickly checked out the code; all tests pass:
$ git clone git@github.com:prometheus/blackbox_exporter.git
Cloning into 'blackbox_exporter'...
remote: Enumerating objects: 5159, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 5159 (delta 4), reused 2 (delta 1), pack-reused 5149
Receiving objects: 100% (5159/5159), 6.95 MiB | 7.05 MiB/s, done.
Resolving deltas: 100% (2912/2912), done.
$ cd blackbox_exporter/
$ make test
>> running all tests
go test -race ./...
go: downloading github.com/miekg/dns v1.1.50
go: downloading github.com/go-kit/log v0.2.1
go: downloading github.com/prometheus/common v0.39.0
go: downloading github.com/prometheus/client_golang v1.14.0
go: downloading golang.org/x/net v0.4.0
go: downloading google.golang.org/grpc v1.51.0
go: downloading github.com/prometheus/client_model v0.3.0
go: downloading github.com/prometheus/exporter-toolkit v0.8.2
go: downloading golang.org/x/sys v0.3.0
go: downloading golang.org/x/oauth2 v0.3.0
go: downloading github.com/matttproud/golang_protobuf_extensions v1.0.4
go: downloading golang.org/x/text v0.5.0
go: downloading github.com/coreos/go-systemd/v22 v22.4.0
go: downloading golang.org/x/sync v0.1.0
ok github.com/prometheus/blackbox_exporter 0.042s
ok github.com/prometheus/blackbox_exporter/config 0.081s
ok github.com/prometheus/blackbox_exporter/prober 2.860s
$ go test ./...
ok github.com/prometheus/blackbox_exporter 0.006s
ok github.com/prometheus/blackbox_exporter/config 0.014s
ok github.com/prometheus/blackbox_exporter/prober 1.906s
$ go test -run TestGRPC ./...
ok github.com/prometheus/blackbox_exporter 0.011s [no tests to run]
ok github.com/prometheus/blackbox_exporter/config 0.009s [no tests to run]
ok github.com/prometheus/blackbox_exporter/prober 0.130s
$ go version
go version go1.19.3 linux/amd64
@averzicco I get inconsistent results, depending on whether I run tests with -race or not.
$ go test -race ./...
ok github.com/prometheus/blackbox_exporter (cached)
ok github.com/prometheus/blackbox_exporter/config (cached)
--- FAIL: TestGRPCConnection (0.00s)
grpc_test.go:73: GRPC probe failed
--- FAIL: TestMultipleGRPCservices (0.00s)
grpc_test.go:140: GRPC probe failed for service1
--- FAIL: TestGRPCTLSConnection (0.13s)
grpc_test.go:237: GRPC probe failed
--- FAIL: TestNoTLSConnection (0.00s)
utils_test.go:48: Expected: probe_grpc_status_code: 14, got: probe_grpc_status_code: 0
--- FAIL: TestGRPCServiceNotFound (0.00s)
utils_test.go:48: Expected: probe_grpc_status_code: 5, got: probe_grpc_status_code: 0
--- FAIL: TestGRPCHealthCheckUnimplemented (0.00s)
utils_test.go:48: Expected: probe_grpc_status_code: 12, got: probe_grpc_status_code: 0
2023/01/03 22:40:07 http: TLS handshake error from 127.0.0.1:43152: remote error: tls: bad certificate
level=info msg="Resolving target address" target=ipv6.google.com ip_protocol=ip4
level=info msg="Resolved target address" target=ipv6.google.com ip=2404:6800:4006:804::200e
level=info msg="Resolving target address" target=ipv6.google.com ip_protocol=ip4
level=error msg="Resolution with IP protocol failed" target=ipv6.google.com ip_protocol=ip4 err="lookup ipv6.google.com: no such host"
FAIL
FAIL github.com/prometheus/blackbox_exporter/prober 1.455s
FAIL
$ go test ./...
ok github.com/prometheus/blackbox_exporter (cached)
ok github.com/prometheus/blackbox_exporter/config (cached)
--- FAIL: TestGRPCConnection (0.00s)
grpc_test.go:73: GRPC probe failed
panic: Fail in goroutine after TestGRPCConnection has completed
goroutine 617 [running]:
testing.(*common).Fail(0xc00023c820)
/usr/lib/go-1.19/src/testing/testing.go:824 +0xe5
testing.(*common).Errorf(0xc00023c820, {0xc6b41c?, 0xc00038a378?}, {0xc0002a9fc0?, 0x12b9b70?, 0xc0000b7cc0?})
/usr/lib/go-1.19/src/testing/testing.go:941 +0x65
github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection.func1()
/tmp/blackbox_exporter/prober/grpc_test.go:56 +0x6e
created by github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection
/tmp/blackbox_exporter/prober/grpc_test.go:54 +0x325
FAIL github.com/prometheus/blackbox_exporter/prober 0.012s
FAIL
That's on Ubuntu 22.10 (amd64) with Go 1.19.2 (and also 1.19.4, installed from Ubuntu lunar repos), Intel Core i7 (16 cores). ~~The same results are observed when testing in a Debian sid schroot on that hardware.~~
Running the tests in a Debian sid schroot on the same hardware, or on a Debian testing (bookworm) Intel Core i3, the tests pass.
Based on the output you posted, the result is not inconsistent; the test failed in both cases.
I've re-run the tests with the -race flag (which make test uses) and they are still successful:
$ go test -race ./...
ok github.com/prometheus/blackbox_exporter 0.050s
ok github.com/prometheus/blackbox_exporter/config 0.065s
ok github.com/prometheus/blackbox_exporter/prober 3.121s
$ go test -race -run TestGRPC ./...
ok github.com/prometheus/blackbox_exporter 0.039s [no tests to run]
ok github.com/prometheus/blackbox_exporter/config 0.043s [no tests to run]
ok github.com/prometheus/blackbox_exporter/prober 0.200s
I've also tried running the tests on a 32-core Debian 10 machine; they are successful there as well. I'm not sure what the issue could be here. Since you can reproduce it locally, could you try running the tests in debug mode to check what is failing?
@averzicco By "inconsistent" I mean that the tests are failing in different ways - but yes, they consistently fail.
A small correction to my earlier comment - the tests pass in the Debian sid schroot, yet fail when run natively on the Ubuntu 22.10 system which hosts the schroot envs. That would seem to suggest something weird with Ubuntu's Go toolchain, which I find a bit hard to swallow.
Also, despite the tests passing in a Debian sid schroot on the Core i7, and natively on a Debian testing Core i3, these tests have been quite flaky on the Debian CI infrastructure which builds the prometheus-blackbox-exporter package (for a variety of archs), e.g. https://tests.reproducible-builds.org/debian/rbuild/unstable/amd64/prometheus-blackbox-exporter_0.23.0-1.rbuild.log.gz
From what I see in the log you linked, it seems something is wrong with the gRPC build-deps (TestGRPCConnection and TestMultipleGRPCservices get skipped); maybe TestGRPCTLSConnection doesn't get skipped because the server is created with TLS enabled? I guess the important part is that something is trying to skip all the gRPC tests but doesn't skip TestGRPCTLSConnection:
=== RUN TestGRPCConnection
grpc_test.go:38: Skipping unsupported test; gRPC build-deps too oudated.
--- SKIP: TestGRPCConnection (0.00s)
=== RUN TestMultipleGRPCservices
grpc_test.go:104: Skipping unsupported test; gRPC build-deps too oudated.
--- SKIP: TestMultipleGRPCservices (0.00s)
=== RUN TestGRPCTLSConnection
grpc_test.go:239: GRPC probe failed
--- FAIL: TestGRPCTLSConnection (1.20s)
=== RUN TestNoTLSConnection
panic: Fail in goroutine after TestGRPCTLSConnection has completed
@averzicco Disregard the skipped tests; they were previously patched out due to older grpc build-deps in Debian, which were unable to support grpc.WithTransportCredentials(insecure.NewCredentials()) in the grpc prober code. That has now been patched to grpc.WithInsecure(), which the build-deps do support.
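For reference, the two dial-option styles in question look roughly like this (just a sketch, not the actual prober code; the helper function names are made up):

import (
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

// Newer style; needs a recent google.golang.org/grpc:
func dialInsecure(target string) (*grpc.ClientConn, error) {
    return grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
}

// Older, deprecated style; still works with older build-deps:
func dialInsecureLegacy(target string) (*grpc.ClientConn, error) {
    return grpc.Dial(target, grpc.WithInsecure())
}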
I think the tests are racy when executed on slower / busy systems (which Debian CI infrastructure often is). Look closely at the failure:
--- FAIL: TestGRPCConnection (0.00s)
grpc_test.go:73: GRPC probe failed
panic: Fail in goroutine after TestGRPCConnection has completed
goroutine 612 [running]:
testing.(*common).Fail(0xc00023ad00)
/usr/lib/go-1.19/src/testing/testing.go:824 +0xe5
testing.(*common).Errorf(0xc00023ad00, {0xc68a18?, 0xc000314390?}, {0xc00030dfc0?, 0x0?, 0xc000394ea0?})
/usr/lib/go-1.19/src/testing/testing.go:941 +0x65
github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection.func1()
/tmp/blackbox_exporter/prober/grpc_test.go:56 +0x6e
created by github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection
/tmp/blackbox_exporter/prober/grpc_test.go:54 +0x325
I think it is due to the inherent raciness of tests which spin up HTTP servers, TCP servers, etc. in goroutines within the same test.
s := grpc.NewServer()
healthServer := health.NewServer()
healthServer.SetServingStatus("service", grpc_health_v1.HealthCheckResponse_SERVING)
grpc_health_v1.RegisterHealthServer(s, healthServer)
go func() {
    if err := s.Serve(ln); err != nil {
        t.Errorf("failed to serve: %v", err)
        return
    }
}()
defer s.GracefulStop()
testCTX, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
registry := prometheus.NewRegistry()
result := ProbeGRPC(testCTX, "localhost:"+port,
    config.Module{Timeout: time.Second, GRPC: config.GRPCProbe{
        IPProtocolFallback: false,
    },
    }, registry, log.NewNopLogger())
if !result {
    t.Fatalf("GRPC probe failed")
}
I suspect that the grpc server in the goroutine is not ready in time, so ProbeGRPC() fails, and the deferred s.GracefulStop() is called as the test exits. However, the goroutine is still entering (or inside) the s.Serve() loop, so Serve returns an error and the goroutine's t.Errorf fires after the test has already completed... hence the panic.
It wouldn't be the first time that I've needed to add a small delay after starting such a goroutine, to give a TCP / HTTP server time to initialize before a request is fired at it.
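As a sketch of how the panic itself (though not the underlying probe failure) could be avoided: report the Serve error over a channel instead of calling t.Errorf from the goroutine, so nothing tries to fail the test after it has completed. This is only an illustration of the pattern, not a tested patch:

serveErr := make(chan error, 1)
go func() {
    // If the deferred GracefulStop wins the race and runs before this
    // goroutine reaches Serve, Serve returns grpc.ErrServerStopped.
    // Sending the error on a channel instead of calling t.Errorf here
    // means nothing fails the test after it has already completed.
    serveErr <- s.Serve(ln)
}()
defer func() {
    s.GracefulStop()
    if err := <-serveErr; err != nil && err != grpc.ErrServerStopped {
        t.Errorf("failed to serve: %v", err)
    }
}()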
Putting aside the goroutine panic for a moment, I found out why the other type of test failure occurs, e.g.
--- FAIL: TestGRPCConnection (0.00s)
grpc_test.go:73: GRPC probe failed
--- FAIL: TestGRPCTLSConnection (0.14s)
grpc_test.go:237: GRPC probe failed
--- FAIL: TestGRPCServiceNotFound (0.00s)
utils_test.go:48: Expected: probe_grpc_status_code: 5, got: probe_grpc_status_code: 0
--- FAIL: TestGRPCHealthCheckUnimplemented (0.00s)
utils_test.go:48: Expected: probe_grpc_status_code: 12, got: probe_grpc_status_code: 0
It seems that the lookup of "localhost" is failing in ProbeGRPC due to the probe module's DNS resolution options.
address localhost: no suitable address found
Setting IPProtocolFallback: true fixes that. Even trying to probe "127.0.0.1:"+port in the test fails, unless PreferredIPProtocol: "ip4" (or simply "ip") is set. Digging through the code, it seems that blackbox_exporter defaults to "ip6" for name resolution. So even if we were to probe the listener address directly (e.g. 127.0.0.1:54321), the name resolution would still fail if it were forced to treat that as IPv6.
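Concretely, a sketch of the change being described, using the same names as the quoted test code above:

result := ProbeGRPC(testCTX, "localhost:"+port,
    config.Module{Timeout: time.Second, GRPC: config.GRPCProbe{
        // Allow falling back to IPv4 when "localhost" doesn't resolve to a
        // usable IPv6 address; alternatively, setting PreferredIPProtocol
        // to "ip4" (or "ip") also makes the probe succeed here.
        IPProtocolFallback: true,
    },
    }, registry, log.NewNopLogger())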
@averzicco Out of curiosity, are you setting IPProtocolFallback: false for a specific reason? The other tests in this repo set it to true, and the blackbox_exporter config docs also state that it is true by default.
I don't yet fully understand why a little test program produces these results on Ubuntu:
net.ResolveIPAddr("ip", "localhost"): 127.0.0.1
net.ResolveIPAddr("ip4", "localhost"): 127.0.0.1
net.ResolveIPAddr("ip6", "localhost"): err: address localhost: no suitable address found
... despite this:
$ host localhost
localhost has address 127.0.0.1
localhost has IPv6 address ::1
Running the same little test program on Debian produces the expected results:
net.ResolveIPAddr("ip", "localhost"): 127.0.0.1
net.ResolveIPAddr("ip4", "localhost"): 127.0.0.1
net.ResolveIPAddr("ip6", "localhost"): ::1
I tracked Go's (and Python's) strange inability to resolve "localhost" to "::1" on Ubuntu down to the lack of an entry for it in /etc/hosts.
It seems that an out-of-the-box /etc/hosts on Ubuntu contains the following entries for IPv6:
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
... whereas a default Debian /etc/hosts contains (as per netcfg.h):
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
I'm not sure yet what generates the default /etc/hosts on Ubuntu, but I don't think it is netcfg anymore.
So, on Ubuntu, "localhost" was only resolving to "127.0.0.1" in net.Listen("tcp", "localhost:0"), and the combination of IPProtocolFallback: false with the default PreferredIPProtocol: "ip6" meant that blackbox_exporter's chooseProtocol function was failing to resolve "localhost" to an IPv6 address.
Sorry, more than a year has passed since I wrote that code and I don't remember why I used IPProtocolFallback: false in the test; maybe there was a good reason, or I just tried it that way and it worked...
I was able to reproduce it on master with make test, or via go test ./..., on my local machine (go version go1.21.4 linux/amd64), but it works fine when I run it the way we run it in CI. We run it like this in CI:
export DOCKER_TEST_IMAGE_NAME=quay.io/prometheus/golang-builder:1.20-base
# set up ipv6: add the following to `/etc/docker/daemon.json`
{
"ipv6": true,
"fixed-cidr-v6": "2001:db8:1::/64"
}
# restart docker daemon
sudo service docker restart
# run it
docker run --rm -t -v "$(pwd):/app" "${DOCKER_TEST_IMAGE_NAME}" -i github.com/prometheus/blackbox_exporter -T
This makes me think that there is something different we are doing in CI, and we need to do the same locally.
I don't have time to dig further right now, so I'm just sharing my findings.