bdns test: Expected B server to have non-zero lookup attempts
Spotted here while running tests for the Go 1.20.3 security release, but I've seen this randomly pop up before, though only on GitHub CI.
--- FAIL: TestRotateServerOnErr (0.00s)
dns_test.go:793: Expected B server to have non-zero lookup attempts
FAIL
FAIL github.com/letsencrypt/boulder/bdns 0.231s
Spotted again here.
[a:53 b:53 [2606:4700:4700::1111]:53]
--- FAIL: TestRotateServerOnErr (0.00s)
dns_test.go:793: Expected B server to have non-zero lookup attempts
FAIL
FAIL github.com/letsencrypt/boulder/bdns 0.258s
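For context, here is a rough sketch of the rotate-on-error behavior that TestRotateServerOnErr appears to exercise. None of these names come from boulder's bdns package; this is only an illustration of why a per-server attempt counter could end up at zero in a short run.

package main

import (
	"errors"
	"fmt"
)

// rotatingClient is a stand-in for a DNS client configured with several
// resolvers; on a failed lookup it rotates to the next server in the list.
type rotatingClient struct {
	servers  []string       // e.g. ["a:53", "b:53"]
	current  int            // index of the server used for the next lookup
	attempts map[string]int // per-server lookup attempt counters
}

// lookup records an attempt against the current server and rotates to the
// next server if the lookup fails.
func (c *rotatingClient) lookup(name string, fail func(server string) bool) error {
	server := c.servers[c.current]
	c.attempts[server]++
	if fail(server) {
		c.current = (c.current + 1) % len(c.servers)
		return errors.New("lookup of " + name + " failed against " + server)
	}
	return nil
}

func main() {
	c := &rotatingClient{
		servers:  []string{"a:53", "b:53"},
		attempts: map[string]int{},
	}
	alwaysFail := func(string) bool { return true }
	for i := 0; i < 4; i++ {
		_ = c.lookup("example.com", alwaysFail)
	}
	// With strictly deterministic rotation both counters end up non-zero. If
	// server selection involves any randomness or shared state across lookups,
	// "b:53" can plausibly see zero attempts in a short test run, which is the
	// flake reported above.
	fmt.Printf("a:53=%d b:53=%d\n", c.attempts["a:53"], c.attempts["b:53"])
}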
This appears related too, or at least it was a random failure I spotted yesterday.
20:37:06.724792 6 boulder-ra 3L3Eww8 [AUDIT] Certificate request - error JSON={"ID":"bwcSV40cF9ZWynwNrM3TTEOGL5006pbTVEhOiZg9wEI","Requester":1,"OrderID":1,"VerifiedFields":["subject.commonName","subjectAltName"],"NotBefore":"0001-01-01T00:00:00Z","NotAfter":"0001-01-01T00:00:00Z","RequestTime":"2022-11-24T20:37:01Z","ResponseTime":"2022-11-24T20:37:01Z","Error":"issuing precertificate: rpc error: code = Unavailable desc = last resolver error: produced zero addresses","Authorizations":{"rand.2c96a7da.xyz":{"ID":"1","ChallengeType":"dns-01"}}}
Edit: It was not related. This specific error was fixed by https://github.com/letsencrypt/boulder/pull/6916.
I was able to reproduce the initial failure after ~1000 iterations on my machine.
docker compose run boulder bash
watch -n0.23 "go test -v ./bdns -count=1 -run TestRotateServerOnErr"
I found this test failure on main just now, which seems eerily related.
Starting service boulder-ra-2
Starting service boulder-wfe2
Starting service bad-key-revoker
Starting service ocsp-responder
All servers running. Hit ^C to kill.
Traceback (most recent call last):
File "test/integration-test.py", line 206, in <module>
main()
File "test/integration-test.py", line 86, in main
setup_twenty_days_ago()
File "/boulder/test/helpers.py", line 152, in setup_twenty_days_ago
f()
File "/boulder/test/v2_integration.py", line 1429, in caa_recheck_setup
client = chisel2.make_client()
File "/boulder/test/chisel2.py", line 59, in make_client
client.net.account = client.new_account(messages.NewRegistration.from_data(email=email,
File "/usr/local/lib/python3.8/dist-packages/acme/client.py", line 62, in new_account
response = self._post(self.directory['newAccount'], new_account)
File "/usr/local/lib/python3.8/dist-packages/acme/client.py", line 366, in _post
return self.net.post(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/acme/client.py", line 739, in post
return self._post_once(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/acme/client.py", line 752, in _post_once
response = self._check_response(response, content_type=content_type)
File "/usr/local/lib/python3.8/dist-packages/acme/client.py", line 603, in _check_response
raise messages.Error.from_json(jobj)
acme.messages.Error: urn:ietf:params:acme:error:badPublicKey :: The JWS was signed by a public key the server does not support :: rpc error: code = Unavailable desc = last resolver error: produced zero addresses
That most recent test failure (urn:ietf:params:acme:error:badPublicKey :: The JWS was signed by a public key the server does not support :: rpc error: code = Unavailable desc = last resolver error: produced zero addresses) is unrelated. What's happening there is that when our gRPC client tries to connect to some gRPC server, it first tries to look it up in consul. For whatever reason, it's getting zero addresses when it tries to do that.
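For anyone hitting this later: that error string comes from grpc-go itself. When the name resolver hands the balancer an empty address list, fail-fast RPCs are failed with an Unavailable status that wraps the resolver error. Below is a minimal sketch that reproduces the message outside of boulder; the "demo" scheme, target, and method names are made up, and the manual resolver stands in for the consul-backed lookup.

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

func main() {
	// A manual resolver that reports an empty address list, standing in for a
	// consul lookup that returned nothing.
	r := manual.NewBuilderWithScheme("demo")
	r.InitialState(resolver.State{Addresses: nil})

	conn, err := grpc.Dial("demo:///some-service",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithResolvers(r),
		// round_robin is built on grpc-go's base balancer, which is where the
		// "last resolver error: ..." wording comes from.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	// The pick for this fail-fast RPC fails before anything is marshaled or
	// sent; depending on the grpc-go version the error reads roughly
	// "rpc error: code = Unavailable desc = last resolver error: produced zero addresses".
	err = conn.Invoke(ctx, "/some.Service/SomeMethod", nil, nil)
	fmt.Println(err)
}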
There's another problem here: the RPC-related message is supposed to be internal. We should keep it from reaching the user, because it's not meaningful to them.
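One possible shape for that, sketched here only as an illustration and not as how the WFE/RA actually map errors: detect gRPC status errors at the client-facing boundary, log the detail, and return a generic message.

package main

import (
	"errors"
	"log"

	"google.golang.org/grpc/status"
)

// sanitizeForClient is a hypothetical boundary helper: anything that is a
// gRPC status error gets logged in full and replaced with a generic message,
// so transport-level details never end up in an ACME problem document.
func sanitizeForClient(err error) error {
	if err == nil {
		return nil
	}
	if st, ok := status.FromError(err); ok {
		log.Printf("internal gRPC error: code=%s msg=%q", st.Code(), st.Message())
		return errors.New("internal error")
	}
	return err
}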
I was getting this consistently when starting the servers after adding new health checks in #6928.
- Increasing the polling rate of the new (service-specific) consul checks from 5s to 2s made this go away; however, it did happen a couple more times sporadically.
- Increasing the polling rate of the health service (non-service-specific) consul checks from 5s to 2s made this issue go away entirely (see the interval sketch below).
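For reference, a rough sketch of where that interval lives, expressed with the consul Go API. The service name, address, and port are placeholders, and boulder's dev environment actually sets this in its consul configuration rather than via API registration.

package main

import "github.com/hashicorp/consul/api"

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}
	// The Interval on the check is the knob discussed above: lowering it from
	// 5s to 2s means consul marks the backend healthy sooner, so the gRPC
	// resolver sees a non-empty address list sooner after startup.
	err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		Name:    "some-grpc-service",
		Address: "10.77.77.77",
		Port:    9090,
		Check: &api.AgentServiceCheck{
			GRPC:     "10.77.77.77:9090",
			Interval: "2s",
			Timeout:  "1s",
		},
	})
	if err != nil {
		panic(err)
	}
}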
You're absolutely right that we shouldn't be seeing this internal message; even if this issue is resolved by #6928, we will still need to dig into this aspect.