
bdns test: Expected B server to have non-zero lookup attempts

pgporada opened this issue 2 years ago • 6 comments

Spotted here while running tests for the Go 1.20.3 security release; I've seen this pop up randomly before, but only on GitHub CI.

--- FAIL: TestRotateServerOnErr (0.00s)
    dns_test.go:793: Expected B server to have non-zero lookup attempts
FAIL
FAIL	github.com/letsencrypt/boulder/bdns	0.231s
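For context, the test exercises the resolver's rotate-on-error behaviour: when a lookup against one configured DNS server fails, the client is expected to move on to another server, so both the A and B test servers should record at least one attempt. Below is a minimal sketch of that idea; the names and logic are hypothetical, not the actual bdns implementation, and if the real client picks servers nondeterministically rather than strictly in order, a run could occasionally finish without ever touching B, which would explain an intermittent failure like this one.

    // Hypothetical sketch of rotate-on-error, not boulder's bdns code: try each
    // configured server in turn, record an attempt per server, and move on when
    // a lookup fails. With server A always failing, B ends up with >= 1 attempt.
    package main

    import (
        "errors"
        "fmt"
    )

    func lookupWithRotation(servers []string, attempts map[string]int, query func(server string) error) error {
        var lastErr error
        for _, s := range servers {
            attempts[s]++ // per-server attempt counter, analogous to what the test asserts on
            if err := query(s); err != nil {
                lastErr = err
                continue // rotate to the next server on error
            }
            return nil
        }
        return lastErr
    }

    func main() {
        attempts := map[string]int{}
        servers := []string{"a:53", "b:53"}
        // Simulate server A always failing so the client must rotate to B.
        _ = lookupWithRotation(servers, attempts, func(server string) error {
            if server == "a:53" {
                return errors.New("SERVFAIL")
            }
            return nil
        })
        fmt.Println(attempts) // map[a:53:1 b:53:1] -- B has non-zero attempts
    }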

pgporada commented Apr 04 '23 20:04

Spotted again here.

 [a:53 b:53 [2606:4700:4700::1111]:53]
--- FAIL: TestRotateServerOnErr (0.00s)
    dns_test.go:793: Expected B server to have non-zero lookup attempts
FAIL
FAIL	github.com/letsencrypt/boulder/bdns	0.258s

pgporada commented May 23 '23 19:05

This appears related too, or at least it was a random failure I spotted yesterday.

20:37:06.724792 6 boulder-ra 3L3Eww8 [AUDIT] Certificate request - error JSON={"ID":"bwcSV40cF9ZWynwNrM3TTEOGL5006pbTVEhOiZg9wEI","Requester":1,"OrderID":1,"VerifiedFields":["subject.commonName","subjectAltName"],"NotBefore":"0001-01-01T00:00:00Z","NotAfter":"0001-01-01T00:00:00Z","RequestTime":"2022-11-24T20:37:01Z","ResponseTime":"2022-11-24T20:37:01Z","Error":"issuing precertificate: rpc error: code = Unavailable desc = last resolver error: produced zero addresses","Authorizations":{"rand.2c96a7da.xyz":{"ID":"1","ChallengeType":"dns-01"}}}

Edit: It was not related. This specific error was fixed by https://github.com/letsencrypt/boulder/pull/6916.

pgporada commented May 23 '23 20:05

I was able to reproduce the initial failure after ~1000 iterations on my machine.

docker compose run boulder bash
watch -n0.23 "go test -v ./bdns -count=1 -run TestRotateServerOnErr"
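For anyone else trying to reproduce: running many iterations inside a single test binary may also shake it out, without the outer watch loop:

go test -v ./bdns -count=1000 -run TestRotateServerOnErr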

pgporada commented May 23 '23 21:05

I found this test failure on main just now, which seems eerily related.

 Starting service boulder-ra-2
Starting service boulder-wfe2
Starting service bad-key-revoker
Starting service ocsp-responder
All servers running. Hit ^C to kill.
Traceback (most recent call last):
  File "test/integration-test.py", line 206, in <module>
    main()
  File "test/integration-test.py", line 86, in main
    setup_twenty_days_ago()
  File "/boulder/test/helpers.py", line 152, in setup_twenty_days_ago
    f()
  File "/boulder/test/v2_integration.py", line 1429, in caa_recheck_setup
    client = chisel2.make_client()
  File "/boulder/test/chisel2.py", line 59, in make_client
    client.net.account = client.new_account(messages.NewRegistration.from_data(email=email,
  File "/usr/local/lib/python3.8/dist-packages/acme/client.py", line 62, in new_account
    response = self._post(self.directory['newAccount'], new_account)
  File "/usr/local/lib/python3.8/dist-packages/acme/client.py", line 366, in _post
    return self.net.post(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/acme/client.py", line 739, in post
    return self._post_once(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/acme/client.py", line 752, in _post_once
    response = self._check_response(response, content_type=content_type)
  File "/usr/local/lib/python3.8/dist-packages/acme/client.py", line 603, in _check_response
    raise messages.Error.from_json(jobj)
acme.messages.Error: urn:ietf:params:acme:error:badPublicKey :: The JWS was signed by a public key the server does not support :: rpc error: code = Unavailable desc = last resolver error: produced zero addresses

pgporada commented Jun 02 '23 19:06

That most recent test failure (urn:ietf:params:acme:error:badPublicKey :: The JWS was signed by a public key the server does not support :: rpc error: code = Unavailable desc = last resolver error: produced zero addresses) is unrelated. What's happening there is that when our gRPC client tries to connect to a gRPC server, it first looks that server up in consul, and for whatever reason that lookup is returning zero addresses.
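A quick way to see what the client's service discovery would have gotten back is to issue the same kind of SRV query against consul's DNS interface by hand. This is only a diagnostic sketch; the consul address and the service name below are assumptions for illustration, not necessarily what boulder actually queries:

    // Diagnostic sketch: ask consul's DNS interface for SRV records the way a
    // gRPC resolver might, and report when the answer comes back empty.
    // "consul:8600" and "ca.service.consul" are assumed names, not boulder config.
    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    func main() {
        r := &net.Resolver{
            PreferGo: true,
            Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
                // Consul's DNS interface conventionally listens on port 8600.
                return (&net.Dialer{Timeout: time.Second}).DialContext(ctx, network, "consul:8600")
            },
        }
        _, srvs, err := r.LookupSRV(context.Background(), "", "", "ca.service.consul")
        if err != nil {
            fmt.Println("SRV lookup failed:", err)
            return
        }
        if len(srvs) == 0 {
            // With zero backends, the gRPC balancer surfaces "produced zero addresses".
            fmt.Println("consul returned zero addresses for the service")
            return
        }
        for _, s := range srvs {
            fmt.Printf("%s:%d\n", s.Target, s.Port)
        }
    }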

There's another problem here: the rpc-related message is supposed to stay internal. We should prevent it from reaching the user, because it's not meaningful to them.
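One possible shape for that fix, sketched only and not boulder's actual error-handling path: check the gRPC status code at the boundary where the ACME response is built, and replace transport-level failures with a generic message before they reach the client.

    // Sketch: strip internal gRPC detail before it reaches an ACME client.
    // Illustrative only; function and message names are made up for this example.
    package main

    import (
        "errors"
        "fmt"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // sanitizeRPCError hides backend detail for transport-level failures.
    func sanitizeRPCError(err error) error {
        if err == nil {
            return nil
        }
        if st, ok := status.FromError(err); ok && st.Code() == codes.Unavailable {
            return errors.New("service temporarily unavailable")
        }
        return err
    }

    func main() {
        upstream := status.Error(codes.Unavailable, "last resolver error: produced zero addresses")
        fmt.Println(sanitizeRPCError(upstream)) // service temporarily unavailable
    }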

jsha commented Jun 02 '23 20:06

I was getting this consistently when starting servers after adding new health checks in #6928.

  1. Shortening the polling interval of the new (service-specific) consul checks from 5s to 2s mostly made this go away, but it still happened sporadically a couple more times.

  2. Shortening the polling interval of the health service (non-service-specific) consul checks from 5s to 2s made this issue go away entirely.

  3. You're absolutely right that we shouldn't be seeing this internal message; even if this issue is resolved by #6928, we will still need to dig into this aspect.

beautifulentropy commented Jun 08 '23 21:06