teleport icon indicating copy to clipboard operation
teleport copied to clipboard

`TestIntegrations/Discovery` flakiness

Open ravicious opened this issue 1 year ago • 23 comments

Failure

Link(s) to logs

  • https://github.com/gravitational/teleport/actions/runs/5043488987/jobs/9045336040#step:6:2379
  • v12 https://github.com/gravitational/teleport/actions/runs/5623334769/job/15237803399
  • https://github.com/gravitational/teleport/actions/runs/7009114863/job/19066880309?pr=34961

Relevant snippet

    integration_test.go:3900: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:3900
        	            				/__w/teleport/teleport/integration/integration_test.go:114
        	Error:      	Received unexpected error:
        	            	failed connecting to host 127.0.0.1:39505: failed to dial target host
        	            		unable to fetch access point, this proxy clusterPeer(TunnelConnection(name=2f33b91b-5b99-43e7-b35e-37c32bcfec48-cluster-remote, type=, cluster=cluster-remote, proxy=2f33b91b-5b99-43e7-b35e-37c32bcfec48)) has not been discovered yet, try again later
        	Test:       	TestIntegrations/Discovery
     integration_test.go:3664: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:3664
        	            				/__w/teleport/teleport/integration/integration_test.go:109
        	Error:      	Received unexpected error:
        	            	timed out, only 0/2 events received. received: [], expected: [ProxyReverseTunnelReady ProxySSHReady]
        	Test:       	TestIntegrations/Discovery

ravicious avatar May 22 '23 08:05 ravicious

This just blocked a v13 release: https://github.com/gravitational/teleport/actions/runs/5903184778/job/16012575863

    integration_test.go:4018: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:4018
        	            				/__w/teleport/teleport/integration/integration_test.go:119
        	Error:      	Received unexpected error:
        	            	failed connecting to host 127.0.0.1:44391: failed to dial target host
        	            		unable to fetch access point, this proxy clusterPeer(TunnelConnection(name=19978ea4-1790-45e8-bb71-af0b41581184-cluster-remote, type=, cluster=cluster-remote, proxy=19978ea4-1790-45e8-bb71-af0b41581184)) has not been discovered yet, try again later
        	Test:       	TestIntegrations/Discovery

zmb3 avatar Aug 18 '23 13:08 zmb3

Another one on master: https://github.com/gravitational/teleport/actions/runs/6027048272/job/16351354131

    integration_test.go:4044: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:4044
        	            				/__w/teleport/teleport/integration/integration_test.go:125
        	Error:      	Received unexpected error:
        	            	failed connecting to host 127.0.0.1:41815: failed to receive cluster details response
        	            		failed to dial target host
        	            		unable to fetch access point, this proxy clusterPeer(TunnelConnection(name=2f6a37a0-959d-474b-9a9b-325beb70a9fc-cluster-remote, type=, cluster=cluster-remote, proxy=2f6a37a0-959d-474b-9a9b-325beb70a9fc)) has not been discovered yet, try again later
        	Test:       	TestIntegrations/Discovery

zmb3 avatar Aug 30 '23 16:08 zmb3

v14: https://github.com/gravitational/teleport/actions/runs/6087350096/job/16515717921

zmb3 avatar Sep 05 '23 17:09 zmb3

different failure on master:

https://github.com/gravitational/teleport/actions/runs/6227775190/job/16903131172?pr=31916

    integration_test.go:4048: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:4052
        	            				/opt/go/src/runtime/asm_amd64.s:1650
        	Error:      	Expected value not to be nil.
    integration_test.go:4048: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:4048
        	            				/__w/teleport/teleport/integration/integration_test.go:123
        	Error:      	Condition never satisfied
        	Test:       	TestIntegrations/Discovery

GavinFrazar avatar Sep 18 '23 21:09 GavinFrazar

I got the same failure as Gavin on v14 https://github.com/gravitational/teleport/actions/runs/6422952356/job/17440476992

nklaassen avatar Oct 05 '23 18:10 nklaassen

https://github.com/gravitational/teleport/actions/runs/6565574843/job/17834398083

    integration_test.go:4187: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:6166
        	            				/opt/go/src/runtime/asm_amd64.s:1650
        	Error:      	Expected value not to be nil.
    integration_test.go:4187: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:6162
        	            				/__w/teleport/teleport/integration/integration_test.go:4187
        	            				/__w/teleport/teleport/integration/integration_test.go:123
        	Error:      	Condition never satisfied
        	Test:       	TestIntegrations/Discovery

zmb3 avatar Oct 18 '23 20:10 zmb3

https://github.com/gravitational/teleport/actions/runs/6579308041/job/17874902654

    integration_test.go:4187: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:6166
        	            				/opt/go/src/runtime/asm_amd64.s:1650
        	Error:      	Expected value not to be nil.
    integration_test.go:4187: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:6162
        	            				/__w/teleport/teleport/integration/integration_test.go:4187
        	            				/__w/teleport/teleport/integration/integration_test.go:123
        	Error:      	Condition never satisfied
        	Test:       	TestIntegrations/Discovery

rosstimothy avatar Oct 19 '23 19:10 rosstimothy

https://github.com/gravitational/teleport/actions/runs/6615009795/job/17966244734

zmb3 avatar Oct 23 '23 15:10 zmb3

https://github.com/gravitational/teleport/actions/runs/6829295469/job/18575152180#step:6:2694

ptgott avatar Nov 10 '23 20:11 ptgott

branch/v14:

integration_test.go:4267: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:6248
        	            				/opt/go/src/runtime/asm_amd64.s:1650
        	Error:      	Expected value not to be nil.
    integration_test.go:4267: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:6244
        	            				/__w/teleport/teleport/integration/integration_test.go:4267
        	            				/__w/teleport/teleport/integration/integration_test.go:126
        	Error:      	Condition never satisfied
        	Test:       	TestIntegrations/Discovery

rosstimothy avatar Nov 22 '23 14:11 rosstimothy

I ran into a new kind of error on branch/v14:

https://github.com/gravitational/teleport/actions/runs/6960808427/job/18941067310#step:6:2122

    integration_test.go:4267: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:6254
        	            				/opt/go/src/runtime/asm_amd64.s:1650
        	Error:      	Received unexpected error:
        	            	connection error: desc = "transport: Error while dialing: failed to dial: cluster cluster-remote is offline"
    integration_test.go:4267: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:6244
        	            				/__w/teleport/teleport/integration/integration_test.go:4267
        	            				/__w/teleport/teleport/integration/integration_test.go:126
        	Error:      	Condition never satisfied
        	Test:       	TestIntegrations/Discovery

ravicious avatar Nov 23 '23 09:11 ravicious

Failure from a PR rebased with the tip of master

=== FAIL: integration TestIntegrations/Discovery (20.69s)
    integration_test.go:4144: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:4144
        	            				/__w/teleport/teleport/integration/integration_test.go:124
        	Error:      	Received unexpected error:
        	            	failed connecting to host 127.0.0.1:45901: failed to receive cluster details response
        	            		failed to dial target host
        	            		direct dialing to nodes not found in inventory is not supported
        	Test:       	TestIntegrations/Discovery
    --- FAIL: TestIntegrations/Discovery (20.69s)

=== FAIL: integration TestIntegrations (719.38s)

https://github.com/gravitational/teleport/actions/runs/7020003123/job/19098756968?pr=34457

rosstimothy avatar Nov 28 '23 14:11 rosstimothy

Error on a branch based on branch/v14, it includes the backport of the recent fix (#34996).

https://github.com/gravitational/teleport/actions/runs/7031231504/job/19132398108#step:6:3561

    integration_test.go:4233: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:4233
        	            				/__w/teleport/teleport/integration/integration_test.go:126
        	Error:      	Received unexpected error:
        	            	failed connecting to host 127.0.0.1:34309: failed to receive cluster details response
        	            		failed to dial target host
        	            		unable to fetch node watcher, this proxy clusterPeer(TunnelConnection(name=00000000-0000-0000-0000-000000000000-cluster-remote, type=, cluster=cluster-remote, proxy=00000000-0000-0000-0000-000000000000)) has not been discovered yet, try again later
        	Test:       	TestIntegrations/Discovery

ravicious avatar Nov 29 '23 11:11 ravicious

Another in the merge queue to master https://github.com/gravitational/teleport/actions/runs/7090182023/job/19296583245

=== RUN   TestIntegrations/Discovery
    integration_test.go:4172: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:6155
        	            				/opt/go/src/runtime/asm_amd64.s:1650
        	Error:      	Expected value not to be nil.
    integration_test.go:4172: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:6151
        	            				/__w/teleport/teleport/integration/integration_test.go:4172
        	            				/__w/teleport/teleport/integration/integration_test.go:126
        	Error:      	Condition never satisfied
        	Test:       	TestIntegrations/Discovery
    --- FAIL: TestIntegrations/Discovery (50.00s)

nklaassen avatar Dec 04 '23 17:12 nklaassen

https://github.com/gravitational/teleport/actions/runs/7093375371/job/19306644012

zmb3 avatar Dec 04 '23 22:12 zmb3

https://github.com/gravitational/teleport/actions/runs/7115810275/job/19372821232

zmb3 avatar Dec 06 '23 14:12 zmb3

https://github.com/gravitational/teleport/actions/runs/7115810275/job/19372821232#step:6:1122

    integration_test.go:4146: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:4146
        	            				/__w/teleport/teleport/integration/integration_test.go:126
        	Error:      	Received unexpected error:
        	            	failed connecting to host 127.0.0.1:34889: failed to receive cluster details response
        	            		failed to dial target host
        	            		direct dialing to nodes not found in inventory is not supported
        	Test:       	TestIntegrations/Discovery

ravicious avatar Dec 06 '23 14:12 ravicious

v13

https://github.com/gravitational/teleport/actions/runs/7127815036/job/19408522862#step:6:1909

     integration_test.go:4121: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:4121
        	            				/__w/teleport/teleport/integration/integration_test.go:123
        	Error:      	Received unexpected error:
        	            	failed connecting to host 127.0.0.1:36773: failed to receive cluster details response
        	            		failed to dial target host
        	            		unable to fetch node watcher, this proxy clusterPeer(TunnelConnection(name=00000000-0000-0000-0000-000000000000-cluster-remote, type=, cluster=cluster-remote, proxy=00000000-0000-0000-0000-000000000000)) has not been discovered yet, try again later
        	Test:       	TestIntegrations/Discovery

ravicious avatar Dec 07 '23 12:12 ravicious

This is an issue again. Fresh hits on master:

  • https://github.com/gravitational/teleport/actions/runs/7628286356/job/20778980336?pr=36861
  • https://github.com/gravitational/teleport/actions/runs/7628726029/job/20780483486
=== Failed
=== FAIL: integration TestIntegrations/Discovery (47.02s)
    trustedclusters.go:148: 
        	Error Trace:	/__w/teleport/teleport/integration/helpers/trustedclusters.go:159
        	            				/opt/go/src/runtime/asm_amd64.s:1650
        	Error:      	Received unexpected error:
        	            	unable to fetch client, this proxy clusterPeer(TunnelConnection(name=63cc50e2-09e3-4d28-81b2-31e0542e0311-cluster-remote, type=, cluster=cluster-remote, proxy=63cc50e2-09e3-4d28-81b2-31e0542e0311)) has not been discovered yet, try again later
        	Messages:   	cluster not yet available
    trustedclusters.go:148: 
        	Error Trace:	/__w/teleport/teleport/integration/helpers/trustedclusters.go:148
        	            				/__w/teleport/teleport/integration/integration_test.go:4172
        	            				/__w/teleport/teleport/integration/integration_test.go:126
        	Error:      	Condition never satisfied
        	Test:       	TestIntegrations/Discovery
        	Messages:   	Active tunnel connections did not reach 1 in the expected time frame 30s
    --- FAIL: TestIntegrations/Discovery (47.02s)

=== FAIL: integration TestIntegrations (763.62s)

zmb3 avatar Jan 23 '24 16:01 zmb3

Another on master, which is preventing us from getting an important build fix out.

https://github.com/gravitational/teleport/actions/runs/7718393459/job/21039553375?pr=37568

zmb3 avatar Jan 30 '24 23:01 zmb3

This failed repeatedly for the v15.0.2 release, delaying it by a couple of hours

v14 release as well: https://github.com/gravitational/teleport/actions/runs/7937403150/job/21674512929

fheinecke avatar Feb 16 '24 21:02 fheinecke

I'll prioritize fixing this next week.

espadolini avatar Feb 16 '24 22:02 espadolini

grazie mille @espadolini 🙏

zmb3 avatar Feb 16 '24 23:02 zmb3

Extending the timeout in WaitForActiveTunnelConnections and logging some timestamps revealed that it can take longer than 30 seconds for tunnels to appear, depending on how unlucky we get with how many attempts the remote cluster agentpool takes before the expected reverse tunnel server is available and how unlucky we get while it's available.

This is intended behavior, since we want reconnection attempts to gradually back off after a small initial burst - however, it's very undesirable in tests.

With a very hacky change (3a51a830a0dc0cd3e6ad816066f8048b33be6876) I haven't seen a WaitForActiveTunnelConnections take more than 0.5 seconds(!), but the cpu usage during the test can spike quite a bit (I've seen agent lease IDs in the 50+).

I'm not entirely sure of what's the least worse way to make the agent pools more "aggressive" in our tests - and I'm not even sure if we truly want to, in integration tests - perhaps we should change Teleport itself to reconnect slightly more often even in actual builds, at least for the remote cluster agent pool?

espadolini avatar Feb 23 '24 18:02 espadolini