"Cannot assign requested address" error in wal_acceptor
I got an unexpected CI failure, with this in the wal_acceptor log: https://app.circleci.com/pipelines/github/zenithdb/zenith/1303/workflows/d5d95541-50e7-41e3-8a14-5ce9e7cacddb/jobs/7372
Aug 24 19:41:57.535 INFO Invalid WAL file 000000000000000000000001.partial magic 0
Aug 24 19:41:57.535 INFO Start replication from 0/1699D10 till 0/0
Aug 24 19:41:57.540 INFO Switch to new epoch 1
Aug 24 19:41:58.532 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:41:59.534 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:00.538 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:01.541 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:02.544 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:03.547 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:04.550 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:05.553 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:06.556 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:07.890 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:08.893 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:08.894 ERRO Failed to connect to pageserver postgresql://no_user:@localhost:64000/no_db: error connecting to server: Cannot assign requested address (os error 99)
Aug 24 19:42:09.894 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:09.896 ERRO Failed to connect to pageserver postgresql://no_user:@localhost:64000/no_db: error connecting to server: Cannot assign requested address (os error 99)
Aug 24 19:42:10.896 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
OS error 99 is none other than EADDRNOTAVAIL, per man 2 connect:
EADDRNOTAVAIL
(Internet domain sockets) The socket referred to by
sockfd had not previously been bound to an address and,
upon attempting to bind it to an ephemeral port, it was
determined that all port numbers in the ephemeral port
range are currently in use. See the discussion of
/proc/sys/net/ipv4/ip_local_port_range in ip(7).
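As a quick sanity check, the ephemeral range that connect() draws from can be read straight from procfs. A minimal sketch, assuming a Linux CI executor (not part of our test suite):

```python
# Print the ephemeral port range that connect() uses when the socket is not
# explicitly bound. If all (high - low + 1) ports are in use or in TIME_WAIT,
# connect() fails with EADDRNOTAVAIL (os error 99).
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())
print(f"ephemeral port range: {low}-{high} ({high - low + 1} ports)")
```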
Happened again: https://app.circleci.com/pipelines/github/zenithdb/zenith/1746/workflows/999057a9-bf5a-4d3e-8613-584716a2bf6e/jobs/11086
Hm. The first link, https://app.circleci.com/pipelines/github/zenithdb/zenith/1303/workflows/d5d95541-50e7-41e3-8a14-5ce9e7cacddb/jobs/7372, points to the green 'pg_regress tests debug' job where safekeepers are not involved at all. In the second one, https://app.circleci.com/pipelines/github/zenithdb/zenith/1746/workflows/999057a9-bf5a-4d3e-8613-584716a2bf6e/jobs/11086, test_race_conditions failed, but I don't see Cannot assign requested address (os error 99) there. That failure is due to starting decoding on the safekeeper from the wrong point (the marker for that is query handler for b"START_WAL_PUSH\0" failed: invalid xl_tot_len 0 at 0/10000A0). I also see walproposer restarts there (bgw's FATAL), which shouldn't happen in this test at all. Both things -- the decoding start point and the wasteful restarts -- have been fixed since then.
I'm not sure where 'os error 99' comes from. The only direction I have in mind is some leakage of these callmemaybe reconnections, especially when the walproposer -> safekeeper connection restarts often due to some other bug (e.g. the one above), but let's see it happen at least once more before speculating further.
I researched this issue a bit, starting from analyzing logs in our CI pipelines. I downloaded the logs for recent failed tests and searched for os error 99 in them. Here's a gist with the log files containing that error.
===== ./21387_safekeeper-metrics-tests/test_restarts_under_load/pageserver.log
Oct 20 15:15:32.085 INFO WAL streaming connection failed (error connecting to server: Cannot assign requested address (os error 99)), retrying in 1 second
Oct 20 15:15:33.086 INFO WAL streaming connection failed (error connecting to server: Cannot assign requested address (os error 99)), retrying in 1 second
Oct 20 15:15:34.086 INFO WAL streaming connection failed (error connecting to server: Cannot assign requested address (os error 99)), retrying in 1 second
Oct 20 15:15:35.087 INFO WAL streaming connection failed (error connecting to server: Cannot assign requested address (os error 99)), retrying in 1 second
Oct 20 15:15:36.088 INFO WAL streaming connection failed (error connecting to server: Cannot assign requested address (os error 99)), retrying in 1 second
These logs contain info about the test and CI run, and also the lines in which error 99 was found (at most 5 lines). More specifically:
- 21387 is the job id in CircleCI
- safekeeper-metrics-tests is the name of the branch
- test_restarts_under_load is the name of the Python test
- pageserver.log is the name of the log file
As was mentioned, EADDRNOTAVAIL can happen when too many ports are in use. In our case, CI executors have net.ipv4.ip_local_port_range = 32768 60999, which is roughly 30k ports. So I wrote a script to see how many ports are used while running tests in CI. The script runs netstat every 500ms and saves the output to a netstat.stdout file, which is available as an artifact in CI. I've done this in the netstat-logs branch; here are the pipelines on CircleCI.
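The script itself isn't reproduced here, but the idea is roughly the following. This is only a sketch, not the actual code from the netstat-logs branch; the 500 ms interval and the netstat.stdout file name are taken from the description above:

```python
# Rough sketch of the sampling loop: dump netstat output every 500 ms into
# netstat.stdout so it can be collected as a CI artifact afterwards.
import subprocess
import time

with open("netstat.stdout", "a") as out:
    while True:
        result = subprocess.run(["netstat", "-tan"], capture_output=True, text=True)
        out.write(result.stdout)
        out.flush()
        time.sleep(0.5)
```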
In these runs with the script enabled, four jobs failed with error 99. This is how many ports were in use according to the logs:
- Job 20123, max 33438 tcp ports
- Job 20125, max 763 tcp ports
- Job 20145, max 14288 tcp ports
- Job 20149, max 820 tcp ports
Job 20123 seems connected to a bug fixed in https://github.com/zenithdb/postgres/pull/93. In this job, pg.log has a lot of WAL record crc mismatch errors, which likely happened because streaming started from the wrong point. This made postgres execute many START_WAL_PUSH operations, which made the safekeeper spawn many request_callback threads, which in turn made the safekeeper flood the pageserver with callmemaybe requests, so fast that all 30k ports were used up, stuck in the TIME_WAIT state. Nevertheless, this behavior seems to be "expected", as does error 99 in this case.
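For reference, this is roughly how one can see, at a given moment, how many local sockets are stuck in TIME_WAIT. Again just a sketch, assuming Linux with netstat available; not part of the test suite:

```python
# Count TCP sockets per state; in the scenario above TIME_WAIT would dominate
# and approach the ~30k ephemeral-port limit.
import subprocess
from collections import Counter

output = subprocess.run(["netstat", "-tan"], capture_output=True, text=True).stdout
states = Counter(line.split()[-1] for line in output.splitlines() if line.startswith("tcp"))
print(states)  # e.g. Counter({'TIME_WAIT': 29500, 'ESTABLISHED': 40, ...})
```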
Other jobs look odd, because in them EADDRNOTAVAIL happens without all 30k ports being used. The reasons for this are still unknown. I suspect that error 99 can happen not only from port exhaustion, but also from other errors inside the kernel's connect(). Another guess is that the ports can interfere with other CI jobs and 30k is the total limit for one physical machine, but I have no proof of this. My other guess is that EADDRNOTAVAIL can somehow be caused by the localhost -> 127.0.0.1 resolution, and that error 99 can be fixed by replacing localhost with 127.0.0.1 in all our tests. I have no proof of this either, but we can try it if no other fix is found.
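To illustrate the localhost guess: "localhost" can resolve to both ::1 and 127.0.0.1, and a client that tries the IPv6 address first may fail even though the IPv4 one would succeed. A small sketch (the port 5432 is arbitrary, chosen just for illustration):

```python
# Print every address "localhost" resolves to; if ::1 comes first, a client may
# try IPv6 and fail with EADDRNOTAVAIL before ever reaching 127.0.0.1.
import socket

for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", 5432, proto=socket.IPPROTO_TCP):
    print(family, sockaddr)
# Hardcoding "127.0.0.1" in the tests would skip the ::1 attempt entirely.
```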
> Another guess is that the ports can interfere with other CI jobs and 30k is the total limit for one physical machine, but I have no proof of this.
+1. We use docker, which means that those CI jobs might use the same kernel instance.
Got another failure with the same error: https://github.com/neondatabase/neon/runs/7250242872?check_suite_focus=true
2022-07-08 11:20:44.809 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 0 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16153), [Errno 99] Cannot assign requested address
2022-07-08 11:20:44.809 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute1_key65536"
2022-07-08 11:20:44.809 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:44.954 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 1 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16154), [Errno 99] Cannot assign requested address
2022-07-08 11:20:44.955 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute2_key131072"
2022-07-08 11:20:44.955 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.103 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 2 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16155), [Errno 99] Cannot assign requested address
2022-07-08 11:20:45.104 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute3_key196608"
2022-07-08 11:20:45.104 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.249 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 3 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16156), [Errno 99] Cannot assign requested address
2022-07-08 11:20:45.250 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute4_key262144"
2022-07-08 11:20:45.250 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.397 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 4 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16157), [Errno 99] Cannot assign requested address
2022-07-08 11:20:45.398 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute5_key327680"
2022-07-08 11:20:45.398 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.580 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 5 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16158), [Errno 99] Cannot assign requested address
2022-07-08 11:20:45.581 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute6_key393216"
2022-07-08 11:20:45.581 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.738 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 6 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16159), [Errno 99] Cannot assign requested address
2022-07-08 11:20:45.738 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute7_key458752"
2022-07-08 11:20:45.738 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.898 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 7 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16160), [Errno 99] Cannot assign requested address
From a different test this time, and from Python rather than the WAL receiver. But I think it's a similar resource exhaustion issue in the CI.
And a few more here: https://app.circleci.com/pipelines/github/neondatabase/neon/7822/workflows/71c87bd2-5707-40d0-aae1-4fc7993d09fe/jobs/80303, see tmp/test_output/test_unavailability/repo/pgdatadirs/tenants/57391c9e525f441bb18801c0dedaaf7d/test_safekeepers_unavailability_pg_node/pg.log
That was from the CI run for the same commit as above, but I don't think there was anything in the PR that could cause that.
Another one, from a docs-only PR: https://app.circleci.com/pipelines/github/neondatabase/neon/8150/workflows/ef2c3e04-1585-40c3-802b-7ebe7611c673/jobs/83441/steps
Never mind, this was different. I was looking at the postgres log, and it contained errors like this:
2022-07-18 10:49:03.595 GMT [18267] LOG: connecting with node localhost:17119
2022-07-18 10:49:03.595 GMT [18267] WARNING: failed to connect to node 'localhost:17119': connection to server at "localhost" (127.0.0.1), port 17119 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 17119 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (::1), port 17119 failed: Cannot assign requested address
Is the server running on that host and accepting TCP/IP connections?
But I think that was just IPv6 not working, which is OK. I'm not sure what caused the timeout in that CI run, though.
Got another example
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2710/release/3386856615/index.html#suites/82004ab4e3720b47bf78f312dabe7c55/7f6280bcd3ecab01/
I think this is no longer relevant. The issue in the last comment is likely fixed by https://github.com/neondatabase/neon/pull/2984