
"Cannot assign requested address" error in wal_acceptor

Open hlinnaka opened this issue 3 years ago • 9 comments

I got an unexpected CI failure, with this in the wal_acceptor log: https://app.circleci.com/pipelines/github/zenithdb/zenith/1303/workflows/d5d95541-50e7-41e3-8a14-5ce9e7cacddb/jobs/7372

Aug 24 19:41:57.535 INFO Invalid WAL file 000000000000000000000001.partial magic 0
Aug 24 19:41:57.535 INFO Start replication from 0/1699D10 till 0/0
Aug 24 19:41:57.540 INFO Switch to new epoch 1
Aug 24 19:41:58.532 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:41:59.534 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:00.538 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:01.541 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:02.544 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:03.547 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:04.550 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:05.553 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:06.556 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:07.890 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:08.893 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:08.894 ERRO Failed to connect to pageserver postgresql://no_user:@localhost:64000/no_db: error connecting to server: Cannot assign requested address (os error 99)
Aug 24 19:42:09.894 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'
Aug 24 19:42:09.896 ERRO Failed to connect to pageserver postgresql://no_user:@localhost:64000/no_db: error connecting to server: Cannot assign requested address (os error 99)
Aug 24 19:42:10.896 INFO requesting page server to connect to us: start postgresql://no_user:@localhost:64000/no_db callmemaybe ff4af1daab9b0ac7b6e87c2115ad20a3 fb5e67b5e2d202b0e3bca47dee1483f8 host=localhost port=54321 options='-c ztimelineid=fb5e67b5e2d202b0e3bca47dee1483f8'

hlinnaka commented on Aug 24 '21

OS error 99 is none other than EADDRNOTAVAIL, per man 2 connect:

EADDRNOTAVAIL
      (Internet  domain  sockets)  The  socket  referred to by
      sockfd had not previously been bound to an address  and,
      upon  attempting to bind it to an ephemeral port, it was
      determined that all port numbers in the  ephemeral  port
      range  are  currently  in  use.   See  the discussion of
      /proc/sys/net/ipv4/ip_local_port_range in ip(7).
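
For quick reference, Python's errno module confirms the mapping between the numeric code in the logs and the symbolic name (a trivial sketch, not from the thread):

```python
# errno 99 on Linux is EADDRNOTAVAIL; strerror gives the exact message
# that shows up in the safekeeper/pageserver logs above.
import errno
import os

print(errno.EADDRNOTAVAIL)               # 99
print(os.strerror(errno.EADDRNOTAVAIL))  # "Cannot assign requested address"
```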

funbringer commented on Aug 25 '21

Happened again: https://app.circleci.com/pipelines/github/zenithdb/zenith/1746/workflows/999057a9-bf5a-4d3e-8613-584716a2bf6e/jobs/11086

hlinnaka commented on Sep 13 '21

Hm. The first link https://app.circleci.com/pipelines/github/zenithdb/zenith/1303/workflows/d5d95541-50e7-41e3-8a14-5ce9e7cacddb/jobs/7372 points to the green 'pg_regress tests debug' job, where safekeepers are not involved, and in the second https://app.circleci.com/pipelines/github/zenithdb/zenith/1746/workflows/999057a9-bf5a-4d3e-8613-584716a2bf6e/jobs/11086 test_race_conditions failed, but I don't see Cannot assign requested address (os error 99) there. That failure is due to the safekeeper starting decoding from the wrong point (the marker for that is query handler for b"START_WAL_PUSH\0" failed: invalid xl_tot_len 0 at 0/10000A0). I also see walproposer restarts there (bgw's FATAL), which shouldn't happen in this test at all. Both things -- the decoding start point and the wasteful restarts -- have been fixed since then.

I'm not sure where 'os error 99' comes from. The only direction I have in mind is some leakage of these callmemaybe reconnections, especially when the walproposer -> safekeeper connection restarts frequently due to some other bug (e.g. the one above), but let's see this happen at least once more before speculating.

arssher commented on Sep 20 '21

I researched this issue a bit, starting by analyzing the logs in our CI pipelines. I downloaded the logs for recent failed tests and searched them for os error 99. Here's a gist with the log files containing that error.

===== ./21387_safekeeper-metrics-tests/test_restarts_under_load/pageserver.log
  Oct 20 15:15:32.085  INFO WAL streaming connection failed (error connecting to server: Cannot assign requested address (os error 99)), retrying in 1 second
  Oct 20 15:15:33.086  INFO WAL streaming connection failed (error connecting to server: Cannot assign requested address (os error 99)), retrying in 1 second
  Oct 20 15:15:34.086  INFO WAL streaming connection failed (error connecting to server: Cannot assign requested address (os error 99)), retrying in 1 second
  Oct 20 15:15:35.087  INFO WAL streaming connection failed (error connecting to server: Cannot assign requested address (os error 99)), retrying in 1 second
  Oct 20 15:15:36.088  INFO WAL streaming connection failed (error connecting to server: Cannot assign requested address (os error 99)), retrying in 1 second

Each entry contains info about the test and CI run, and also the lines in which error 99 was found (at most 5 lines per entry). More specifically:

  • 21387 is the job id in CircleCI
  • safekeeper-metrics-tests is the name of the branch
  • test_restarts_under_load is the name of the python test
  • pageserver.log is the name of the log file

As was mentioned, EADDRNOTAVAIL can happen when too many ports are in use. In our case, CI executors have net.ipv4.ip_local_port_range = 32768 60999, which is roughly 30k ports. So I wrote a script to see how many ports are used while running tests in CI. The script runs netstat every 500ms and saves the output to a netstat.stdout file, which is available as an artifact in CI. I've done this in the netstat-logs branch; here are the pipelines on CircleCI.
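
The script itself isn't included in this thread; a rough sketch of what such a sampler could look like (the 500 ms interval and the netstat.stdout file name come from the description above, the netstat flags and everything else are assumptions):

```python
# Rough sketch of the netstat sampler described above (the real script is not
# in this thread). Runs `netstat -tan` every 500 ms and appends the output,
# with a timestamp, to netstat.stdout so it can be uploaded as a CI artifact.
# The loop runs until the CI step kills the process.
import subprocess
import time
from datetime import datetime

with open("netstat.stdout", "a") as out:
    while True:
        snapshot = subprocess.run(
            ["netstat", "-tan"], capture_output=True, text=True
        ).stdout
        out.write(f"=== {datetime.now().isoformat()} ===\n{snapshot}\n")
        out.flush()
        time.sleep(0.5)
```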

In these runs with the script enabled, four jobs failed with error 99. This is how many ports were used according to the logs:

Job 20123 seems connected to a bug fixed in https://github.com/zenithdb/postgres/pull/93. In this job, pg.log has a lot of WAL record crc mismatch errors, which likely happened because streaming started from the wrong point. This made postgres execute many START_WAL_PUSH operations, which made the safekeeper spawn many request_callback threads, which in turn flooded the pageserver with callmemaybe requests, so fast that all 30k ports were used up, stuck in the TIME_WAIT state. Nevertheless, this behavior seems to be "expected", as does error 99 in this case.
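
A quick way to check how close a box is to this kind of exhaustion is to compare the number of sockets parked in TIME_WAIT against the size of the ephemeral range (a diagnostic sketch, not from the repo; it assumes `ss` from iproute2 is available on the executor):

```python
# Diagnostic sketch: count TCP sockets in TIME-WAIT and compare with the size
# of the ephemeral port range, to gauge how close the machine is to the
# exhaustion described above. This is a rough heuristic: it counts sockets to
# all destinations, while connect() only needs a unique port per destination.
import subprocess

with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())

out = subprocess.run(
    ["ss", "-tan", "state", "time-wait"],
    capture_output=True, text=True, check=True,
).stdout
time_wait = max(len(out.splitlines()) - 1, 0)  # subtract the header line

print(f"{time_wait} sockets in TIME-WAIT out of {high - low + 1} ephemeral ports")
```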

The other jobs look odd, because in them EADDRNOTAVAIL happened without all 30k ports being used. The reason for this is still unknown. I suspect that error 99 can come not only from port exhaustion but also from other failure modes inside the kernel's connect(). Another guess is that the ports can interfere with other CI jobs and 30k is the total limit for one physical machine, but I have no proof of this.

Yet another guess is that EADDRNOTAVAIL can somehow be caused by the localhost -> 127.0.0.1 resolution, and that error 99 could be fixed by replacing localhost with 127.0.0.1 in all our tests. I have no proof of this either, but we can try it if no other fix is found.
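
The localhost angle is easy to inspect (a sketch, not part of the test suite; port 5432 is just a placeholder): if the resolver returns ::1 in addition to 127.0.0.1, clients that walk all returned addresses, as libpq does, can surface an error from the IPv6 attempt.

```python
# Show what "localhost" resolves to on this machine. Two results (127.0.0.1
# and ::1) mean a client may attempt an IPv6 connection as well as the IPv4 one.
import socket

for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", 5432,
                                                    proto=socket.IPPROTO_TCP):
    print(socket.AddressFamily(family).name, sockaddr[0])
```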

petuhovskiy commented on Oct 20 '21

Another guess is that the ports can interfere with other CI jobs and 30k is the total limit for one physical machine, but I have no proof of this.

+1. We use docker, which means that those CI jobs might use the same kernel instance.

funbringer commented on Oct 26 '21

Got another failure with the same error: https://github.com/neondatabase/neon/runs/7250242872?check_suite_focus=true

2022-07-08 11:20:44.809 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 0 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16153), [Errno 99] Cannot assign requested address
2022-07-08 11:20:44.809 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute1_key65536"
2022-07-08 11:20:44.809 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:44.954 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 1 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16154), [Errno 99] Cannot assign requested address
2022-07-08 11:20:44.955 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute2_key131072"
2022-07-08 11:20:44.955 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.103 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 2 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16155), [Errno 99] Cannot assign requested address
2022-07-08 11:20:45.104 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute3_key196608"
2022-07-08 11:20:45.104 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.249 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 3 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16156), [Errno 99] Cannot assign requested address
2022-07-08 11:20:45.250 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute4_key262144"
2022-07-08 11:20:45.250 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.397 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 4 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16157), [Errno 99] Cannot assign requested address
2022-07-08 11:20:45.398 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute5_key327680"
2022-07-08 11:20:45.398 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.580 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 5 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16158), [Errno 99] Cannot assign requested address
2022-07-08 11:20:45.581 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute6_key393216"
2022-07-08 11:20:45.581 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.738 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 6 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16159), [Errno 99] Cannot assign requested address
2022-07-08 11:20:45.738 INFO [neon_fixtures.py:1212] Running command "/tmp/neon/bin/neon_local pg stop --tenant-id f48c635d39e440c0a636f188f685a5a0 bgcompute7_key458752"
2022-07-08 11:20:45.738 INFO [neon_fixtures.py:1213] Running in "/tmp/test_output/test_concurrent_computes/repo"
2022-07-08 11:20:45.898 INFO [test_wal_acceptor_async.py:338] BackgroundCompute 7 query failed: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 16160), [Errno 99] Cannot assign requested address

From a different test this time, and from Python rather than the WAL receiver. But I think it's a similar resource exhaustion issue in the CI.

hlinnaka commented on Jul 08 '22

And a few more here: https://app.circleci.com/pipelines/github/neondatabase/neon/7822/workflows/71c87bd2-5707-40d0-aae1-4fc7993d09fe/jobs/80303, see tmp/test_output/test_unavailability/repo/pgdatadirs/tenants/57391c9e525f441bb18801c0dedaaf7d/test_safekeepers_unavailability_pg_node/pg.log

That was from the CI run for the same commit as above, but I don't think there was anything in the PR that could cause that.

hlinnaka commented on Jul 08 '22

Another one, from a docs-only PR: https://app.circleci.com/pipelines/github/neondatabase/neon/8150/workflows/ef2c3e04-1585-40c3-802b-7ebe7611c673/jobs/83441/steps

hlinnaka commented on Jul 18 '22

Another one, from a docs-only PR: https://app.circleci.com/pipelines/github/neondatabase/neon/8150/workflows/ef2c3e04-1585-40c3-802b-7ebe7611c673/jobs/83441/steps

Never mind, this one was different. I was looking at the postgres log, and it contained errors like this:

2022-07-18 10:49:03.595 GMT [18267] LOG:  connecting with node localhost:17119
2022-07-18 10:49:03.595 GMT [18267] WARNING:  failed to connect to node 'localhost:17119': connection to server at "localhost" (127.0.0.1), port 17119 failed: Connection refused
		Is the server running on that host and accepting TCP/IP connections?
	connection to server at "localhost" (127.0.0.1), port 17119 failed: Connection refused
		Is the server running on that host and accepting TCP/IP connections?
	connection to server at "localhost" (::1), port 17119 failed: Cannot assign requested address
		Is the server running on that host and accepting TCP/IP connections?

But I think that was just IPv6 not working, which is OK. I'm not sure what caused the timeout in that CI run, though.
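
A minimal check for that condition (illustrative only): binding to ::1 fails with errno 99 when no interface carries the IPv6 loopback address, which is common in containers started without IPv6.

```python
# Probe whether IPv6 loopback is usable in this environment. In a container
# without IPv6, either socket() fails (EAFNOSUPPORT) or bind() fails with
# EADDRNOTAVAIL, i.e. the same "Cannot assign requested address".
import socket

try:
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    try:
        s.bind(("::1", 0))
        print("IPv6 loopback is available")
    finally:
        s.close()
except OSError as e:
    print(f"IPv6 loopback unavailable: errno {e.errno} ({e.strerror})")
```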

hlinnaka commented on Jul 18 '22

Got another example

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2710/release/3386856615/index.html#suites/82004ab4e3720b47bf78f312dabe7c55/7f6280bcd3ecab01/

LizardWizzard commented on Nov 03 '22

I think this is no longer relevant. The issue in the last comment was likely fixed by https://github.com/neondatabase/neon/pull/2984

arssher commented on Apr 19 '23