
Errors in production: socket failures

michelson opened this issue on Oct 17, 2024

Hi there, I migrated a Rails app to Falcon some months ago.

Suddenly I've started to see strange errors which are not being reported to my error-reporting service, but users noticed that some realtime (socket) connections failed.

The logs look like this:

{"time":"2024-10-17T02:38:15+00:00","severity":"warn","oid":3028920,"pid":1,"subject":"Async::Task","annotation":"Reading HTTP/1.1 requests for Async::HTTP::Protocol::HTTP1::Server.","message":["Task may have ended with unhandled exception.","undefined method `write' for nil:NilClass"],"error":{"kind":"NoMethodError","message":"undefined method `write' for nil:NilClass","stack":"/usr/local/bundle/gems/protocol-http1-0.18.0/lib/protocol/http1/connection.rb:134:in `write_response'\n/usr/local/bundle/gems/async-http-0.63.0/lib/async/http/protocol/http1/server.rb:63:in `each'\n/usr/local/bundle/gems/async-http-0.63.0/lib/async/http/server.rb:40:in `accept'\n/usr/local/bundle/gems/async-io-1.41.0/lib/async/io/server.rb:15:in `block in accept_each'\n/usr/local/bundle/gems/async-io-1.41.0/lib/async/io/socket.rb:58:in `block in accept'\n/usr/local/bundle/gems/async-2.8.2/lib/async/task.rb:161:in `block in run'\n/usr/local/bundle/gems/async-2.8.2/lib/async/task.rb:331:in `block in schedule'\n"}}

Not sure what could be happening here, but it seems to come from the async HTTP gems.

The installed dependencies were:

    async (2.8.2)
      console (~> 1.10)
      fiber-annotation
      io-event (~> 1.1)
      timers (~> 4.1)
    async-container (0.16.13)
      async
      async-io
    async-http (0.63.0)
      async (>= 1.25)
      async-io (>= 1.28)
      async-pool (>= 0.2)
      protocol-http (~> 0.26.0)
      protocol-http1 (~> 0.18.0)
      protocol-http2 (~> 0.16.0)
      traces (>= 0.10.0)
    async-http-cache (0.4.3)
      async-http (~> 0.56)
    async-io (1.41.0)
      async
    async-pool (0.4.0)
      async (>= 1.25)

michelson avatar Oct 17 '24 02:10 michelson

Thanks for your report, I'll investigate it.

Probably a good first step would be to upgrade to the latest version of Ruby, async-http and falcon as well as all the other dependencies.

ioquatix avatar Oct 17 '24 03:10 ioquatix

When I upgraded to the latest Falcon, WEBrick started in development instead. Maybe I'm missing something.

michelson avatar Oct 17 '24 12:10 michelson

@michelson can you show me exactly what command you are running?

ioquatix avatar Oct 28 '24 09:10 ioquatix

Hi Samuel, `rails s` in development.

michelson avatar Oct 28 '24 13:10 michelson

Do you have both puma and falcon in your Gemfile.lock?

ioquatix avatar Oct 28 '24 20:10 ioquatix

No, just falcon in the Gemfile. It also happened to me on a new Rails app. Steps:

  • remove puma
  • add falcon
  • rails s → WEBrick starts

michelson avatar Oct 28 '24 23:10 michelson

Okay, let me check what's going on.

ioquatix avatar Oct 29 '24 02:10 ioquatix

samuel@aiko ~/tmp> rails new foob
      create  
...
         run  bundle install --quiet
samuel@aiko ~/tmp> cd foob
samuel@aiko ~/t/foob (main)> bundle remove puma
Removing gems from /home/samuel/tmp/foob/Gemfile
puma (>= 5.0) was removed.
samuel@aiko ~/t/foob (main)> bundle add falcon
Fetching gem metadata from https://rubygems.org/.........
Resolving dependencies...
Fetching gem metadata from https://rubygems.org/.........
Resolving dependencies...
samuel@aiko ~/t/foob (main)> bin/rails s
=> Booting Falcon v0.48.3
=> Rails 7.2.1.2 application starting in development http://localhost:3000
=> Run `bin/rails server --help` for more startup options

At least with this simple example, I could not reproduce it. Do you mind showing me your output?

ioquatix avatar Oct 29 '24 02:10 ioquatix

also seeing similar errors:

{
  "time": "2025-03-06T23:10:24+00:00",
  "severity": "warn",
  "oid": 55408,
  "pid": 19,
  "subject": "Async::Task",
  "annotation": "Reading HTTP/1.1 requests for Async::HTTP::Protocol::HTTP1::Server.",
  "message": "Task may have ended with unhandled exception.",
  "event": {
    "type": "failure",
    "root": "/app",
    "class": "Errno::EPIPE",
    "message": "Broken pipe",
    "backtrace": [
      "/usr/local/lib/ruby/3.4.0/socket.rb:461:in 'BasicSocket#__write_nonblock'",
      "/usr/local/lib/ruby/3.4.0/socket.rb:461:in 'BasicSocket#write_nonblock'",
      "/usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/buffered.rb:93:in 'IO::Stream::Buffered#syswrite'",
      "/usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/generic.rb:184:in 'IO::Stream::Generic#drain'",
      "/usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/generic.rb:196:in 'block in IO::Stream::Generic#flush'",
      "/usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/generic.rb:195:in 'Thread::Mutex#synchronize'",
      "/usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/generic.rb:195:in 'IO::Stream::Generic#flush'",
      "/usr/local/bundle/gems/protocol-http1-0.30.0/lib/protocol/http1/connection.rb:527:in 'Protocol::HTTP1::Connection#write_chunked_body'",
      "/usr/local/bundle/gems/protocol-http1-0.30.0/lib/protocol/http1/connection.rb:598:in 'Protocol::HTTP1::Connection#write_body'",
      "/usr/local/bundle/gems/async-http-0.87.0/lib/async/http/protocol/http1/server.rb:133:in 'block in Async::HTTP::Protocol::HTTP1::Server#each'",
      "/usr/local/bundle/gems/async-2.23.0/lib/async/task.rb:330:in 'Async::Task#defer_stop'",
      "/usr/local/bundle/gems/async-http-0.87.0/lib/async/http/protocol/http1/server.rb:80:in 'Async::HTTP::Protocol::HTTP1::Server#each'",
      "/usr/local/bundle/gems/async-http-0.87.0/lib/async/http/server.rb:50:in 'Async::HTTP::Server#accept'",
      "/usr/local/bundle/gems/falcon-0.51.0/lib/falcon/server.rb:57:in 'Falcon::Server#accept'",
      "/usr/local/bundle/gems/io-endpoint-0.15.2/lib/io/endpoint/wrapper.rb:216:in 'block (2 levels) in IO::Endpoint::Wrapper#accept'",
      "/usr/local/bundle/gems/async-2.23.0/lib/async/task.rb:200:in 'block in Async::Task#run'",
      "/usr/local/bundle/gems/async-2.23.0/lib/async/task.rb:438:in 'block in Async::Task#schedule'"
    ]
  }
}

These must be inbound requests. Could it be from timeouts configured on the upstream server? We have Faraday making requests from an upstream server to our Falcon server with a 15 second timeout, and requests are expected to occasionally time out due to slow requests to other servers downstream from the Falcon server.

jakeonfire avatar Mar 06 '25 23:03 jakeonfire

Huh, I came in to report the exact same thing:

{
  "time": "2025-03-11T16:07:55+00:00",
  "severity": "warn",
  "oid": 90828100,
  "pid": 388,
  "subject": "Async::Task",
  "annotation": "Reading HTTP/1.1 requests for Async::HTTP::Protocol::HTTP1::Server.",
  "message": "Task may have ended with unhandled exception.",
  "event": {
    "type": "failure",
    "root": "/root/workspace/web",
    "class": "Errno::EPIPE",
    "message": "Broken pipe",
    "backtrace": [
      "/usr/local/lib/ruby/3.3.0/socket.rb:461:in `__write_nonblock'",
      "/usr/local/lib/ruby/3.3.0/socket.rb:461:in `write_nonblock'",
      "/usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/buffered.rb:93:in `syswrite'",
      "/usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/generic.rb:184:in `drain'",
      "/usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/generic.rb:196:in `block in flush'",
      "/usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/generic.rb:195:in `synchronize'",
      "/usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/generic.rb:195:in `flush'",
      "/usr/local/bundle/gems/protocol-http1-0.30.0/lib/protocol/http1/connection.rb:527:in `write_chunked_body'",
      "/usr/local/bundle/gems/protocol-http1-0.30.0/lib/protocol/http1/connection.rb:598:in `write_body'",
      "/usr/local/bundle/gems/async-http-0.87.0/lib/async/http/protocol/http1/server.rb:133:in `block in each'",
      "/usr/local/bundle/gems/async-2.23.1/lib/async/task.rb:330:in `defer_stop'",
      "/usr/local/bundle/gems/async-http-0.87.0/lib/async/http/protocol/http1/server.rb:80:in `each'",
      "/usr/local/bundle/gems/async-http-0.87.0/lib/async/http/server.rb:50:in `accept'",
      "/usr/local/bundle/gems/falcon-0.51.1/lib/falcon/server.rb:57:in `accept'",
      "/usr/local/bundle/gems/io-endpoint-0.15.2/lib/io/endpoint/wrapper.rb:216:in `block (2 levels) in accept'",
      "/usr/local/bundle/gems/async-2.23.1/lib/async/task.rb:200:in `block in run'",
      "/usr/local/bundle/gems/async-2.23.1/lib/async/task.rb:438:in `block in schedule'"
    ]
  }
}

What we're seeing are seemingly random flurries of these errors. They don't appear to align with any kind of burst in traffic, and during these events CPU jumps to near 100% and the errors flood the logs for a few minutes. Most of the workers restart on their own and things move along until the whole thing happens again. During these events, Falcon seems to stop serving requests.

Like @jakeonfire we are also using Faraday, but not for all requests. Faraday (with the async-http adapter) is the HTTP client we're using for upstream search requests (OpenSearch).
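
For context, wiring Faraday to the async-http adapter looks roughly like the sketch below. The require path and :async_http adapter name come from the async-http-faraday gem; the URL and timeout here are purely illustrative:

require "async/http/faraday"

# Hypothetical client for an upstream search service; the 15 second timeout
# mirrors the one mentioned earlier in the thread.
SEARCH = Faraday.new("https://opensearch.internal.example", request: {timeout: 15}) do |faraday|
  faraday.adapter :async_http
end

# response = SEARCH.get("/_cluster/health")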

I've not seen any difference between Ruby 3.3.7 or 3.4.2.

@ioquatix Do you have any idea what could be going on?

travisbell avatar Mar 11 '25 16:03 travisbell

Thanks for all the logs and data, I will review this week.

ioquatix avatar Mar 11 '25 17:03 ioquatix

fwiw ours are fairly steady: Image

and there is no evidence that they are causing a bigger issue, strangely.

jakeonfire avatar Mar 11 '25 18:03 jakeonfire

We just had one of these events I was referring to. For us, it looks like this:

Image

And request times during this:

Image

travisbell avatar Mar 14 '25 16:03 travisbell

It turns out there is often also the following error during these spikes:

/usr/local/bundle/gems/async-2.23.1/lib/async/scheduler.rb:305:in 'IO::Event::Selector::EPoll#io_wait': execution expired (Faraday::TimeoutError)
	from /usr/local/bundle/gems/async-2.23.1/lib/async/scheduler.rb:305:in 'Async::Scheduler#io_wait'
	from /usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/buffered.rb:118:in 'IO#wait_readable'
	from /usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/buffered.rb:118:in 'IO::Stream::Buffered#sysread'
	from /usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/generic.rb:327:in 'IO::Stream::Generic#fill_read_buffer'
	from /usr/local/bundle/gems/io-stream-0.6.1/lib/io/stream/generic.rb:164:in 'IO::Stream::Generic#gets'
	from /usr/local/bundle/gems/protocol-http1-0.34.0/lib/protocol/http1/connection.rb:350:in 'Protocol::HTTP1::Connection#read_line?'
	from /usr/local/bundle/gems/protocol-http1-0.34.0/lib/protocol/http1/connection.rb:365:in 'Protocol::HTTP1::Connection#read_line'
	from /usr/local/bundle/gems/protocol-http1-0.34.0/lib/protocol/http1/connection.rb:424:in 'Protocol::HTTP1::Connection#read_response_line'
	from /usr/local/bundle/gems/protocol-http1-0.34.0/lib/protocol/http1/connection.rb:451:in 'Protocol::HTTP1::Connection#read_response'
	from /usr/local/bundle/gems/async-http-0.88.0/lib/async/http/protocol/http1/response.rb:15:in 'Async::HTTP::Protocol::HTTP1::Response.read'
	from /usr/local/bundle/gems/async-http-0.88.0/lib/async/http/protocol/http1/client.rb:86:in 'Async::HTTP::Protocol::HTTP1::Client#call'
	from /usr/local/bundle/gems/protocol-http-0.49.0/lib/protocol/http/request.rb:87:in 'Protocol::HTTP::Request#call'
	from /usr/local/bundle/gems/async-http-0.88.0/lib/async/http/client.rb:146:in 'Async::HTTP::Client#make_response'
	from /usr/local/bundle/gems/async-http-0.88.0/lib/async/http/client.rb:104:in 'Async::HTTP::Client#call'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/adapter.rb:205:in 'block (2 levels) in Async::HTTP::Faraday::Adapter#perform_request'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/adapter.rb:242:in 'block in Async::HTTP::Faraday::Adapter#with_timeout'
	from /usr/local/bundle/gems/async-2.23.1/lib/async/scheduler.rb:562:in 'Async::Scheduler#with_timeout'
	from /usr/local/bundle/gems/async-2.23.1/lib/async/task.rb:141:in 'Async::Task#with_timeout'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/adapter.rb:241:in 'Async::HTTP::Faraday::Adapter#with_timeout'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/adapter.rb:191:in 'block in Async::HTTP::Faraday::Adapter#perform_request'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/adapter.rb:233:in 'block (2 levels) in Async::HTTP::Faraday::Adapter#with_client'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/clients.rb:107:in 'Async::HTTP::Faraday::PersistentClients#with_client'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/clients.rb:162:in 'Async::HTTP::Faraday::PerThreadPersistentClients#with_client'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/adapter.rb:232:in 'block in Async::HTTP::Faraday::Adapter#with_client'
	from /usr/local/bundle/gems/async-2.23.1/lib/kernel/sync.rb:24:in 'Kernel#Sync'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/adapter.rb:222:in 'Async::HTTP::Faraday::Adapter#with_client'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/adapter.rb:170:in 'Async::HTTP::Faraday::Adapter#perform_request'
	from /usr/local/bundle/gems/async-http-faraday-0.21.0/lib/async/http/faraday/adapter.rb:161:in 'Async::HTTP::Faraday::Adapter#call'
	from /usr/local/bundle/gems/opentelemetry-instrumentation-faraday-0.26.0/lib/opentelemetry/instrumentation/faraday/middlewares/tracer_middleware.rb:43:in 'block (2 levels) in OpenTelemetry::Instrumentation::Faraday::Middlewares::TracerMiddleware#call'
	from /usr/local/bundle/gems/opentelemetry-api-1.5.0/lib/opentelemetry/trace/tracer.rb:37:in 'block in OpenTelemetry::Trace::Tracer#in_span'
	from /usr/local/bundle/gems/opentelemetry-api-1.5.0/lib/opentelemetry/trace.rb:70:in 'block in OpenTelemetry::Trace#with_span'
	from /usr/local/bundle/gems/opentelemetry-api-1.5.0/lib/opentelemetry/context.rb:88:in 'OpenTelemetry::Context.with_value'
	from /usr/local/bundle/gems/opentelemetry-api-1.5.0/lib/opentelemetry/trace.rb:70:in 'OpenTelemetry::Trace#with_span'
	from /usr/local/bundle/gems/opentelemetry-api-1.5.0/lib/opentelemetry/trace/tracer.rb:37:in 'OpenTelemetry::Trace::Tracer#in_span'
	from /usr/local/bundle/gems/opentelemetry-instrumentation-faraday-0.26.0/lib/opentelemetry/instrumentation/faraday/middlewares/tracer_middleware.rb:38:in 'block in OpenTelemetry::Instrumentation::Faraday::Middlewares::TracerMiddleware#call'
	from /usr/local/bundle/gems/opentelemetry-common-0.22.0/lib/opentelemetry/common/http/client_context.rb:49:in 'block in OpenTelemetry::Common::HTTP::ClientContext#with_attributes'
	from /usr/local/bundle/gems/opentelemetry-api-1.5.0/lib/opentelemetry/context.rb:88:in 'OpenTelemetry::Context.with_value'
	from /usr/local/bundle/gems/opentelemetry-common-0.22.0/lib/opentelemetry/common/http/client_context.rb:49:in 'OpenTelemetry::Common::HTTP::ClientContext#with_attributes'
	from /usr/local/bundle/gems/opentelemetry-instrumentation-faraday-0.26.0/lib/opentelemetry/instrumentation/faraday/middlewares/tracer_middleware.rb:37:in 'OpenTelemetry::Instrumentation::Faraday::Middlewares::TracerMiddleware#call'
	from /usr/local/bundle/gems/faraday-2.12.2/lib/faraday/middleware.rb:56:in 'Faraday::Middleware#call'
	from /usr/local/bundle/gems/faraday-retry-2.2.1/lib/faraday/retry/middleware.rb:153:in 'Faraday::Retry::Middleware#call'
	from /usr/local/bundle/gems/faraday-gzip-3.0.3/lib/faraday/gzip/middleware.rb:42:in 'Faraday::Gzip::Middleware#call'
	from /usr/local/bundle/gems/faraday-2.12.2/lib/faraday/rack_builder.rb:152:in 'Faraday::RackBuilder#build_response'
	from /usr/local/bundle/gems/faraday-2.12.2/lib/faraday/connection.rb:452:in 'Faraday::Connection#run_request'
	from /usr/local/bundle/gems/faraday-2.12.2/lib/faraday/connection.rb:280:in 'Faraday::Connection#post'

It's kind of hard to tell which error comes first, or if they are even truly connected, but I thought it might be helpful to share it as well.

travisbell avatar Mar 25 '25 17:03 travisbell

I think what @jakeonfire and I found while troubleshooting it further is that these errors come up when the server making the inbound request to Falcon closes the connection on a timeout. In our case we were catching external webhooks and doing an unreasonable amount of processing before responding with a 200. Since the processing was of no interest to the caller, we moved it to an async block and started responding right away. I believe the errors stopped immediately after that.
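
A rough sketch of that pattern, with a hypothetical controller and worker class (under Falcon each request already runs inside an Async task, so Async { ... } spawns a child task; a proper job queue is the more robust choice if the work must not be cut short when the connection task stops):

class WebhooksController < ApplicationController
  def create
    payload = request.raw_post

    # Kick off the slow processing in a child task and respond immediately,
    # so the caller is not kept waiting past its timeout.
    Async do
      WebhookProcessor.call(payload)
    end

    head :ok
  end
end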

artrybalko avatar Mar 26 '25 16:03 artrybalko

I've also been continuing to troubleshoot this and found out a few more things as well. I agree with @artrybalko in that we see the original error (Errno::EPIPE) when upstream clients close their connection before getting a response.

In my original post I said I couldn't trace this to a burst of requests, but I think that was incorrect. There actually are bursts; they just happen in such a small time span that I wasn't really seeing them in my metrics. When I say a short timespan, I'm talking about one second: thousands of new connections and requests within a single second.

The spikes seem to be enough to overwhelm Falcon (and now I'm thinking this is really the OS and epoll?), and at that point it's just a cascade of failures: Falcon workers restart but there's still IO contention so they aren't always coming back up, upstream load balancers close thousands of connections (before Falcon has responded to them), and it's just a mess.

Now I'm thinking this is much closer to the problem mentioned in #212.

For now, I've put Nginx in front with some burstable rate limiting (with delays) to help smooth out traffic spikes like this. So far it's worked great. Today is the first day in a month where we didn't have an issue.

@ioquatix Do I understand that Falcon's current design will basically try to serve absolutely everything that gets thrown to it, regardless of the state of the system? Even to the point of basically taking itself down? Does what I describe, and the errors I'm seeing make sense to you?

travisbell avatar Mar 27 '25 21:03 travisbell

The original issue posted by @michelson is with this code:

class Protocol::HTTP1::Connection
  # ...
  def write_response(version, status, headers, reason = Reason::DESCRIPTIONS[status])
    # Safari WebSockets break if no reason is given:
    @stream.write("#{version} #{status} #{reason}\r\n")
    
    write_headers(headers)
  end
end

In the latest version of this code, we now have a state machine which validates the behaviour:

class Protocol::HTTP1::Connection
  # ...
  def close(error = nil)
    @persistent = false
    
    if stream = @stream
      @stream = nil
      stream.close
    end
    
    unless closed?
      @state = :closed
      self.closed(error)
    end
  end
  
  # ...
  def write_response(version, status, headers, reason = Reason::DESCRIPTIONS[status])
    unless @state == :open or @state == :half_closed_remote
      raise ProtocolError, "Cannot write response in state: #{@state}!"
    end
    
    # Safari WebSockets break if no reason is given:
    @stream.write("#{version} #{status} #{reason}\r\n")
    
    write_headers(headers)
  end
end

So the original issue should now, at worst, fail with a Protocol::HTTP1::ProtocolError, and ideally the issue is actually handled correctly earlier on, as the state machine is much more comprehensive at dealing with client shenanigans.

ioquatix avatar Apr 27 '25 23:04 ioquatix

The initial issues reported by @jakeonfire and @travisbell are caused by clients disconnecting before receiving the full response body. This leads to a "broken pipe" Errno::EPIPE error when the server attempts to write to a socket that the remote end has closed.

(See the Errno::EPIPE backtrace quoted above by @travisbell.)

In more detail: An Errno::EPIPE ("Broken pipe") error simply indicates that the client (remote peer) closed the TCP connection on their side mid-response. This typically happens if the client disconnects (e.g. by sending a TCP FIN or triggering a TCP reset) before it has received the full HTTP/1.1 response. In other words, the server attempted to write to the socket, but the underlying TCP stack reported that the remote side has gone away (the socket was closed by the peer, not by the server). Such broken-pipe errors are common in real-world web environments (for example, when a user navigates away from a page early or a proxy times out a slow request) and they generally do not indicate a bug in the server.
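
The mechanism can be reproduced in isolation with plain sockets, independent of Falcon; a minimal sketch (addresses and the response text are arbitrary):

require "socket"

# Minimal reproduction of a "broken pipe": the peer closes the connection,
# then the server keeps writing to the dead socket.
server = TCPServer.new("127.0.0.1", 0)
client = TCPSocket.new("127.0.0.1", server.addr[1])
connection = server.accept

client.close # the client goes away before the response is finished

begin
  # The first write usually succeeds (it only fills the kernel buffer);
  # later writes surface the error once the peer's reset arrives.
  10.times do
    connection.write("HTTP/1.1 200 OK\r\n")
    sleep(0.01)
  end
rescue Errno::EPIPE, Errno::ECONNRESET => error
  puts "#{error.class}: #{error.message}" # e.g. Errno::EPIPE: Broken pipe
ensure
  connection.close
  server.close
end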

Do I understand that Falcon's current design will basically try to serve absolutely everything that gets thrown to it, regardless of the state of the system? Even to the point of basically taking itself down? Does what I describe, and the errors I'm seeing make sense to you?

Falcon mixes the accept call with the event loop. If the event loop gets saturated, the accept call will be invoked less frequently. This leads to the least loaded falcon worker typically accepting the inbound connection first. This generally distributes connections evenly across a single container of falcon servers. However, you'll still be depending on your load balancer to do the right thing if you have multiple containers on different network interfaces. In principle, falcon will eventually accept every connection that is given to it (so you are in principle correct), so it would be on the infrastructure to scale up once saturated. We now report a metric for this in the process title:

Image

The string is documented here. But in our case, we are interested in L=... which indicates the scheduler load. If you see this getting above 0.8 for a sustained amount of time (e.g. more than a few seconds) you may be running close to saturation.

On good hardware, Falcon can easily handle ~50,000 requests per second per event loop. So it's unlikely that Falcon itself is going to cause saturation in typical scenarios; it will be your application code that is essentially getting overloaded. In that case, we recently introduced the fiber-profiler to help detect poorly performing code which can stall the event loop. This isn't the only class of bugs, but it would be the biggest one and perhaps one of the easiest to address. Be aware that in its full tracing mode it can incur significant overhead, but if you can run it in a staging or shadow environment it will tell you which code paths are causing contention on the event loop.

It does sound like there are issues here, but it's not entirely clear that the original issue and the follow-up discussion are related. Therefore, as the original issue has been addressed, I'm going to close this issue; if you feel that you are still having problems, please feel free to open a new issue and I am happy to continue the discussion.

(Also you might be interested in https://github.com/socketry/falcon/issues/154#issuecomment-2832911337 too).

ioquatix avatar Apr 28 '25 00:04 ioquatix

Okay, I've been investigating this issue, and noticed something in at least one scenario where EPIPE was showing up: the response was 304 Not Modified, but it also included a body, which is in violation of the HTTP RFCs.

As soon as we flush the response with transfer-encoding: chunked, the remote end closes the connection because the response is invalid.

I might try to fix this, but will issue a warning too, since such a response, e.g. [304, {}, ["something"]], is invalid. I'm not sure of the exact nature of the response yet, but I thought I'd mention it here.
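
In Rack terms the difference is simply whether the body is empty; these literals are illustrative:

# A 304 Not Modified must not carry a message body (RFC 9110),
# so the Rack body should be an empty enumerable.
invalid = [304, {}, ["something"]]          # body present: peer may drop the connection
valid   = [304, {"etag" => '"abc123"'}, []] # empty body: compliant response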

ioquatix avatar Apr 29 '25 02:04 ioquatix

Tracking here: https://github.com/socketry/falcon/issues/289

ioquatix avatar Apr 29 '25 02:04 ioquatix