Falcon for large edge includes (developing/applying IO strategy)
Thank you for exploring async in Ruby! We are currently looking at raising the throughput of our download servers (if you are curious, there is a presentation about it here: https://speakerdeck.com/julik/streaming-large-files-with-ruby - see slide 16 specifically). Our current scheme looks like this:
[libCURL get to *FILE] -> [Ruby File object] -> [non-blocking sendfile + IO#wait when EAGAIN]
This is repeated multiple times to "splice" response bodies together, and it works really well, except that one response served this way consumes an entire thread. Async and fibers seem to be a good way to approach this problem, and nio4r also looks like a great option because I could leverage the ByteBuffer implementation and the IO reactor. But this is where the question arises. In the scheme we currently have (see above) there are two crucial elements: we download the upstream data into a file buffer (in reality it sits on a RAM filesystem) which is not in the Ruby heap, and we then tell the kernel to write that file into the socket that services our client on the Puma side, so again no bytes enter the Ruby heap. In practice this means we have next to no allocation overhead during streaming, regardless of the size of the workload or the number of upstream requests we need to perform. nio4r supports this usage pattern if you use raw sockets and a ByteBuffer; from what I understood in the docs it would be something like this:
require 'nio'

buf = NIO::ByteBuffer.new(24 * 1024)
until file.eof? # check for EOF etc.
  buf.read_from(file)   # if this returns 0 I could call task.yield here (unlikely)
  buf.flip
  buf.write_to(socket)  # if this returns 0 I could call task.yield here as well (very likely)
  buf.clear             # make the buffer ready for the next read
end
This would allow reuse of the ByteBuffer, and while it is not as efficient as sendfile() - it becomes a read(infd, buf, size); write(outfd, buf, size) pair - it would still let us accomplish what we need: not introducing these heaps of data into the Ruby string heap.
I have taken a peek at the beers.rb example, and if I were to reproduce it, what I could envision is something like this:
def stream_through(output_body, task, url, headers_for_upstream)
  # Obtuse URL management is not nice :-(
  uri = URI(url)
  base_uri = uri.dup
  base_uri.path = ""
  base_uri.query = nil

  endpoint = Async::HTTP::URLEndpoint.parse(base_uri.to_s)
  client = Async::HTTP::Client.new(endpoint)
  request = Async::HTTP::Request.new(client.scheme, uri.hostname, "GET", uri.request_uri, headers_for_upstream)
  response = client.call(request)

  if (200..206).cover?(response.status) && (body = response.body)
    while chunk = body.read
      output_body.write(chunk)
      task.yield # inserted for completeness' sake
    end
  end
  response.close
end
run do |env|
  known_content_length = ... # I do know it in advance
  remote_urls_to_splice = ... # as well as these

  current_task = Async::Task.current
  async_output_body = Async::HTTP::Body::Writable.new

  current_task.async do |task|
    remote_urls_to_splice.each do |url_string|
      stream_through(async_output_body, task, url_string, {})
    end
  ensure
    async_output_body.close
  end

  [200, {'Content-Length' => known_content_length}, async_output_body]
end
The problem that we have is that this will work well when the data transfer is relatively low-volume (chats, websockets etc.), but for us it will immediately blow up the Ruby heap with strings, since Async::HTTP::Body::Writable, from what I can see, is basically a "message box" (channel) for Strings. Memory use will probably be similar to what you could achieve with Rack's #each on a streaming body yielding Strings (we tried; it is immense and the application stops fitting in RAM very quickly). What I want to do instead is pass the lowest-level possible objects to the reactor and tell it: "Dear reactor, please use this fixed buffer to copy N bytes from this fd to that fd, and if there is an EAGAIN, yield and try again later". But if both my "upstream" response body and my writable server response body are message boxes for Strings, this option doesn't actually exist, right?
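To make the intent concrete, here is roughly the kind of loop I mean, sketched with plain non-blocking IO and a single reusable buffer (the names are made up, and the byteslice still allocates, so this is illustrative rather than truly copy-free):

BUF_SIZE = 64 * 1024

# Copy from one fd to another through a fixed, reused buffer, yielding back to the
# reactor instead of blocking whenever either side is not ready.
def copy_nonblocking(src_io, dst_io, task, buffer = String.new(capacity: BUF_SIZE))
  loop do
    result = src_io.read_nonblock(BUF_SIZE, buffer, exception: false)
    if result == :wait_readable
      task.yield # upstream has nothing for us yet, let other tasks run
      next
    elsif result.nil?
      break # EOF on the source
    end

    offset = 0
    while offset < buffer.bytesize
      written = dst_io.write_nonblock(buffer.byteslice(offset, buffer.bytesize - offset), exception: false)
      if written == :wait_writable
        task.yield # the client socket gave EAGAIN, try again later
      else
        offset += written
      end
    end
  end
end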
Strictly speaking - yes, I am looking for an alternative to a non-blocking splice(). I can have plain (non-SSL) upstreams if that makes the job easier, and I can also omit the "buffer to file first" step if the rest of the setup works well. Everything in the setup is strictly HTTP/1.1 at this point, and the previous implementation even used HTTP/1.0 for simplicity's sake.
So the question is, I guess - is this kind of workload a fit for falcon? Is it a good fit for nio4r? I do have the feeling that orchestrating these large-volume IO ops with Ruby should be perfectly feasible, but when I examine the examples and the involved collaborator modules all I see are Strings, Strings, Strings... (primarily in async-http). Is there maybe some kind of wrapper around the nio4r ByteBuffer that I could use as the async response body instead?..
Maybe somehow get access to the actual output socket Falcon sets up (a-la Rack hijack) and perform non-blocking IO on that socket manually via nio4r?
I believe this is intimately related to https://github.com/socketry/falcon/issues/7 among others.
Edit: or if there is something I could use to no-string-copy from the async-http client body to the writable body of my HTTP response that could work too 🤔
I think you should try it first and then see if memory usage/strings allocation is an issue. We've already done some optimisation in this area (minimising string allocations). Once you've figured out specific code paths that are causing junk to be allocated, it could be raised as an issue.
Regarding splicing from input to output, it's not protocol agnostic especially w.r.t. HTTP/2.
That being said, maybe there is room for a HTTP/1 specific code path which suits your requirements.
NIO4R byte buffer hasn't been too useful in practice, but maybe we could make that work better if we know specifically what parts aren't up to scratch.
👍 Thanks, we will do some stress testing.
Did you make any progress on this?
We did. We ran falcon with 6 worker processes, putting it behind nginx on one of our production instances (where puma used to run instead). We had to switch to a TCP socket from a unix socket that Puma uses for that.
I also had to implement a backstop for the async backpressure issue, which would otherwise destroy us - something like this:
def wait_for_queue_throughput(output_body, max_queue_items_pending, task)
  # Ok, this is a Volkswagen, but bear with me. When we are running
  # inside the test suite, our reactor will finish _first_, and only _then_ will our
  # Writable body be read in full. This means that we are going to be
  # throttling the writes, but on the other end nobody is really reading much.
  # That, in turn, means that the test will fail as the response is not
  # going to be written in full. There, I said it. This is volkswagen.
  return if 'test' == ENV['RACK_ENV']

  # ...and then see whether we can do anything
  max_waited_s = 15
  backpressure_sleep_s = 0.1
  waited_for_s = 0.0
  while output_body.pending_count > max_queue_items_pending
    LOGGER.debug { "Slow client - putting task to sleep" }
    waited_for_s += backpressure_sleep_s
    if waited_for_s > max_waited_s
      LOGGER.info { "Slow client, closing" }
      raise "Slow client, disconnecting them"
    end
    # There should be a way to wake this task when the Writable body has been read from on the other end
    task.sleep(backpressure_sleep_s)
  end

  # Let other tasks take things off the queue inside the Body::Writable
  task.yield
end
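A hypothetical call site, assuming the helper guards every chunk we push into the Writable body:

# Throttle before pushing each chunk into the Writable body (8 = max pending chunks)
wait_for_queue_throughput(async_output_body, 8, task)
async_output_body.write(chunk_from_upstream)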
To do this I had to expose the queue item count on the Body::Writable thing. The upstream we pull data from is S3 over HTTPS, but we are pulling many different objects at the same time. Since the default chunk size seems to hover around 4KB in our case, I opted for a limit of 8 items on the queue.
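Exposing that count could also be done with a small decorator instead of patching Writable itself; a sketch (CountingWritable is a made-up name):

require 'delegate'

# Hypothetical decorator: counts chunks written but not yet read, so the backpressure
# helper above can poll pending_count without touching Writable's internals.
class CountingWritable < SimpleDelegator
  def initialize(writable)
    super
    @pending = 0
  end

  def write(chunk)
    @pending += 1
    __getobj__.write(chunk)
  end

  def read
    chunk = __getobj__.read
    @pending -= 1 if chunk
    chunk
  end

  def pending_count
    @pending
  end
end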
We limited the server to the same number of clients allowed to connect as our current implementation (600) and here is what happened:

I think you were right that we needed to test this first, as the mem situation seems to be fine, we are not leaking much - at least not in a few hours we ran the test, so hats off to your buffer size choices and how you managed to reuse a string there ❤️
What we did observe:
- Way less context switches, yay!
- Doesn't seem to leak. Yay!
- We are not achieving the same throughput we used to have with the same number of clients, while we actually want to exceed it (this is a c8 EC2 instance)
- We are pegging the CPUs, and quite a bit. Most likely this became our limiting factor instead of the network bandwidth we can consume, and I suspect it is because of this pumping of strings around. In an rbspy profile of our previous implementation most of the CPU time came from running IO.select on a single writable socket, not from the actual sendfile() or IO.copy_stream, which are just a write()/read() pair if you do not use SSL. I did not have time to profile this in production yet as we could only carve out a small window to do the experiment and...
- It looks like either falcon or nginx is leaking TCP connections. When we looked at netstat we found a ton of connections from nginx downstream to CloudFront origins in TIME_WAIT state. So even though we do send Connection: close on our responses, it seems nginx still makes its connections to downstream keepalive, which is not desired. We need to tweak nginx's settings a bit since we do not feel ready exposing a naked Ruby webserver for this workload just yet.
- We also dispatch webhooks from this thing (not too many, but a few) and during this testing the webhooks were not async-enabled - it was sync Patron, so we were blocking the reactor during their dispatches
- We ping Redis during this procedure and this is not async-enabled, but Redis runs locally and is very fast so I doubt it contributes much
- Our main Rack application action, which creates the download manifests, is also not Async-enabled at this stage since that would be a bit much rewriting at once
I am intending to force nginx to do a Connection: close, and the webhook dispatch has been replaced with async-http now, so we are going for another round of tests in January. I think we will also reduce the number of processes. But it does seem I need a lower-level IO strategy for this. I am almost contemplating injecting an Async reactor into Puma on a separate thread so that we can "ship off" hijacked sockets to it. Would welcome any advice ;-)
That is really useful feedback.
We have async-redis which is pretty decent, but undergoing active development right now.
Falcon does all parsing within Ruby land so it's going to be slower than a server which implements it in C. But for many applications, the overhead is not so big.
Leaking connections seems odd. If you can make a small repro with only Falcon I'd be interested to see it because we also check for leaking sockets. The Falcon test suite is pretty small though.
There are a handful of options.
One thing which might benefit you, is the planned work for a C backend for falcon to optimise the request/response cycle on the server side. This will be an optional paid upgrade. Additionally, if you are interested, I have an open source library which is well proven for handling large numbers of requests and large amounts of data. We can shape this into a custom web server for your exact requirements and I guarantee you will achieve within a few % of the theoretical throughput of the hardware/vm.
Do you mind explaining the path you are taking through Falcon for serving content? Are you using HTTP/1.1? What are you using for the upstream request?
I will implement back pressure within the queue too - I'll try to make it in the next release. Your implementation might not be optimal.
We are using HTTP/1.1 from falcon to nginx, and HTTP/1.1 from nginx to CloudFront which is our fronting CDN. HTTP/2 is not in the picture for us at the moment. To clarify: nginx is "downstream" for falcon, CloudFront is "downstream" for nginx. Our "upstreams" (servers our Ruby app is making requests to) are S3 for the data we proxy through and a couple of small requests to our internal systems for metadata, also over HTTP/1.0. These do not egress our VPC and are tiny compared to the amount of data "put through" from S3 to downstream.
One thing which might benefit you, is the planned work for a C backend for falcon to optimise the request/response cycle on the server side.
These are interesting propositions. I did look at the business support model for falcon but I don't think we are ready to commit to it at this stage. First, we have a pretty variable workload, and though we can predict how many proxy servers we are going to run by way of capacity planning, having what is effectively a support contract for that number of servers might not be very considerate at this stage. It might also happen that we replan to use a different runtime and can then drastically reduce the number of servers, since we would be able to saturate their NICs to the maximum. Second, we obviously need to see the requisite performance materialise.
So at the moment I think contributing to the ecosystem with explorations, tests and eventual patches might be a better option, but I might be mistaken ofc.
This will be an optional paid upgrade. Additionally, if you are interested, I have an open source library which is well proven for handling large numbers of requests and large amounts of data. We can shape this into a custom web server for your exact requirements and I guarantee you will achieve within a few % of the theoretical throughput of the hardware/vm.
I am interested. There is a bit of a concern for me that probably building an entirely custom proprietary webserver might be a bad idea from the point of view of my colleagues since they also will have to support it and debug it should things go south. Let's chat ;-)
Your implementation might not be optimal.
Yes, please ❤️ The best I could find is opportunistically sleep the task for some time, I am certain it could be woken up sooner if the task is somehow coupled to the nio4r monitor.
P.S. I do believe that we could achieve this throughput if it were possible to get access to the nio4r socket objects from within falcon already tho.
https://github.com/socketry/async-http/issues/6 is now fixed. It has documentation which might help you.
Awesome!
First, we have a pretty variable workload, and though we can predict how many proxy servers we are going to run by way of capacity planning, having what is effectively a support contract for that number of servers might not be very considerate at this stage
If you can think of a better way to do this I am open to ideas.
P.S. I do believe that we could achieve this throughput if it were possible to get access to the nio4r socket objects from within falcon already tho.
Realistically, the only way to do something like this would be a partial hijack. That's not supported in Falcon at the moment. But maybe it's possible. Only full hijack is supported, and it's Rack compatible so it returns the raw underlying IO, extracted from the reactor:
https://github.com/socketry/falcon/blob/d19f4d095cb380462a4d7e1abea2d25804c10ebd/lib/falcon/adapters/rack.rb#L133-L142
Maybe there is a better way to do this, or even just expose the IO directly in env.
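For illustration, using that full hijack path from a Rack app running under Falcon would look roughly like this (a sketch; size and buffer_file are placeholders, and the response triple is ignored once the connection has been hijacked):

run do |env|
  env['rack.hijack'].call
  io = env['rack.hijack_io'] # raw underlying IO, extracted from the reactor

  io.write("HTTP/1.1 200 OK\r\nConnection: close\r\nContent-Length: #{size}\r\n\r\n")
  IO.copy_stream(buffer_file, io) # kernel-level copy, sendfile(2) where available
  io.close

  # The catch: copy_stream here is blocking, so the whole reactor stalls for the
  # duration of the copy - which is exactly the problem being discussed.
  [200, {}, []]
end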
Can you explain your ideal slow client disconnecting policy? e.g. less than x bytes/s for y minutes? or something else?
The ideal would be that if the throughput for a client stays below N bytes per second over M seconds I kick the client out. However, I can in a way "abstract this up", because my chunk size ends up pretty much always being the default non-blocking read chunk size async-http provides, so I can extrapolate from that and disconnect clients if there is no movement in the queue for that much time - which is the abstraction I have found so far. I do have an object that keeps tabs on how much data got sent over the last N seconds and I could use that object as well, but let's contemplate the queue-length indicator for a minute.
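The object I mentioned is essentially a sliding-window throughput meter; a rough sketch of the idea (hypothetical names and thresholds):

# Records how many bytes went out and when, and reports whether the average over
# the window dropped below a minimum. Gives the client one full window as a grace period.
class ThroughputGate
  def initialize(window_s: 10, min_bytes_per_s: 50 * 1024)
    @window_s = window_s
    @min_bytes_per_s = min_bytes_per_s
    @samples = [] # [monotonic timestamp, bytes] pairs
    @started_at = monotonic_now
  end

  def record(bytes)
    @samples << [monotonic_now, bytes]
  end

  def too_slow?
    now = monotonic_now
    return false if now - @started_at < @window_s
    @samples.reject! { |at, _| at < now - @window_s }
    bytes_in_window = @samples.sum { |_, bytes| bytes }
    (bytes_in_window.to_f / @window_s) < @min_bytes_per_s
  end

  private

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end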
With the implementation I had there was some measurement, because the task would resume in a polling fashion after some time. With the new LimitedQueue implementation it seems possible that, basically, the task can be stopped indefinitely if nothing starts reading from the queue, due to the use of a condition. Imagine this:
- The writing scope tries to add an item to the queue, the queue is full, a condition is created and set to notify the task to resume when something gets read from the queue. The task is "frozen" and the next task is scheduled
- Nothing performs a read and the condition never gets a signal.
- What then? The task ends up sitting in the task pool and never gets restarted, probably? 🤔
I did try a simplistic experiment like this:
it 'reactifies' do
  reactor = Async::Reactor.new
  20.times do |i|
    reactor.run do |task|
      set_timer = task.reactor.after(0.1) { $stderr.puts "task #{i} blows up" }
      set_timer.cancel if i != 3
      $stderr.puts "Hello from task #{i}"
      task.yield
    end
  end
end
That does work - task 3 does print the data. But if I raise an exception from the after block, the exception brings down the reactor (which makes sense if I understand Timers correctly, in that the timer is attached not at the task level but at the reactor level). There is also nothing quite like task.raise, which is probably a good thing since Thread#raise was long considered malpractice. But what else should be used in this case? I could manually sleep the task and do a timer comparison when it gets woken up - even if that wakes the task more often than desired, it would let me preempt the task to do the timer bookkeeping.
Basically I need "some" way to forcibly terminate a task if there is no movement on the queue for a given amount of time. Or some way to poll an object for a decision on whether the client should be disconnected or not - IMO if we poll for it once per second or even less often, the impact on the reactor won't be immense. I might be overthinking it tho...
This is something I’ve thought about for a while.
If you call socket.read should that operation block indefinitely?
I think the answer is no. Especially not by default.
There should be some logical timeout, or at least a way to specify it explicitly per socket per operation.
Does the OS provide a timeout? If I make a read operation with no data will it still be waiting 100 years later?
A minimum throughput is a similar issue. We have to be careful to design a system that is actually robust against slow clients, ideally not allocating resources in a way which makes it trivial to DoS a service/system.
Mitigations at the queue level don't prevent malicious users, because there are other, non-queue-related areas of the protocol which can cause resource exhaustion.
So what I’d like to see is a wrapper around the socket or the stream buffer which handles this for the entire system. Ideally we can specify a policy eg minimum bit rates and timeouts, and have it work across the entire system.
Yep, being able to set a timeout for each read and each write would be ideal. What I have effectively attempted with my polling solution is the write half of that. Moreover, if there is a way to set a timer that will wake and raise the fiber before calling write, and then to cancel that timer, we will have a workable solution. BTW, there is some fascinating thinking about cancelable tasks in http://joeduffyblog.com/2015/11/19/asynchronous-everything/ (in case you haven't seen it).
Having investigated a bit, would this work? Specifically, will it "override" a Condition?
TimeoutOnWrite = Struct.new(:async_task, :writable, :timeout_s) do
  def write(data)
    async_task.timeout(timeout_s) { writable.write(data) }
  end
end

body = Async::HTTP::Body::Writable.new(content_length, queue: Async::LimitedQueue.new(8))
Async::Reactor.run do |task|
  body_with_timeout = TimeoutOnWrite.new(task, body, 3) # timeout on write in 3 seconds?..
  # ...many times over, repeatedly etc.
  body_with_timeout.write(data_from_upstream)
end
Unfortunately it's not sufficient.
It needs to be in the buffering/socket layer.
Yep, tried that implementation and though the timeout does fire it brings down the reactor (and the entire falcon process as well!)
I will revert to my less-than-ideal polling implementation for now.
You need to handle the timeout.
Something like
begin
  task.timeout(timeout_s) do
    socket.write(...)
  end
rescue
  body.close
end
If that's the route you want to go down, even just temporarily, you should probably make a body wrapper with this behaviour. But as I said it's not a fully general solution.
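Such a wrapper could look roughly like this (a sketch with made-up names; which exception class the timeout raises depends on the async version you are running):

class WriteTimeoutBody
  def initialize(writable, task, timeout_s)
    @writable = writable
    @task = task
    @timeout_s = timeout_s
  end

  # Bound every write; if it times out (or fails), close the body and re-raise
  # so the task serving the response terminates instead of hanging.
  def write(chunk)
    @task.timeout(@timeout_s) { @writable.write(chunk) }
  rescue StandardError
    @writable.close
    raise
  end

  def close(*args)
    @writable.close(*args)
  end
end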
Happy 2019 @ioquatix and other socketry contributors!
We have deployed our falcon-based service as a canary and observing the results. Meanwhile I am trying to figure out where the limits are regarding the number of clients and how easy is it for falcon not to saturate the CPU but to "stuff the pipe". To that end I've implemented 3 simple "stuffer" webservers that generate one chunk of random data, and then repeatedly send it over the wire to achieve a given content-length.
To eliminate the network issues from the equation I tested over loopback for now. The results are interesting.
Go with stuffer.go all default options
julik@nanobuk stuffer (master) $ time curl -v http://localhost:9395/?bytes=5861125462 > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9395 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9395
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Connection: close
< Content-Length: 5861125462
< Date: Fri, 04 Jan 2019 12:28:46 GMT
< Content-Type: application/octet-stream
<
{ [3953 bytes data]
100 5589M 100 5589M 0 0 720M 0 0:00:07 0:00:07 --:--:-- 713M
* Closing connection 0
real 0m7.780s
user 0m2.575s
sys 0m4.542s
Falcon with async-io
julik@nanobuk stuffer (master) $ time curl -v http://localhost:9395/?bytes=5861125462 > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9395 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9395
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200
< connection: close
< server: falcon/0.19.6
< date: Fri, 04 Jan 2019 12:41:25 GMT
< content-length: 5861125462
<
{ [16261 bytes data]
100 5589M 100 5589M 0 0 257M 0 0:00:21 0:00:21 --:--:-- 260M
* Closing connection 0
real 0m21.739s
user 0m3.840s
sys 0m7.375s
julik@nanobuk stuffer (master) $
Puma with partial hijack and blocking write()
julik@nanobuk stuffer (master) $ time curl -v http://localhost:9395/?bytes=5861125462 > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying ::1...
* TCP_NODELAY set
* Connection failed
* connect to ::1 port 9395 failed: Connection refused
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9395 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9395
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Connection: close
< Content-Length: 5861125462
<
{ [16384 bytes data]
100 5589M 100 5589M 0 0 831M 0 0:00:06 0:00:06 --:--:-- 842M
* Closing connection 0
real 0m6.742s
user 0m2.361s
sys 0m4.110s
The code is in the repo here: https://github.com/julik/stuffer
Unless I have really missed something, there is roughly 3x overhead to these async bodies. Which sort of brings back my original question - is there a way, with the existing async-io model, for me to use the sockets directly and yield them back to the reactor if they would block? Or to have a minimum-size wrapper for this which would work with something like IO.copy_stream or NIO::ByteBuffer, which both expect a real fd to be returned from #to_io?
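For reference, the Puma variant above relies on Rack partial hijack and has roughly this shape (a simplified sketch, not the actual stuffer code from the repo):

run do |env|
  bytes = env['QUERY_STRING'][/bytes=(\d+)/, 1].to_i
  chunk = Random.new.bytes(64 * 1024)

  # Partial hijack: the server writes the status line and headers, then calls this
  # proc with the raw socket, and we write the body with plain blocking write(2).
  streamer = proc do |socket|
    remaining = bytes
    while remaining > 0
      slice = remaining >= chunk.bytesize ? chunk : chunk.byteslice(0, remaining)
      socket.write(slice)
      remaining -= slice.bytesize
    end
    socket.close
  end

  headers = {
    'Content-Length' => bytes.to_s,
    'Connection' => 'close',
    'rack.hijack' => streamer, # the body of the response tuple below is ignored
  }
  [200, headers, []]
end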
Without digging into it too much (dude, I'm on holiday at the beach), I did a quick test of your code vs using the LimitedQueue and got a 3x perf increase on my old MBP laptop.
Here is your current implementation:
> time curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9292 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9292
> User-Agent: curl/7.63.0
> Accept: */*
>
< HTTP/1.1 200
< server: stuffer/falcon
< connection: close
< content-length: 5861125462
<
{ [16384 bytes data]
100 5589M 100 5589M 0 0 127M 0 0:00:43 0:00:43 --:--:-- 121M
* Closing connection 0
curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null 2.36s user 3.76s system 13% cpu 43.888 total
Here is using LimitedQueue:
> time curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9292 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9292
> User-Agent: curl/7.63.0
> Accept: */*
>
< HTTP/1.1 200
< server: stuffer/falcon
< connection: close
< content-length: 5861125462
<
{ [40960 bytes data]
100 5589M 100 5589M 0 0 310M 0 0:00:18 0:00:18 --:--:-- 283M
* Closing connection 0
curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null 2.41s user 3.76s system 34% cpu 18.027 total
Without digging into it too much (dude I'm on holiday at the beach)
Man, I envy you, we are freezing here in the northern hemisphere 🥶 Enjoy your holidays ;-) I will do some experiments with the limited queue, I just need to find a stopgap measure for it so that I won't have socket starvation on the reading end (connect to us, read 1 block, then not read anything for a loooong time, all the while keeping the writing task asleep).
I think a timeout at the socket level makes sense for almost all protocols.
I don't even know if the timeout can be reset, but something like this would be nice:
Task.timeout(60) do |timeout|
  body.each do |chunk|
    @stream.write(chunk)
    timeout.reset
  end
end
I wonder if @stream.write should take a flush option too, might minimise the overhead of writing many chunks.
Ultimately the implementation might be better at a lower level. I'll think about it.
@julik I'm playing around with some timeout/throughput concerns.
- Do you think it makes sense to have a per-socket timeout for any/all async operations? e.g. connect, send, recv, read, write, etc.
- Do you think it makes sense to track throughput and disconnect if less than some minimum?
- Do you have any other ideas about how this should work?
I'm thinking that socket timeouts for async operations belong directly in the socket wrapper and apply to all operations.
I also think there are higher level concerns about what constitutes a bad client... but not sure how generally these can be applied or if they should be protocol specific.
I know that slow clients taking part in a DoS might continue to consume 1 byte per second. But does that really matter? If you put low-throughput logic in place to disconnect sockets, DoS clients can simply consume just above that watermark. So, can such an approach really protect against malicious clients, or are we just trying to disconnect users who have broken the request upstream somehow (i.e. stopped reading the response)?
The other question is, should we have a timeout by default? It seems a bit silly to me that sockets can block code indefinitely.
TL;DR:
Do you think it makes sense to have per-socket timeout for any/all async operations? e.g. connect, send, recv, read, write, etc.
Yes.
Do you think it makes sense to track throughput and disconnect if less than some minimum?
Yes, or provide a way to do so (hook into NIO)
Do you have any other ideas about how this should work?
At the minimum - two flags on the falcon executable that would set minimum throughput barriers for reading and writing. They could be default-"off" but you do need them.
All good questions. It is ofc to be debated whether it is possible to protect against both slow loris and slow read attacks completely. You could say that it is impossible, just as it is very hard to protect from a high-volume attack. But just like bike locks I think making the attacks less convenient to carry out is a good step to take. Another reason why I think this is especially relevant for Falcon is that from what I understand Falcon is aiming to be the webserver, without a fronting downstream proxy like nginx - which in today's setups generally does a good job of dealing with these attack types. But falcon is also supposed to do SSL termination from what I understand (because HTTP/2 and all), and in general it seems it is aiming to become the server for a machine providing a Ruby web application.
So IMO setting at least _basic_ limits to protect people from slow HTTP attacks is in scope for falcon, yes. How it should be configurable I don't know, but I would say a certain number of bytes must flow through the pipe per second over that many seconds (a window average). If this transfer rate is not maintained, the client should be forcibly disconnected. This applies both to reading the HTTP request (slow loris attack) and to writing the response (slow read attack). So if you ask me, IMO yes, you do need a timeout by default, at least when you are not explicitly in websocket mode where a connection might be sleeping for minutes on end. I am not aware of attacks with "slow connect" but probably there are some 🤷♀️
I believe puma does not have slow loris protection but it reads the request using its own IO reactor, so it probably relies on the "we can handle many many clients" property of IO reactors for this. For writing Puma is susceptible to slow read as one response consumes a thread. It is probably less severe for falcon due to the intrinsic fact that falcon is one big IO reactor but the max fd limit on the server does become a concern.
That is the "transport" end of the problem, for which there probably should be configurable timeouts on the webserver level (maybe even config options for falcon itself).
In terms of IO - yes, I do believe you want configurable timeouts for all reads and writes, simply because if you do not have them in, say, your HTTP client, it means you can only make requests to trusted HTTP servers, as you have to assume the endpoint will not "hang you up" indefinitely. It is less of a problem with HTTP endpoints being "adversarial" (it can be if you do web scraping, for example - there it is a concern!), it can be a problem with endpoints being badly coded. For example, there is an SSE endpoint in LaunchDarkly which is currently used via async-http. It is designed to send a "ping" message every now and then to keep the connection alive - and that is all good as long as it works. But what if it just gives you an EAGAIN once and does not come up in the NIO reactor monitor list for 2 hours after? The calling code currently has to manage this and arrange reconnects, if I'm not mistaken. Maybe it is even a feature that belongs in NIO.
For our uses without async-io we opted for configuring libCURL with certain timeouts we know are sane for our uses, and we use both the connect timeouts and the throughput gate (the endpoint must furnish that many bytes within that much time otherwise we bail out).
Regarding the service I am testing falcon on - it is more of an issue protecting from homegrown hand-rolled download managers that open a connection for a bytes=..-.. range of content but do not read it, or do not start reading it in time, or opportunistically open many connections using Range headers in the hopes that they will obtain better download speeds that way (which they won't, but they do consume a connection per range).
I also think there are higher level concerns about what constitutes a bad client... but not sure how generally these can be applied or if they should be protocol specific.
I don't know. I do feel that if there is, intrinsically, a pair of objects servicing a particular client (task + writable) there should be a way to exercise at least "some" push control over the reading client socket - to use these objects to "kick" the client out. If these objects wait on a condition variable for the client to have done something in the first place (if it is always reactive) then this becomes pretty hard to do.
With the current setup what bothers me the most is that I don't know whether falcon will time out a socket in a slow read situation, and if it will - where do I configure the timeout. Cause I sure do need that feature (or I need to investigate the bizarrio nginx documentation to figure out all the options that will, at the same time, protect me from these attacks and not buffer too much response along the way).
For later - here is how many of these "enforce throughput" exceptions we have noticed since we started running the test code (last Thursday):

I am going to try to integrate this with the limited queue in the meantime, by putting a few conditionals around the writes to the body.
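What I have in mind is roughly this shape (a sketch; ThroughputGate is the hypothetical sliding-window meter sketched earlier, and the remaining gap is that a write which parks on the queue's condition is still not bounded in time):

body = Async::HTTP::Body::Writable.new(content_length, queue: Async::LimitedQueue.new(8))
gate = ThroughputGate.new(window_s: 10, min_bytes_per_s: 50 * 1024)

upstream_chunks.each do |chunk|
  raise "Client is reading too slowly, disconnecting" if gate.too_slow?
  body.write(chunk) # parks this task on the queue's condition when 8 chunks are pending
  gate.record(chunk.bytesize)
end
body.close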
I've released async v1.13.0 which changed the internal timeout API to raise the Async::TimeoutError exception and I'm preparing an update to async-io which includes a per-socket timeout_duration:
https://github.com/socketry/async-io/blob/e9e7c268324002dc9e4db0f18a93bc4a0a26b38b/spec/async/io/socket_spec.rb#L87-L100
I'm not sure where the best place to set timeout_duration is, but perhaps the endpoint or accept_each could do it as an option.
This should catch connections that simply stop responding.
It won't catch connections that are maliciously slow though.
For that, we need more advanced throughput calculations.
@julik thanks so much for all your detailed feedback, it's really really awesome.