ocsigenserver
Use buffer pool to avoid large buffer allocation for connections and responses
A trivial buffer pool is used to get rid of these buffer allocations:
- per connection/receiver buffers (Ocsigen_http_com)
- internal buffer in Ocsigen_senders.File_content
- internal buffer in deflatemod
Moreover, some care is taken to avoid allocation for the last chunk of a file when the remaining data doesn't match the buffer size.
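For intuition, here is a minimal sketch of what such a pool can look like; the module and function names below are illustrative only, not the actual code in this PR. Free buffers are grouped by rounded-up size in a hashtable of stacks; acquiring pops a free buffer or allocates a new one, and the returned closure pushes the buffer back so it can be reused.

```ocaml
(* Minimal, illustrative buffer pool -- not the actual code in this PR. *)
module Buffer_pool = struct
  (* size -> stack of free buffers of that size *)
  let pools : (int, Bytes.t Stack.t) Hashtbl.t = Hashtbl.create 16

  (* round the requested size up to the next power of two *)
  let round_up n =
    let rec go p = if p >= n then p else go (p * 2) in
    go 1

  let stack_for size =
    match Hashtbl.find_opt pools size with
    | Some s -> s
    | None ->
        let s = Stack.create () in
        Hashtbl.add pools size s;
        s

  (* returns the buffer together with a closure that releases it *)
  let acquire requested =
    let size = round_up requested in
    let s = stack_for size in
    let buf = if Stack.is_empty s then Bytes.create size else Stack.pop s in
    (buf, fun () -> Stack.push buf s)
end
```

A caller would then do something like `let buf, release = Buffer_pool.acquire 4096 in ...; release ()`, calling `release` once the connection or response is done with the buffer.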
The changes are minimal, but reviews are appreciated. In particular:
- I haven't tested deflatemod yet
- I made a mistake in my first attempt at addressing the last-chunk issue (I forgot that the buffer size could be rounded up to the next power of two; see the sketch after this list), so any extra eyeballs looking for errors would help
- I had to release the response stream resources (i.e. buffers) in Ocsigen_server; it's literally 3 lines, but it's quite big conceptually
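On the last-chunk point, the rough idea (sketched below with hypothetical names, not the code in the PR) is that the amount read into a pooled buffer has to be bounded by the buffer's actual length, which may be larger than the size that was requested because of the power-of-two rounding mentioned above.

```ocaml
(* Illustrative only: fill a pooled buffer with the next chunk of a
   file.  [remaining] is the number of bytes left to send; the buffer
   may be bigger than requested (rounded up to a power of two), so the
   read is bounded by the minimum of both.  The sender then emits only
   the first [n] bytes of [buf] instead of copying them into a smaller,
   freshly allocated buffer. *)
let read_chunk fd buf ~remaining =
  let to_read = min remaining (Bytes.length buf) in
  Unix.read fd buf 0 to_read
```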
Overall, these changes make some requests up to ~25-30% faster.
Performance
Effect of buffer pools
(each pair of figures reads: buffer pool disabled with OCSIGEN_DISABLE_BUFFER_POOL -> enabled; o=150 is the GC setting used in half of the runs)
404                          12000 -> 12800
  o=150                      12900 -> 13500
normal response (20 bytes)   ~8100 -> 8900
  o=150                      ~8400 -> 8900
16kb file ->                  6721 -> 7800
  o=150                       7100 -> 7900
Effect of last-chunk optimization when serving files
PRE                           POST
----------------------        ----------------------
15kb file ->                  15kb file ->
  6200 -> 7400                  6400 -> 7950
  o=150 7000 -> 7800            o=150 7200 -> 8200
7kb file ->                   7kb file ->
  7500 -> 8550                  7450 -> 9150
  o=150 7650 -> 8700            o=150 7850 -> 9250
1kb file ->                   1kb file ->
  8100 -> 8750                  8100 -> 8750
  o=150 8400 -> 8800            o=150 8550 -> 9000
----------------------        ----------------------
Overall effect of the Lwt_chan fix (#49) + buffer pools + last-chunk optimization (the latter only affects files): compare the figures at the top and those in the POST column above against these numbers for 2.6:
2.6
--------------------------
404                  11800
  o=150              12500
20-byte response      7450
  o=150               7900
16kb file ->          5500
  o=150               6100
15kb file ->          5200
  o=150               5900
7kb file ->           6000
  o=150               6700
1kb file ->           7650
  o=150               7950
--------------------------
AFAICS it builds & installs fine w/ jenkins (certainly builds for me). Same (phony?) uninstall error as in #84.
Thanks for doing this. I haven't read the last-chunk part carefully, but the rest of the code looks OK to me. The overhead (hashtable, stacks, release closures) should be insignificant for buffers of size 1K or more.
I can install and uninstall (the buffer-pool branch) just fine locally, it is hard to say why rm fails inside the docker container.
On Mon, Aug 17, 2015 at 05:38:28AM -0700, Vasilis Papavasileiou wrote:
> Thanks for doing this. I haven't read the last-chunk part carefully, but the rest of the code looks OK to me.
Thank you for taking a look :)
> The overhead (hashtable, stacks, release closures) should be insignificant for buffers of size 1K or more.
Indeed. In this case there's a speed increase across the board for any response size (even 404s and 20-byte responses become a bit faster) -- ocsigenserver cannot know a priori that a connection will not make full use of the connection buffers, so it's got to use large ones.
For hashtables of such size (a few dozen elements at most) I've measured over 4e7 lookups/s (we only need a handful of them per request, so it's over 3 orders of magnitude away :). The closures are essentially free (with Lwt, we're building much heavier ones all the time anyway), and the only thing I didn't know how to quantify easily was the caml_modify inherent in Stack.push, but it clearly involves less work than the mark&sweeping we'd have if we were allocating 4KB buffers left and right.
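(For anyone curious, a throwaway micro-benchmark along the lines of the sketch below, with made-up setup, is enough to reproduce that order of magnitude; exact numbers will obviously vary by machine.)

```ocaml
(* Rough micro-benchmark sketch: lookups in a small Hashtbl comparable
   in size to the per-size pool table discussed above. *)
let () =
  let tbl = Hashtbl.create 16 in
  (* a few entries, keyed by power-of-two buffer sizes *)
  List.iter (fun k -> Hashtbl.add tbl k k)
    [ 1024; 2048; 4096; 8192; 16384; 32768 ];
  let iters = 10_000_000 in
  let t0 = Sys.time () in
  for _i = 1 to iters do
    ignore (Hashtbl.find tbl 4096)
  done;
  let dt = Sys.time () -. t0 in
  Printf.printf "%.2e lookups/s\n" (float_of_int iters /. dt)
```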
> I can install and uninstall (the buffer-pool branch) just fine locally, it is hard to say why rm fails inside the docker container.
I'm told it's an issue in the build script used for PRs.
Mauricio Fernández