faraday
Refactor dequeue to return option
When compiling to JS (via js_of_ocaml), local exceptions used for control flow become very expensive (on V8, at least). This PR tries to alleviate that in some of the places where functions with local exceptions showed up at the top of a Node.js profile.
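For reference, a minimal sketch of the two styles on a toy queue (in faraday the queue is the Deque module; the int list ref here is just a stand-in):

(* Before: the empty case is signalled with a local exception; js_of_ocaml
   turns the [try ... with] at every call site into a JS try/catch, which
   V8 handles slowly. *)
exception Dequeue_empty

let dequeue_exn q =
  match !q with
  | [] -> raise Dequeue_empty
  | x :: rest -> q := rest; x

(* After (this PR): the empty case is an ordinary branch that callers
   pattern-match on, at the cost of one minor-heap [Some _] allocation
   per successful dequeue. *)
let dequeue q =
  match !q with
  | [] -> None
  | x :: rest -> q := rest; Some x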
Hm, isn't that option nearly guaranteed not to escape the minor heap? In that case the allocation costs only about as much as an addition to a machine word in memory (which will likely be in the CPU cache, if not in a register). A minor collection is just moving that machine word back; promoting the live objects to the major heap is the expensive part. But maybe I'm just plain wrong on this, as I'm not an expert on the OCaml GC.
Explicit checks for queue lengths would look a bit ugly... Do you have some way of profiling this? I'm not experienced with the proper tools for profiling OCaml code either.
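For a rough check of the allocation question without a full profiler, the stdlib Gc counters can measure the words a thunk allocates; a minimal sketch (the exact delta depends on inlining and flambda):

(* Count minor-heap words allocated while running a thunk. *)
let minor_words_of f =
  let before = Gc.minor_words () in
  f ();
  Gc.minor_words () -. before

let () =
  let x = Sys.opaque_identity 42 in
  (* [Some x] is a one-field block: header + field = 2 words on 64-bit. *)
  Printf.printf "allocated: %.0f words\n"
    (minor_words_of (fun () -> ignore (Sys.opaque_identity (Some x))))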
I've run some benchmarks with httpaf (while troubleshooting httpaf/139), and it looks like this PR does not introduce a slowdown; it actually improves performance a bit. It also fixes the stack overflow described in the linked httpaf issue. Here are some numbers:
Each run is a curl invocation performing a POST of a large file to the lwt_echo_post example from the httpaf repository:
curl -H"Expect:" -XPOST -d @very_big_file -o/dev/null http://0.0.0.0:8080/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7240M    0 3620M  100 3620M   513M   513M  0:00:07  0:00:07 --:--:-- 1037M
Development and release correspond to the dune profiles used to build the examples.
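(For anyone reproducing: assuming the stock dune profiles, the examples were presumably built along the lines of

dune build --profile dev
dune build --profile release

with dev being the default.)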
I may be misunderstanding your column labels and the units being reported, but assuming that "patched" is this PR and the numbers are latencies, it looks like the patched version is slightly slower.
In any case, the point of avoiding allocation is not to avoid the cost of allocation, but to avoid the cost of garbage collection. The only way to collect less is to generate less garbage, which was my explicit goal when writing the Deque module. I'd like to preserve that property.
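For concreteness, here is a simplified sketch of that zero-garbage design (the real module in lib/faraday.ml, visible in the diff below, also grows the array and checks invariants; this elides that):

(* A deque over a flat array, parameterized by a sentinel value so that
   vacated slots can be "cleared" without allocating anything. *)
module Deque (T : sig type t val sentinel : t end) = struct
  type t =
    { mutable elements : T.t array
    ; mutable front    : int
    ; mutable back     : int }

  let create size =
    { elements = Array.make size T.sentinel; front = 0; back = 0 }

  let is_empty t = t.front = t.back

  let enqueue e t =
    t.elements.(t.back) <- e;
    t.back <- t.back + 1

  (* Reading the element out and overwriting the slot with the sentinel
     means a dequeue allocates nothing: the queue itself creates no
     wrapper objects, and the GC never scans a stale pointer. *)
  let dequeue_exn t =
    let result = t.elements.(t.front) in
    t.elements.(t.front) <- T.sentinel;
    t.front <- t.front + 1;
    result
end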
Sorry for the poorly presented benchmark results: the columns are megabytes per second, as reported in curl's output (the "Average Dload" / "Speed Upload" columns). "patched" is this PR, "vanilla" is e82d6a0.
I've added printing of GC stats at server exit and compared the numbers; the option-based version indeed increases the amount of work for the GC. I've implemented the approach you suggested as follows:
diff --git a/lib/faraday.ml b/lib/faraday.ml
index f25db14..4429e7a 100644
--- a/lib/faraday.ml
+++ b/lib/faraday.ml
@@ -39,8 +39,6 @@ type 'a iovec =
   ; off : int
   ; len : int }

-exception Dequeue_empty
-
 module Deque(T:sig type t val sentinel : t end) : sig
   type elem = T.t

@@ -95,13 +93,10 @@ end = struct
     t.back <- t.back + 1

   let dequeue_exn t =
-    if is_empty t then
-      raise Dequeue_empty
-    else
-      let result = Array.unsafe_get t.elements t.front in
-      Array.unsafe_set t.elements t.front sentinel;
-      t.front <- t.front + 1;
-      result
+    let result = Array.unsafe_get t.elements t.front in
+    Array.unsafe_set t.elements t.front sentinel;
+    t.front <- t.front + 1;
+    result

   let enqueue_front e t =
     (* This is in general not true for Deque data structures, but the usage
@@ -367,21 +362,22 @@ let yield t =
   t.yield <- true

 let rec shift_buffers t written =
-  try
+  if not (Buffers.is_empty t.scheduled) then begin
     let { len; _ } as iovec = Buffers.dequeue_exn t.scheduled in
     if len <= written then begin
       shift_buffers t (written - len)
     end else
       Buffers.enqueue_front (IOVec.shift iovec written) t.scheduled
-  with Dequeue_empty ->
+  end else begin
     assert (written = 0);
     if t.scheduled_pos = t.write_pos then begin
       t.scheduled_pos <- 0;
       t.write_pos <- 0
     end
+  end

 let rec shift_flushes t =
-  try
+  if not (Flushes.is_empty t.flushed) then begin
     let (threshold, f) as flush = Flushes.dequeue_exn t.flushed in
     (* Edited notes from @dinosaure:
      *
@@ -401,8 +397,7 @@ let rec shift_flushes t =
     if t.bytes_written - min_int >= threshold - min_int
     then begin f (); shift_flushes t end
     else Flushes.enqueue_front flush t.flushed
-  with Dequeue_empty ->
-    ()
+  end else ()

 let shift t written =
   shift_buffers t written;
Then I ran 3 tests with wrk to measure latencies. Below are the outputs of wrk, each followed by the output of Gc.print_stat from the server process.
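(The stats printing mentioned above is just the standard library hook; presumably the server registers something like:

(* Dump GC counters (minor_words, promoted_words, ...) on process exit. *)
let () = at_exit (fun () -> Gc.print_stat stdout)

before starting to serve.)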
Vanilla version
$ wrk -t1 -c 200 -d 10s http://localhost:8080/
Running 10s test @ http://localhost:8080/
  1 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.58ms  452.77us   8.50ms   81.94%
    Req/Sec    55.92k     2.93k   59.07k    81.00%
  555941 requests in 10.01s, 1.08GB read
Requests/sec: 55546.71
Transfer/sec: 110.93MB
minor_collections: 2297
major_collections: 384
compactions: 0
minor_words: 551697546
promoted_words: 67680984
major_words: 67682138
top_heap_words: 491520
heap_words: 491520
live_words: 239168
free_words: 252333
largest_free: 208523
fragments: 19
live_blocks: 66162
free_blocks: 767
heap_chunks: 1
Patched (option)
$ wrk -t1 -c 200 -d 10s http://localhost:8080/
Running 10s test @ http://localhost:8080/
  1 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.53ms  445.94us   6.88ms   81.91%
    Req/Sec    56.71k     3.48k   60.36k    84.00%
  564045 requests in 10.01s, 1.10GB read
Requests/sec: 56371.60
Transfer/sec: 112.57MB
minor_collections: 2339
major_collections: 390
compactions: 0
minor_words: 562042048
promoted_words: 68913676
major_words: 68914830
top_heap_words: 491520
heap_words: 491520
live_words: 232971
free_words: 258425
largest_free: 124108
fragments: 124
live_blocks: 64418
free_blocks: 39812
heap_chunks: 1
Patched (is_empty call)
$ wrk -t1 -c 200 -d 10s http://localhost:8080/
Running 10s test @ http://localhost:8080/
  1 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.68ms  481.83us   9.13ms   75.55%
    Req/Sec    54.59k     3.33k   58.81k    77.00%
  542740 requests in 10.01s, 1.06GB read
Requests/sec: 54234.01
Transfer/sec: 108.31MB
minor_collections: 2240
major_collections: 373
compactions: 0
minor_words: 538418413
promoted_words: 66124974
major_words: 66126128
top_heap_words: 491520
heap_words: 491520
live_words: 206070
free_words: 285385
largest_free: 30362
fragments: 65
live_blocks: 56328
free_blocks: 53155
heap_chunks: 1
I'm not sure why, but the direct is_empty check is the slowest, even though it generates the least amount of work for the GC.