perf: Avoid copying the request body more than once
When running the memory tests:
```
$ nix-shell --arg memory true
$ postgrest-test-memory
```
The following profiling report is produced:
```
Fri Apr 22 16:16 2022 Time and Allocation Profiling Report (Final)

postgrest +RTS -N -I2 -p -h -RTS

total time  = 0.01 secs (32 ticks @ 1000 us, 16 processors)
total alloc = 22,870,568 bytes (excludes profiling overheads)

COST CENTRE           MODULE                        SRC                                                          %time %alloc
MAIN                  MAIN                          <built-in>                                                    37.5    1.2
text_strict           PostgreSQL.Binary.Decoding    library/PostgreSQL/Binary/Decoding.hs:(252,1)-(260,49)         9.4    6.2
pack                  PostgreSQL.Binary.Integral    library/PostgreSQL/Binary/Integral.hs:(11,1)-(12,52)           6.2    6.9
unknown               Hasql.Private.Encoders        library/Hasql/Private/Encoders.hs:294:1-76                     6.2    0.0
receive               Network.Wai.Handler.Warp.Recv Network/Wai/Handler/Warp/Recv.hs:(103,1)-(117,69)              6.2    0.3
sendQueryPrepared     Database.PostgreSQL.LibPQ     src/Database/PostgreSQL/LibPQ.hsc:(1729,1)-(1750,40)           3.1   15.0
getvalue'             Database.PostgreSQL.LibPQ     src/Database/PostgreSQL/LibPQ.hsc:(1359,1)-(1367,54)           3.1    3.6
connectdb             Database.PostgreSQL.LibPQ     src/Database/PostgreSQL/LibPQ.hsc:(291,1)-(299,51)             3.1    0.0
decoder               Hasql.Private.Decoders.Value  library/Hasql/Private/Decoders/Value.hs:20:3-7                 3.1    2.9
field                 Hasql.Private.Decoders        library/Hasql/Private/Decoders.hs:(469,1)-(471,69)             3.1    8.6
dimension             Hasql.Private.Decoders        library/Hasql/Private/Decoders.hs:444:1-73                     3.1    1.6
createPool            Data.Pool                     Data/Pool.hs:(154,1)-(177,10)                                  3.1    0.1
cors                  Network.Wai.Middleware.Cors   src/Network/Wai/Middleware/Cors.hs:(291,1)-(409,9)             3.1    0.0
recvRequest           Network.Wai.Handler.Warp.Request Network/Wai/Handler/Warp/Request.hs:(64,1)-(100,50)         3.1    0.0
normalizedBody        PostgREST.Query.SqlFragment   src/PostgREST/Query/SqlFragment.hs:(112,1)-(123,102)           3.1   14.8
logWithZTime          PostgREST.AppState            src/PostgREST/AppState.hs:(143,1)-(145,87)                     3.1    0.3
nullableValue         Hasql.Private.Encoders.Params library/Hasql/Private/Encoders/Params.hs:(22,1)-(31,47)        0.0   14.9
value                 Hasql.Private.Decoders.Row    library/Hasql/Private/Decoders/Row.hs:(42,3)-(55,33)           0.0    5.2
decode                Hasql.Private.Decoders.Row    library/Hasql/Private/Decoders/Row.hs:54:36-92                 0.0    2.3
element               Hasql.Private.Decoders        library/Hasql/Private/Decoders.hs:(451,1)-(453,61)             0.0    1.8
composite             Hasql.Private.Decoders        library/Hasql/Private/Decoders.hs:411:1-69                     0.0    1.2
array                 Hasql.Private.Decoders        library/Hasql/Private/Decoders.hs:372:1-57                     0.0    1.6
readPGRSTEnvironment  PostgREST.Config              src/PostgREST/Config.hs:(439,1)-(440,83)                       0.0    5.1
```
It seems the request body is copied 3 times: in sendQueryPrepared, normalizedBody and nullableValue. Since we don't do any processing on the body, ideally it would be copied just once and then sent to PostgreSQL.
Looking at normalizedBody, the clear culprit is toStrict:
https://github.com/PostgREST/postgrest/blob/902e4e8921779346adc19ff6690d03820be2b797/src/PostgREST/Query/SqlFragment.hs#L111-L123
We need toStrict there because hasql requires a strict ByteString for its unknown encoder. I'm not sure why nullableValue and sendQueryPrepared copy the value again, though.
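To illustrate that copy in isolation, here's a minimal sketch (not PostgREST code): Warp delivers the request body as a lazy ByteString, i.e. a list of chunks, and collapsing it for hasql allocates one buffer of the total length and copies every chunk into it.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

-- Warp hands us the body as a lazy ByteString (one chunk per socket
-- read). hasql's `unknown` encoder wants a strict ByteString, so we
-- end up calling toStrict, which allocates a single buffer of the
-- total length and copies every chunk into it -- one full copy.
collapse :: BL.ByteString -> BS.ByteString
collapse = BL.toStrict

main :: IO ()
main = do
  -- simulate a body that arrived in two chunks
  let body = BL.fromChunks ["{\"a\":", "1}"]
  print (BS.length (collapse body))  -- 7
```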
nullableValue
Code path: nullableValue -> encodingBytes -> builderBytes -> unsafeCreate
> unsafeCreate :: Int -> (Ptr Word8 -> IO ()) -> ByteString
>
> A way of creating ByteStrings outside the IO monad. The Int argument gives the final size of the ByteString. Unlike createAndTrim the ByteString is not reallocated if the final size is less than the estimated size.
So clearly unsafeCreate is doing the allocation here.
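A minimal sketch of the mechanism (not hasql's actual encoder code): unsafeCreate always allocates a fresh buffer that the caller's action then fills, so whatever bytes the encoder already holds get written out once more.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Internal as BSI
import Data.Word (Word8)
import Foreign.Storable (pokeByteOff)

-- unsafeCreate allocates a fresh buffer of the stated size and hands a
-- raw pointer to the filler action; the bytes the encoder already has
-- must be written (i.e. copied) into that new buffer. This is the
-- allocation that shows up under nullableValue in the profile.
hi :: BS.ByteString
hi = BSI.unsafeCreate 2 $ \p -> do
  pokeByteOff p 0 (0x68 :: Word8)  -- 'h'
  pokeByteOff p 1 (0x69 :: Word8)  -- 'i'

main :: IO ()
main = print hi  -- prints "hi"
```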
sendQueryPrepared
Code path: sendQueryPrepared -> useAsCString
> useAsCString :: ByteString -> (CString -> IO a) -> IO a
>
> O(n) construction. Use a ByteString with a function requiring a null-terminated CString. The CString is a copy and will be freed automatically; it must not be stored or used after the subcomputation finishes.

So useAsCString is doing a copy here as well.
This one is a bit of a mess. I think LinearTypes would let libraries ensure this doesn't happen.
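The difference can be sketched like this (withBodyPtr is a hypothetical wrapper, not PostgREST code): useAsCString must copy because it appends a terminating NUL, while unsafeUseAsCStringLen passes a pointer straight into the existing buffer together with an explicit length, which is enough for libpq calls that take parameter lengths, as PQsendQueryPrepared does for binary-format parameters.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BSU
import Foreign.C.String (CStringLen)

-- unsafeUseAsCStringLen performs no copy: it exposes the ByteString's
-- own buffer plus its length. That is only safe when the C side does
-- not rely on NUL termination -- i.e. when it receives the length
-- explicitly, as with binary-format libpq parameters.
withBodyPtr :: BS.ByteString -> (CStringLen -> IO a) -> IO a
withBodyPtr = BSU.unsafeUseAsCStringLen

main :: IO ()
main = withBodyPtr "{\"a\":1}" $ \(_ptr, len) -> print len  -- 7
```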
There are some options for making progress on this. The current best state is https://github.com/PostgREST/postgrest/pull/2349, which:

- avoids the encodeAsCString copy by switching to `jsonBytes` encoding (so we have non-zero-terminated binary encoding) and modifying postgresql-libpq to use unsafeUseAsCString
- avoids a hasql binary encoding copy by adding a lazy bytestring encoder variant of `jsonBytes`
The postgresql-libpq change seems like a reasonable change to get merged upstream. Regarding hasql, it's a bit less clear how to fit our needs into the API nicely -- currently exploring this.
For the postgresql-libpq change in particular we could move ahead and build against a patched fork since the API doesn't change.
Also let's note some options that go beyond these "easy" wins:
- We might be able to read the body into a strict bytestring directly. If we get a content-length header, we should be able to allocate the body buffer directly. That could tie in with the changes above, and avoid the copy from lazy bytestring to bytestring.
- Using libpq (the C library) is what requires us to copy the full body to memory at all. A native Haskell postgres client library would allow us to write directly to the outgoing socket. postgresql-pure provides some of this, although I don't think the API is quite ideal for what we'd want. Still, it would be interesting to experiment with it. E.g. it should not be too hard to port hasql to use this as a backend.
- Tying in with this, the ideal would be to stream the body. It seems hard to fit "streaming parameters" into hasql though, even if we have a pure Haskell postgres client library that supports this. I'm unsure whether it'd be possible to do something like this by hand while continuing to use the other parts of hasql.
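The Content-Length idea above could be sketched roughly as follows. readBodyStrict and its chunk-source argument are hypothetical names, modeled on WAI's convention that the body reader returns an empty chunk at end of stream: with a known length we can allocate the final buffer once and copy each socket chunk into place, so the body never exists as a lazy ByteString at all.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.ByteString.Internal as BSI
import qualified Data.ByteString.Unsafe as BSU
import Data.IORef (atomicModifyIORef', newIORef)
import Foreign.Ptr (castPtr, plusPtr)

-- Hypothetical sketch: allocate the final buffer from Content-Length
-- up front and memcpy each incoming chunk into place. Real code must
-- also reject bodies that overrun or undershoot the declared length.
readBodyStrict :: Int -> IO BS.ByteString -> IO BS.ByteString
readBodyStrict len getChunk = BSI.create len (go 0)
  where
    go off buf
      | off >= len = pure ()
      | otherwise = do
          chunk <- getChunk  -- "" signals end of body, as in WAI
          let n = BS.length chunk
          if n == 0
            then pure ()  -- body ended early; real code should error here
            else do
              BSU.unsafeUseAsCStringLen chunk $ \(src, l) ->
                BSI.memcpy (buf `plusPtr` off) (castPtr src) l
              go (off + n) buf

main :: IO ()
main = do
  -- simulate a socket delivering the body in two chunks, then EOF
  ref <- newIORef ["{\"a\":", "1}", ""]
  let getChunk = atomicModifyIORef' ref $ \cs ->
        case cs of { (c : rest) -> (rest, c); [] -> ([], "") }
  body <- readBodyStrict 7 getChunk
  print (BS.length body)  -- 7
```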
> Does "streaming the body" mean the database connection is taken from the pool and blocked once a http request starts and is still ongoing?
I think this could, while improving memory usage, decrease performance overall, because it would keep the database connections blocked for longer, saturating the pool faster.
I think there is value in making sure to receive a full request first and then sending it to the database at once. And that includes making a single copy of the request body, I assume.
> Does "streaming the body" mean the database connection is taken from the pool and blocked once a http request starts and is still ongoing?
Indeed. Good point, thanks -- it's easy to get carried away when optimizing a single parameter :).
> For the postgresql-libpq change in particular we could move ahead and build against a patched fork since the API doesn't change.
Sounds good, we can use the fork for now.
> We might be able to read the body into a strict bytestring directly. If we get a content-length header, we should be able to allocate the body buffer directly. That could tie in with the changes above, and avoid the copy from lazy bytestring to bytestring. Using libpq (the C library) is what requires us to copy the full body to memory at all.
Just to clarify: the above means that even with the optimizations in the current libraries, we'll always need one extra copy of the request body, right?
Regarding the Content-Length header, it should work well if we can fall back to doing one copy when the header is not present.
The postgresql-pure idea looks even better. If it takes a while to do, perhaps we can do the Content-Length one first (if it's simple) and work on postgresql-pure as a next step?
> Just to clarify: the above means that even with the optimizations in the current libraries, we'll always need one extra copy of the request body, right?
Yes, with the current changes, we always read the body to memory in chunks (a lazy bytestring), and need to copy that to a single chunk to pass on to libpq. (Which is why I'm confused by the numbers in the memory tests -- I'd expect that we need to keep both the lazy and the strict bytestring in memory at the same time, so a peak memory use of 2x body size.)
Just tested v10 against the latest pre-release (v10.0.0.20221011, which includes https://github.com/PostgREST/postgrest/pull/2349) using this benchmark setup with this k6 script (a bulk POST), varying the number of virtual users (10, 50, 100).
v10
```
data_received..............: 6.9 MB 216 kB/s
data_sent..................: 309 MB 9.7 MB/s
✓ failed requests............: 0.00% ✓ 0 ✗ 43073
http_req_blocked...........: avg=6.98µs min=1.23µs med=2.27µs max=2.71ms p(90)=3.65µs p(95)=5.29µs
http_req_connecting........: avg=3.11µs min=0s med=0s max=2.29ms p(90)=0s p(95)=0s
✓ http_req_duration..........: avg=6.08ms min=2.45ms med=5.84ms max=1.77s p(90)=8.23ms p(95)=9.03ms
http_req_receiving.........: avg=58.51µs min=14.5µs med=44µs max=3.79ms p(90)=61.58µs p(95)=84.56µs
http_req_sending...........: avg=55.05µs min=25.81µs med=51.35µs max=2.28ms p(90)=68.14µs p(95)=83.53µs
http_req_tls_handshaking...: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...........: avg=5.97ms min=2.3ms med=5.73ms max=1.77s p(90)=8.11ms p(95)=8.92ms
http_reqs..................: 43074 1353.110185/s
iteration_duration.........: avg=6.99ms min=3.24ms med=6.73ms max=1.77s p(90)=9.25ms p(95)=10.1ms
iterations.................: 43073 1353.078771/s
vus........................: 0 min=0 max=10
vus_max....................: 10 min=10 max=10
```
```
data_received..............: 8.1 MB 245 kB/s
data_sent..................: 350 MB 11 MB/s
✓ failed requests............: 1.52% ✓ 751 ✗ 48508
http_req_blocked...........: avg=6.73µs min=1.22µs med=2.21µs max=2.03ms p(90)=3.61µs p(95)=5.28µs
http_req_connecting........: avg=2.98µs min=0s med=0s max=680.92µs p(90)=0s p(95)=0s
✓ http_req_duration..........: avg=29.6ms min=3ms med=26.27ms max=2.95s p(90)=39.85ms p(95)=49.13ms
http_req_receiving.........: avg=54.57µs min=18.04µs med=44.07µs max=5.79ms p(90)=60.14µs p(95)=79.9µs
http_req_sending...........: avg=54.43µs min=28.01µs med=51.28µs max=4.87ms p(90)=66.9µs p(95)=80.96µs
http_req_tls_handshaking...: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...........: avg=29.49ms min=2.9ms med=26.15ms max=2.95s p(90)=39.74ms p(95)=49.02ms
http_reqs..................: 49260 1491.34639/s
iteration_duration.........: avg=30.51ms min=3.83ms med=27.15ms max=2.95s p(90)=40.79ms p(95)=50.16ms
iterations.................: 49259 1491.316115/s
vus........................: 0 min=0 max=50
vus_max....................: 50 min=50 max=50
```
```
data_received..............: 8.5 MB 226 kB/s
data_sent..................: 367 MB 9.7 MB/s
✓ failed requests............: 2.15% ✓ 1114 ✗ 50648
http_req_blocked...........: avg=13.88µs min=1.22µs med=2.2µs max=9.44ms p(90)=3.54µs p(95)=5.02µs
http_req_connecting........: avg=10.12µs min=0s med=0s max=9.28ms p(90)=0s p(95)=0s
✓ http_req_duration..........: avg=57.24ms min=3.65ms med=42.55ms max=7.59s p(90)=76.13ms p(95)=88.87ms
http_req_receiving.........: avg=54.14µs min=15.75µs med=44.81µs max=8.56ms p(90)=60.7µs p(95)=75.48µs
http_req_sending...........: avg=52.55µs min=26.55µs med=50.99µs max=2.41ms p(90)=66.26µs p(95)=80.64µs
http_req_tls_handshaking...: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...........: avg=57.14ms min=3.56ms med=42.44ms max=7.59s p(90)=76.02ms p(95)=88.75ms
http_reqs..................: 51763 1373.36774/s
iteration_duration.........: avg=58.13ms min=4.5ms med=43.46ms max=7.59s p(90)=77.01ms p(95)=89.73ms
iterations.................: 51762 1373.341208/s
vus........................: 0 min=0 max=100
vus_max....................: 100 min=100 max=100
```
v10.0.0.20221011
```
data_received..............: 7.0 MB 217 kB/s
data_sent..................: 313 MB 9.7 MB/s
✓ failed requests............: 0.00% ✓ 0 ✗ 43681
http_req_blocked...........: avg=7.12µs min=1.23µs med=2.27µs max=3.24ms p(90)=3.66µs p(95)=5.49µs
http_req_connecting........: avg=3.2µs min=0s med=0s max=3.2ms p(90)=0s p(95)=0s
✓ http_req_duration..........: avg=6ms min=2.5ms med=5.78ms max=2.12s p(90)=7.91ms p(95)=8.67ms
http_req_receiving.........: avg=57.65µs min=16.92µs med=43.72µs max=4.34ms p(90)=60.55µs p(95)=83.78µs
http_req_sending...........: avg=55.41µs min=24.83µs med=51.44µs max=1.8ms p(90)=68.45µs p(95)=82.51µs
http_req_tls_handshaking...: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...........: avg=5.89ms min=2.32ms med=5.67ms max=2.12s p(90)=7.8ms p(95)=8.55ms
http_reqs..................: 43682 1356.964725/s
iteration_duration.........: avg=6.91ms min=3.25ms med=6.67ms max=2.12s p(90)=8.92ms p(95)=9.79ms
iterations.................: 43681 1356.933661/s
vus........................: 0 min=0 max=10
vus_max....................: 10 min=10 max=10
```
```
data_received..............: 8.1 MB 240 kB/s
data_sent..................: 356 MB 11 MB/s
✓ failed requests............: 0.71% ✓ 357 ✗ 49690
http_req_blocked...........: avg=7.47µs min=1.13µs med=2.21µs max=4.56ms p(90)=3.63µs p(95)=5.61µs
http_req_connecting........: avg=3.17µs min=0s med=0s max=3.03ms p(90)=0s p(95)=0s
✓ http_req_duration..........: avg=29.11ms min=4.96ms med=26.57ms max=3.61s p(90)=38.4ms p(95)=46.64ms
http_req_receiving.........: avg=55.68µs min=16.28µs med=43.85µs max=9.68ms p(90)=59.98µs p(95)=81.03µs
http_req_sending...........: avg=54.21µs min=27.98µs med=51.28µs max=6.65ms p(90)=66.91µs p(95)=82.28µs
http_req_tls_handshaking...: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...........: avg=29ms min=4.86ms med=26.46ms max=3.61s p(90)=38.28ms p(95)=46.52ms
http_reqs..................: 50048 1485.597503/s
iteration_duration.........: avg=30.04ms min=5.73ms med=27.47ms max=3.61s p(90)=39.35ms p(95)=47.57ms
iterations.................: 50047 1485.567819/s
vus........................: 0 min=0 max=50
vus_max....................: 50 min=50 max=50
```
```
data_received..............: 8.1 MB 233 kB/s
data_sent..................: 353 MB 10 MB/s
✓ failed requests............: 1.24% ✓ 620 ✗ 49134
http_req_blocked...........: avg=12.85µs min=1.2µs med=2.2µs max=8.09ms p(90)=3.61µs p(95)=5.19µs
http_req_connecting........: avg=9.09µs min=0s med=0s max=7.78ms p(90)=0s p(95)=0s
✓ http_req_duration..........: avg=59.51ms min=3.16ms med=44.94ms max=4.74s p(90)=74.99ms p(95)=90.09ms
http_req_receiving.........: avg=55.63µs min=19.82µs med=44.88µs max=29.96ms p(90)=60.25µs p(95)=77.31µs
http_req_sending...........: avg=53.33µs min=29.16µs med=51.37µs max=7.62ms p(90)=66.36µs p(95)=79.67µs
http_req_tls_handshaking...: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...........: avg=59.4ms min=2.76ms med=44.83ms max=4.74s p(90)=74.89ms p(95)=89.94ms
http_reqs..................: 49755 1427.870379/s
iteration_duration.........: avg=60.43ms min=6.29ms med=45.83ms max=4.74s p(90)=75.89ms p(95)=91.07ms
iterations.................: 49754 1427.84168/s
vus........................: 0 min=0 max=100
vus_max....................: 100 min=100 max=100
```
Looks like there's a slight improvement in the percentage of failed requests and in throughput.
I also saw lower %MEM in top while doing the load tests: v10 reached 11.8 %MEM, while v10.0.0.20221011 reached 9.8 %MEM.
> Using libpq (the C library) is what requires us to copy the full body to memory at all.
@robx Say if:
- we converted the bytestring to a CString type, and
- hasql accepted a CString.

Would that allow us to have a single copy?