feat: re-allocate pages dynamically
Today we allocate all pages at once when first scheduling a request. This can lead to under-utilisation, since many requests terminate with an EOS token well before reaching `max_new_tokens` (see the p50 in prod).
This PR instead allocates pages dynamically, extending a request's allocation each time it fills its current page.
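Roughly, the idea looks like the following sketch (illustrative Python only, not the actual scheduler code; `PageAllocator`, `PAGE_SIZE` and the helper names are all hypothetical):

```python
# Illustrative sketch -- not the actual scheduler code.
# All names (PageAllocator, PAGE_SIZE, schedule_*) are hypothetical.

PAGE_SIZE = 16  # tokens per KV-cache page (assumed)

class PageAllocator:
    def __init__(self, total_pages: int):
        self.free_pages = list(range(total_pages))

    def allocate(self, n: int) -> list[int] | None:
        """Return n page ids, or None if not enough free pages."""
        if len(self.free_pages) < n:
            return None
        pages, self.free_pages = self.free_pages[:n], self.free_pages[n:]
        return pages

    def free(self, pages: list[int]) -> None:
        self.free_pages.extend(pages)

def pages_needed(num_tokens: int) -> int:
    return -(-num_tokens // PAGE_SIZE)  # ceil division

# Before: pessimistic, reserve the worst case up front.
def schedule_pessimistic(alloc, prompt_len, max_new_tokens):
    return alloc.allocate(pages_needed(prompt_len + max_new_tokens))

# After: optimistic, start with the prompt plus one decode page and
# extend each time the request fills its last page.
def schedule_optimistic(alloc, prompt_len):
    return alloc.allocate(pages_needed(prompt_len) + 1)

def maybe_extend(alloc, request_pages, tokens_so_far) -> bool:
    if tokens_so_far % PAGE_SIZE == 0:  # last page just filled
        extra = alloc.allocate(1)
        if extra is None:
            return False  # out of pages: caller must preempt/requeue
        request_pages.extend(extra)
    return True
```

Since a request that stops at EOS never reserves pages it will not use, the freed headroom lets the scheduler admit more concurrent requests, which is what the benchmark below measures.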
Benchmark: ShareGPT with `max_new_tokens` forced to 2048 tokens.
- Today, without cache extension (all pages allocated at once):
✓ Post status is 200
checks.........................: 100.00% ✓ 91 ✗ 0
data_received..................: 985 kB 16 kB/s
data_sent......................: 218 kB 3.6 kB/s
dropped_iterations.............: 410 6.721098/s
generated_tokens...............: 13388 219.468448/s
http_req_blocked...............: avg=134.68µs min=1.98µs med=139.44µs max=252.97µs p(90)=160.32µs p(95)=187.01µs
http_req_connecting............: avg=87.16µs min=0s med=88.77µs max=170.74µs p(90)=107.65µs p(95)=126.97µs
http_req_duration..............: avg=20.01s min=44.28ms med=19.3s max=53.21s p(90)=42.29s p(95)=47.46s
{ expected_response:true }...: avg=20.01s min=44.28ms med=19.3s max=53.21s p(90)=42.29s p(95)=47.46s
✓ http_req_failed................: 0.00% ✓ 0 ✗ 91
http_req_receiving.............: avg=68.66µs min=27.79µs med=62.52µs max=151.18µs p(90)=100.17µs p(95)=132.87µs
http_req_sending...............: avg=39.49µs min=18.96µs med=39.5µs max=82.64µs p(90)=51.91µs p(95)=59.98µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=20.01s min=44.05ms med=19.3s max=53.21s p(90)=42.29s p(95)=47.46s
http_reqs......................: 91 1.491756/s
inference_time.................: avg=6.67s min=42ms med=2.96s max=41.45s p(90)=17.26s p(95)=19.74s
iteration_duration.............: avg=20.01s min=45.18ms med=19.3s max=53.21s p(90)=42.29s p(95)=47.46s
iterations.....................: 91 1.491756/s
queue_time.....................: avg=13.33s min=1ms med=11.75s max=47.45s p(90)=30.87s p(95)=32.49s
time_per_token.................: avg=107.59ms min=39ms med=47ms max=366ms p(90)=260ms p(95)=338ms
total_time.....................: avg=20.01s min=42ms med=19.3s max=53.21s p(90)=42.29s p(95)=47.46s
validation_time................: avg=1ms min=1ms med=1ms max=1ms p(90)=1ms p(95)=1ms
vus............................: 100 min=7 max=100
vus_max........................: 100 min=100 max=100
- With this PR:
✓ Post status is 200
checks.........................: 100.00% ✓ 206 ✗ 0
data_received..................: 2.0 MB 32 kB/s
data_sent......................: 332 kB 5.4 kB/s
dropped_iterations.............: 296 4.852316/s
generated_tokens...............: 26491 434.265927/s
http_req_blocked...............: avg=58.13µs min=1.5µs med=3.6µs max=250.73µs p(90)=145.47µs p(95)=151.59µs
http_req_connecting............: avg=36.95µs min=0s med=0s max=174.92µs p(90)=92.87µs p(95)=98.04µs
http_req_duration..............: avg=11.43s min=43.66ms med=6.82s max=51.01s p(90)=29.03s p(95)=34.9s
{ expected_response:true }...: avg=11.43s min=43.66ms med=6.82s max=51.01s p(90)=29.03s p(95)=34.9s
✓ http_req_failed................: 0.00% ✓ 0 ✗ 206
http_req_receiving.............: avg=64.4µs min=14.6µs med=58.8µs max=196.64µs p(90)=99.13µs p(95)=124.75µs
http_req_sending...............: avg=30.86µs min=19.16µs med=24.53µs max=404.66µs p(90)=41.19µs p(95)=44.9µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=11.43s min=43.5ms med=6.82s max=51.01s p(90)=29.03s p(95)=34.9s
http_reqs......................: 206 3.37695/s
inference_time.................: avg=10.88s min=42ms med=5.92s max=50.7s p(90)=28.16s p(95)=34.71s
iteration_duration.............: avg=11.43s min=44.34ms med=6.82s max=51.01s p(90)=29.04s p(95)=34.9s
iterations.....................: 206 3.37695/s
queue_time.....................: avg=543.15ms min=1ms med=433.5ms max=2.62s p(90)=1.13s p(95)=1.34s
time_per_token.................: avg=230.86ms min=42ms med=97ms max=900ms p(90)=715ms p(95)=730.25ms
total_time.....................: avg=11.43s min=42ms med=6.82s max=51.01s p(90)=29.03s p(95)=34.9s
validation_time................: avg=1ms min=1ms med=1ms max=1ms p(90)=1ms p(95)=1ms
vus............................: 99 min=7 max=100
vus_max........................: 100 min=100 max=100
Queue time is greatly improved (p95: 32.49s => 1.34s) and throughput more than doubles (1.49 => 3.37 req/s).
@zirconium-n Thanks for your input.
Can you provide a bit of background on yourself? Who are you, and how are you trying to help here?
Your comments definitely seem on point in some respects, but from our side it feels very off to have someone we have no connection with barge in and comment on the code as authoritatively as you are doing.
Starting with an introduction on where you come from and what your goal is will go a long way toward us replying in a positive manner.
Ah, sorry if the comments bothered you. I'm playing with a fork of this repo myself and have been messing with this particular part of the code recently (and may eventually open a PR of my own). I noticed there are changes happening upstream and want to keep up with them, so I thought I might as well provide some help.
By no means am I trying to be rude or sound authoritative; I'm just trying to offer some ergonomic nits and ask some questions. I will not engage further if this is unwanted.
> I will not engage further if this is unwanted.
No, this is fine, you can continue; just bear in mind that we might not know all this beforehand :). Thanks for your input.
As for the core of the changes here: it's about becoming optimistic in memory allocation instead of the current pessimistic approach, i.e. allocating all possible memory for a given request up front vs. allocating later and having to deal with potential OOM situations.
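For illustration, here is a minimal sketch of one way that OOM failure path could be handled, reusing the hypothetical `PageAllocator` from the sketch above. The preemption policy (evict the newest running request and requeue it) is an assumption for the example, not necessarily what this PR does:

```python
# Illustrative sketch of one possible OOM-handling policy (assumption:
# preempt the most recently scheduled request and requeue it; the PR
# may use a different policy).

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    pages: list[int] = field(default_factory=list)
    tokens_so_far: int = 0

def extend_or_preempt(alloc, req, running: deque, waiting: deque) -> bool:
    """Try to give `req` one more page, preempting newer requests on OOM."""
    while True:
        extra = alloc.allocate(1)
        if extra is not None:
            req.pages.extend(extra)
            return True
        if not running or running[-1] is req:
            return False  # nothing left to preempt
        victim = running.pop()      # newest running request (assumed policy)
        alloc.free(victim.pages)    # reclaim its KV-cache pages
        victim.pages.clear()
        waiting.appendleft(victim)  # retry it later from scratch
```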
I think this could be interesting, especially in the context of this PR:
https://buildkite.com/vllm/performance-benchmark/builds/4068