feat: re-allocate pages dynamically
Today we allocate all pages at once when first scheduling a request. This can lead to under-utilisation, since many requests terminate with an EOS token well before reaching `max_new_tokens` (see the p50 in prod).
This PR instead allocates pages dynamically, extending a request's allocation each time it fills its current page.
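Roughly, the idea looks like the following sketch (illustrative Python only, not the actual scheduler code; `PageAllocator`, `PAGE_SIZE` and the helper names are all hypothetical):

```python
# Illustrative sketch -- not the actual scheduler code.
# All names (PageAllocator, PAGE_SIZE, schedule_*) are hypothetical.

PAGE_SIZE = 16  # tokens per KV-cache page (assumed)

class PageAllocator:
    def __init__(self, total_pages: int):
        self.free_pages = list(range(total_pages))

    def allocate(self, n: int) -> list[int] | None:
        """Return n page ids, or None if not enough free pages."""
        if len(self.free_pages) < n:
            return None
        pages, self.free_pages = self.free_pages[:n], self.free_pages[n:]
        return pages

    def free(self, pages: list[int]) -> None:
        self.free_pages.extend(pages)

def pages_needed(num_tokens: int) -> int:
    return -(-num_tokens // PAGE_SIZE)  # ceil division

# Before: pessimistic, reserve the worst case up front.
def schedule_pessimistic(alloc, prompt_len, max_new_tokens):
    return alloc.allocate(pages_needed(prompt_len + max_new_tokens))

# After: optimistic, start with the prompt plus one decode page and
# extend each time the request fills its last page.
def schedule_optimistic(alloc, prompt_len):
    return alloc.allocate(pages_needed(prompt_len) + 1)

def maybe_extend(alloc, request_pages, tokens_so_far) -> bool:
    if tokens_so_far % PAGE_SIZE == 0:  # last page just filled
        extra = alloc.allocate(1)
        if extra is None:
            return False  # out of pages: caller must preempt/requeue
        request_pages.extend(extra)
    return True
```

Since a request that stops at EOS never reserves pages it will not use, the freed headroom lets the scheduler admit more concurrent requests, which is what the benchmark below measures.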
Benchmark: ShareGPT with `max_new_tokens` forced to 2048 tokens.
- Today, without cache extension (all pages allocated at once):
✓ Post status is 200
checks.........................: 100.00% ✓ 91 ✗ 0
data_received..................: 985 kB 16 kB/s
data_sent......................: 218 kB 3.6 kB/s
dropped_iterations.............: 410 6.721098/s
generated_tokens...............: 13388 219.468448/s
http_req_blocked...............: avg=134.68µs min=1.98µs med=139.44µs max=252.97µs p(90)=160.32µs p(95)=187.01µs
http_req_connecting............: avg=87.16µs min=0s med=88.77µs max=170.74µs p(90)=107.65µs p(95)=126.97µs
http_req_duration..............: avg=20.01s min=44.28ms med=19.3s max=53.21s p(90)=42.29s p(95)=47.46s
{ expected_response:true }...: avg=20.01s min=44.28ms med=19.3s max=53.21s p(90)=42.29s p(95)=47.46s
✓ http_req_failed................: 0.00% ✓ 0 ✗ 91
http_req_receiving.............: avg=68.66µs min=27.79µs med=62.52µs max=151.18µs p(90)=100.17µs p(95)=132.87µs
http_req_sending...............: avg=39.49µs min=18.96µs med=39.5µs max=82.64µs p(90)=51.91µs p(95)=59.98µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=20.01s min=44.05ms med=19.3s max=53.21s p(90)=42.29s p(95)=47.46s
http_reqs......................: 91 1.491756/s
inference_time.................: avg=6.67s min=42ms med=2.96s max=41.45s p(90)=17.26s p(95)=19.74s
iteration_duration.............: avg=20.01s min=45.18ms med=19.3s max=53.21s p(90)=42.29s p(95)=47.46s
iterations.....................: 91 1.491756/s
queue_time.....................: avg=13.33s min=1ms med=11.75s max=47.45s p(90)=30.87s p(95)=32.49s
time_per_token.................: avg=107.59ms min=39ms med=47ms max=366ms p(90)=260ms p(95)=338ms
total_time.....................: avg=20.01s min=42ms med=19.3s max=53.21s p(90)=42.29s p(95)=47.46s
validation_time................: avg=1ms min=1ms med=1ms max=1ms p(90)=1ms p(95)=1ms
vus............................: 100 min=7 max=100
vus_max........................: 100 min=100 max=100
- With this PR:
✓ Post status is 200
checks.........................: 100.00% ✓ 206 ✗ 0
data_received..................: 2.0 MB 32 kB/s
data_sent......................: 332 kB 5.4 kB/s
dropped_iterations.............: 296 4.852316/s
generated_tokens...............: 26491 434.265927/s
http_req_blocked...............: avg=58.13µs min=1.5µs med=3.6µs max=250.73µs p(90)=145.47µs p(95)=151.59µs
http_req_connecting............: avg=36.95µs min=0s med=0s max=174.92µs p(90)=92.87µs p(95)=98.04µs
http_req_duration..............: avg=11.43s min=43.66ms med=6.82s max=51.01s p(90)=29.03s p(95)=34.9s
{ expected_response:true }...: avg=11.43s min=43.66ms med=6.82s max=51.01s p(90)=29.03s p(95)=34.9s
✓ http_req_failed................: 0.00% ✓ 0 ✗ 206
http_req_receiving.............: avg=64.4µs min=14.6µs med=58.8µs max=196.64µs p(90)=99.13µs p(95)=124.75µs
http_req_sending...............: avg=30.86µs min=19.16µs med=24.53µs max=404.66µs p(90)=41.19µs p(95)=44.9µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=11.43s min=43.5ms med=6.82s max=51.01s p(90)=29.03s p(95)=34.9s
http_reqs......................: 206 3.37695/s
inference_time.................: avg=10.88s min=42ms med=5.92s max=50.7s p(90)=28.16s p(95)=34.71s
iteration_duration.............: avg=11.43s min=44.34ms med=6.82s max=51.01s p(90)=29.04s p(95)=34.9s
iterations.....................: 206 3.37695/s
queue_time.....................: avg=543.15ms min=1ms med=433.5ms max=2.62s p(90)=1.13s p(95)=1.34s
time_per_token.................: avg=230.86ms min=42ms med=97ms max=900ms p(90)=715ms p(95)=730.25ms
total_time.....................: avg=11.43s min=42ms med=6.82s max=51.01s p(90)=29.03s p(95)=34.9s
validation_time................: avg=1ms min=1ms med=1ms max=1ms p(90)=1ms p(95)=1ms
vus............................: 99 min=7 max=100
vus_max........................: 100 min=100 max=100
Queue time is greatly improved (p95: 32.49s => 1.34s) and throughput more than doubles (1.49 => 3.37 req/s).
@zirconium-n Thanks for your input.
Can you provide a bit of background on yourself? Who are you, and how are you trying to help here?
Your comments definitely seem on point in some respects, but from our side it feels very off to have someone we have no connection with barge in and comment on the code as authoritatively as you are doing.
Starting with an introduction on where you come from and what your goal is will go a long way toward us replying in a positive manner.
Ah, sorry if the comments bothered you. I'm playing with a fork of this repo myself and have been messing with this particular part of the code recently (and may eventually open a PR of my own). I noticed there are changes happening upstream and want to keep up with them, so I thought I might as well provide some help.
By no means am I trying to be rude or sound authoritative; I'm just trying to offer some ergonomic nits and ask some questions. I will not engage further if this is unwanted.
> I will not engage further if this is unwanted.
No, this is fine, you can continue; just bear in mind that we might not know all this beforehand :). Thanks for your input.
As for the core of the changes here: it's about becoming optimistic in memory allocation instead of the current pessimistic approach, i.e. allocating all possible memory for a given request up front vs. allocating later and having to deal with potential OOM situations.
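For illustration, here is a minimal sketch of one way that OOM failure path could be handled, reusing the hypothetical `PageAllocator` from the sketch above. The preemption policy (evict the newest running request and requeue it) is an assumption for the example, not necessarily what this PR does:

```python
# Illustrative sketch of one possible OOM-handling policy (assumption:
# preempt the most recently scheduled request and requeue it; the PR
# may use a different policy).

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    pages: list[int] = field(default_factory=list)
    tokens_so_far: int = 0

def extend_or_preempt(alloc, req, running: deque, waiting: deque) -> bool:
    """Try to give `req` one more page, preempting newer requests on OOM."""
    while True:
        extra = alloc.allocate(1)
        if extra is not None:
            req.pages.extend(extra)
            return True
        if not running or running[-1] is req:
            return False  # nothing left to preempt
        victim = running.pop()      # newest running request (assumed policy)
        alloc.free(victim.pages)    # reclaim its KV-cache pages
        victim.pages.clear()
        waiting.appendleft(victim)  # retry it later from scratch
```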
I think this could be interesting, especially in the context of this PR:
https://buildkite.com/vllm/performance-benchmark/builds/4068