JetStream
Performance-optimized interleaved-mode JetStream server
- Optimized TPU duty cycle (largest gap < 4ms)
- Optimized TTFT (time to first token): dispatch prefill tasks as soon as possible without unnecessary blocking on the CPU, keep backpressure so that inserts happen as soon as possible, and return the first token as soon as possible.
- Optimized TPOT (time per output token): run generate and detokenize tasks strictly in sequence without unnecessary blocking on the CPU.
- Optimized output token throughput: prioritize prefill appropriately and balance TTFT against decode in high-throughput situations.
- Tested with llama2-70b JetStream MaxText server on v5e-8 VM
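
The scheduling ideas in the list above (a bounded hand-off for backpressure, the first token returned right after prefill, generate followed by detokenize in order) can be illustrated with a toy Python sketch. This is not JetStream's code: `run_pipeline`, the queue names, and the fake token strings are hypothetical and exist only to show the shape of such a pipeline.

```python
import queue
import threading


def run_pipeline(prompts, decode_steps=3, backlog_size=1):
    """Toy three-stage pipeline: prefill -> generate -> detokenize.

    Hypothetical illustration only; no real model or TPU work happens here.
    """
    # Bounded queue: prefill blocks on put() when generate falls behind,
    # which is the backpressure that keeps pending inserts from piling up.
    prefill_backlog = queue.Queue(maxsize=backlog_size)
    detokenize_backlog = queue.Queue()
    events = []  # ordered log of what was returned to the client
    lock = threading.Lock()

    def log(event):
        with lock:
            events.append(event)

    def prefill_worker():
        for request_id, prompt in enumerate(prompts):
            first_token = f"{prompt}:tok0"  # stand-in for a real prefill step
            # Return the first token immediately, before waiting on insert.
            log(("first_token", request_id, first_token))
            prefill_backlog.put((request_id, prompt))  # may block (backpressure)
        prefill_backlog.put(None)  # sentinel: no more requests

    def generate_worker():
        while True:
            item = prefill_backlog.get()
            if item is None:
                detokenize_backlog.put(None)
                return
            request_id, prompt = item
            for step in range(1, decode_steps + 1):
                # Hand each generated token to detokenize in FIFO order.
                detokenize_backlog.put((request_id, f"{prompt}:tok{step}"))

    def detokenize_worker():
        while True:
            item = detokenize_backlog.get()
            if item is None:
                return
            log(("token", item[0], item[1]))

    workers = [
        threading.Thread(target=prefill_worker),
        threading.Thread(target=generate_worker),
        threading.Thread(target=detokenize_worker),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return events


if __name__ == "__main__":
    for event in run_pipeline(["a", "b"], decode_steps=2):
        print(event)
```

Each stage runs on its own thread, so the CPU side of one stage does not block the others. Only the bounded `prefill_backlog` makes prefill wait. The interleaving of events across requests can vary from run to run, but within a single request the first token always precedes the decoded tokens, and the decoded tokens arrive in step order.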