JetStream
JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future -- PRs welcome).
- General cleanup of instructions
- Generalizing the Llama checkpoint-conversion script so that it supports custom GCS buckets for the MaxText checkpoints
- Adding quantization instructions
Currently, the model conversion script will [create a bucket](https://github.com/google/JetStream/blob/main/jetstream/tools/maxtext/model_ckpt_conversion.sh#L36) via `export MODEL_BUCKET=gs://${USER}-maxtext`. However, the `gs://${USER}-maxtext` path may already exist, which I imagine would break the script....
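
One possible fix is to make the bucket creation idempotent by checking for the bucket first. A minimal sketch follows; the `ensure_bucket` helper is illustrative and not part of the current script, though `gsutil ls -b` and `gsutil mb` are the standard commands for checking and creating buckets.

```python
# Illustrative guard (not in the current script): create the MaxText
# checkpoint bucket only when it does not already exist.
import os
import subprocess

def ensure_bucket(bucket: str) -> None:
    """Create a GCS bucket with gsutil unless it already exists."""
    exists = subprocess.run(
        ["gsutil", "ls", "-b", bucket], capture_output=True
    ).returncode == 0
    if not exists:
        subprocess.run(["gsutil", "mb", bucket], check=True)

ensure_bucket(f"gs://{os.environ['USER']}-maxtext")
```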
Great work! When will GPU support be released?
- Optimized TPU duty cycle (largest gap < 4 ms)
- Optimized TTFT: dispatch prefill tasks as soon as possible without unnecessary blocking on the CPU, keep backpressure to enforce inserts as soon as possible, and return the first token... (a backpressure sketch follows this list)
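
A minimal sketch of the bounded-queue backpressure pattern described above, assuming hypothetical prefill/insert helpers; the worker names, stubs, and queue bound are illustrative, not JetStream's actual orchestrator API.

```python
# Sketch (not JetStream's API): dispatch prefill as soon as requests
# arrive, and use a bounded queue as backpressure so that inserts into
# the decode batch keep pace with prefill.
import queue
import threading
import time

PENDING_LIMIT = 8  # assumed bound; in practice tuned to the decode batch

pending: "queue.Queue" = queue.Queue(maxsize=PENDING_LIMIT)

def run_prefill(request: str) -> tuple[str, dict]:
    """Stub prefill: returns (first_token, kv_cache)."""
    time.sleep(0.01)
    return request[:1], {"request": request}

def prefill_worker(requests: list[str]) -> None:
    for request in requests:
        first_token, kv_cache = run_prefill(request)
        print(f"first token for {request!r}: {first_token!r}")  # return TTFT early
        pending.put(kv_cache)  # blocks when the queue is full (backpressure)
    pending.put(None)  # sentinel: no more work

def insert_worker() -> None:
    while (kv_cache := pending.get()) is not None:
        # Insert the prefilled cache into the running decode batch ASAP.
        print(f"inserted cache for {kv_cache['request']!r}")

requests = [f"prompt-{i}" for i in range(4)]
producer = threading.Thread(target=prefill_worker, args=(requests,))
consumer = threading.Thread(target=insert_worker)
producer.start(); consumer.start()
producer.join(); consumer.join()
```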
Do not merge until https://github.com/google/JetStream/pull/127 lands.
I should be able to serve a model simply by providing the HuggingFace model ID. Requiring users to convert checkpoints first is too cumbersome.
I have analyzed the `request-rate` and `interval` variables in `benchmarking_script.py` and would like to check that my understanding is correct: the `request-rate`...
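
For context, serving benchmarks commonly derive the inter-arrival `interval` from `request-rate` by sampling an exponential distribution with mean `1/request_rate`, so that arrivals form a Poisson process. The sketch below illustrates that pattern; whether `benchmarking_script.py` does exactly this is an assumption to verify against the source.

```python
# Common request-rate / interval relationship in serving benchmarks:
# gaps between requests are exponentially distributed with mean
# 1/request_rate, i.e. Poisson arrivals. Illustrative, not quoted from
# benchmarking_script.py.
import asyncio
import random

async def generate_requests(prompts, request_rate: float):
    """Yield prompts with exponentially distributed gaps between them."""
    for prompt in prompts:
        yield prompt
        if request_rate == float("inf"):
            continue  # burst mode: send everything back to back
        interval = random.expovariate(request_rate)  # mean = 1/request_rate
        await asyncio.sleep(interval)

async def main():
    async for prompt in generate_requests(["p1", "p2", "p3"], request_rate=2.0):
        print("sent", prompt)

asyncio.run(main())
```

Under this reading, a higher `request-rate` shrinks the average `interval`, and an infinite rate degenerates to sending all requests at once.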