
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (GPUs to come in the future -- PRs welcome).

21 JetStream issues (sorted by recently updated)

- General cleanup of instructions
- Generalize the llama checkpoint conversion script so that it supports custom GCS buckets for the maxtext checkpoints
- Add quantization instructions

Currently the model conversion script will [create a bucket](https://github.com/google/JetStream/blob/main/jetstream/tools/maxtext/model_ckpt_conversion.sh#L36) via `export MODEL_BUCKET=gs://${USER}-maxtext`. However, the `gs://${USER}-maxtext` path may already exist, which I imagine would break the script....
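One way to address this is to make bucket creation idempotent: check for the bucket first and reuse it if present. The sketch below is hypothetical Python (the actual script is shell); `existing` and `create` are placeholders standing in for `gsutil ls` and `gsutil mb`, not real GCS calls:

```python
def ensure_bucket(name: str, existing: set, create) -> bool:
    """Create the bucket only if it does not already exist.

    `existing` stands in for the set of buckets `gsutil ls` would
    report, and `create` for a `gsutil mb` call; both are
    illustrative placeholders. Returns True if a bucket was created.
    """
    if name in existing:
        return False  # reuse the existing bucket instead of failing
    create(name)
    return True
```

The same guard in the shell script would amount to skipping `gsutil mb` when listing the bucket succeeds.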

Great work! When will GPU support be released?

- Optimized TPU duty cycle (largest gap < 4 ms)
- Optimized TTFT: dispatch prefill tasks ASAP without unnecessary blocking on the CPU, keep backpressure to enforce insert ASAP, return the first token...
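The backpressure idea above can be sketched with a bounded queue: prefill results are handed off as soon as they are ready, but the producer blocks once the queue fills, which paces prefill to the insert stage. All names below are illustrative, not JetStream's actual API:

```python
import queue
import threading

def prefill_worker(requests, transfer_q):
    # Dispatch prefill results ASAP; put() blocks when the queue is
    # full -- that blocking is the backpressure pacing prefill.
    for req in requests:
        first_token = f"prefill({req})"  # placeholder for real prefill
        transfer_q.put(first_token)      # blocks if insert lags behind
    transfer_q.put(None)                 # sentinel: no more work

def insert_worker(transfer_q, results):
    while (item := transfer_q.get()) is not None:
        results.append(item)             # placeholder for KV-cache insert

transfer_q = queue.Queue(maxsize=2)      # small bound => tight backpressure
results = []
t = threading.Thread(target=insert_worker, args=(transfer_q, results))
t.start()
prefill_worker(["r0", "r1", "r2", "r3"], transfer_q)
t.join()
```

The `maxsize` bound is the knob: a larger queue tolerates more insert-stage jitter at the cost of letting prefill run further ahead.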

Do not merge until https://github.com/google/JetStream/pull/127 lands.

I should be able to serve a model by simply providing its HuggingFace model ID. Requiring users to convert checkpoints manually is too cumbersome.

I analyzed the `request-rate` and `interval` variables in `benchmarking_script.py` and would like to confirm my understanding. My understanding is that the `request-rate`...
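If the script follows the pattern common to LLM serving benchmarks, `request-rate` is the mean arrival rate in requests per second, and each inter-request `interval` is drawn from an exponential distribution with mean `1 / request_rate`, producing a Poisson arrival process. A hedged sketch of that pattern, not necessarily the script's exact code:

```python
import random

def arrival_intervals(request_rate: float, n: int, seed: int = 0):
    """Yield n inter-arrival gaps (seconds) for a Poisson process.

    The mean gap is 1 / request_rate. An infinite rate conventionally
    means "send all requests at once", encoded here as a zero gap.
    """
    rng = random.Random(seed)
    for _ in range(n):
        if request_rate == float("inf"):
            yield 0.0
        else:
            yield rng.expovariate(request_rate)  # mean = 1 / rate
```

Under this reading, doubling `request-rate` halves the average `interval` while keeping the exponential shape, so bursts of closely spaced requests still occur.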