jetstream-pytorch

PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference

Results: 11 jetstream-pytorch issues

As reported by @tengomucho, there are currently a few issues with the prefill / generate implementation: 1. Prefill does not use `self._sample` to do sampling. 2. Prefill returns a token, so...
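A rough illustration of the kind of change the first point suggests, assuming a hypothetical engine whose prefill currently picks its first token with a greedy argmax; the names below (`_sample`, `prefill_last_token`) are placeholders, not the actual jetstream-pytorch code:

```python
import jax
import jax.numpy as jnp

# Hypothetical sketch: route prefill's first-token choice through the same
# sampler used by generate, instead of a hard-coded greedy argmax.
def _sample(logits, rng, temperature=1.0):
    # Shared sampler (placeholder): greedy when temperature == 0, else sample.
    if temperature == 0.0:
        return jnp.argmax(logits, axis=-1)
    return jax.random.categorical(rng, logits / temperature, axis=-1)

def prefill_last_token(prefill_logits, rng, temperature=1.0):
    # prefill_logits: [batch, seq_len, vocab]; only the last position
    # produces the first decoded token.
    last = prefill_logits[:, -1, :]
    return _sample(last, rng, temperature)
```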

I'm receiving an error when attempting to run:
```
ray job submit -- python run_ray_serve_interleave.py --tpu_chips=4 --num_hosts=1 --size=8B --model_name=llama-3 --batch_size=8 --max_cache_length=2048 --tokenizer_path=$tokenizer_path --checkpoint_path=$output_ckpt_dir --quantize_weights=True --quantize_type="int8_per_channel" --quantize_kv_cache=True --sharding_config="default_shardings/llama.yaml"
```
on a...

We have seen a few service breakages and performance degradations in the last two weeks, and it's hard to identify which PR caused an issue when the issue is only found days later. Right...

- Fix the sharding yml file for proper megatron sharding
- Add a weight processing hook to pad blockwise quantized weights so that the sharded dimension is divisible by the number of... (see the padding sketch below)
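A minimal sketch of the padding idea in the second bullet, assuming the hook zero-pads the sharded dimension up to the next multiple of the device count; `pad_for_sharding` and the zero-fill choice are assumptions, not the actual hook:

```python
import numpy as np

def pad_for_sharding(weight: np.ndarray, num_devices: int, shard_dim: int = 0) -> np.ndarray:
    # Pad the sharded dimension with zeros so it is divisible by num_devices.
    size = weight.shape[shard_dim]
    remainder = size % num_devices
    if remainder == 0:
        return weight
    pad = num_devices - remainder
    pad_width = [(0, 0)] * weight.ndim
    pad_width[shard_dim] = (0, pad)
    return np.pad(weight, pad_width)

# Example: a [10, 4096] blockwise-quantized weight sharded over 4 devices
# becomes [12, 4096], so each device holds an equal [3, 4096] slice.
w = np.zeros((10, 4096), dtype=np.int8)
print(pad_for_sharding(w, num_devices=4).shape)  # (12, 4096)
```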

When sending multiple prompts to the server, only the first prompt returns any results; requests after the first one return only an empty response. I've tried 3...

The checkpoint conversion script breaks for https://huggingface.co/meta-llama/Llama-2-7b, because it does not have safetensor files. But when running the script, we set --from_hf=True since the checkpoint is from HF. We could...
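A hedged sketch of one possible fallback, assuming the converter could look for PyTorch `.bin` shards when a Hugging Face checkpoint directory has no safetensors files; `find_hf_weight_files` is a hypothetical helper, not part of the actual script:

```python
from pathlib import Path

def find_hf_weight_files(checkpoint_dir: str):
    # Prefer *.safetensors; fall back to pytorch_model*.bin if none exist.
    ckpt = Path(checkpoint_dir)
    safetensors = sorted(ckpt.glob("*.safetensors"))
    if safetensors:
        return safetensors, "safetensors"
    bins = sorted(ckpt.glob("pytorch_model*.bin"))
    if bins:
        return bins, "pytorch"
    raise FileNotFoundError(f"No safetensors or .bin weights found in {ckpt}")
```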

Right now, the ray engine returns the interleave engine and a tuple separately. In the end, we would like to return a stable Tuple list for both of them.
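A hypothetical sketch of the stable return shape this asks for; the function name and types are illustrative only, not the repo's actual API:

```python
from typing import List, Tuple

# Hypothetical sketch: both the interleave path and the ray path return
# (prefill_engines, generate_engines) as two lists, even when each list
# holds a single engine.
def create_engines(use_ray: bool) -> Tuple[List[object], List[object]]:
    prefill_engines: List[object] = []
    generate_engines: List[object] = []
    # ... build engines for the chosen backend ...
    return prefill_engines, generate_engines
```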

Currently, sampling params such as temperature are set as command-line flags when the server starts. It would be nice for each request to pass in its own sampling params instead.
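As a hedged illustration of the per-request alternative, a JSON payload could carry the sampling params with each request; the field names here are assumptions, not the server's actual schema:

```python
import json

# Illustrative request payload: sampling params travel with each request
# instead of being fixed by server-start flags. Field names are assumptions.
request = {
    "prompt": "Explain TPUs in one sentence.",
    "max_tokens": 64,
    "sampling_params": {
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 40,
    },
}
print(json.dumps(request, indent=2))
```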