jetstream-pytorch

PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference

Results: 11 jetstream-pytorch issues

As reported by @tengomucho, there are currently a few issues with the prefill / generate implementation: 1. Prefill does not use `self._sample` to do sampling. 2. Prefill returns a token, so...
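A rough illustration of the kind of change the first point suggests, assuming a hypothetical engine whose prefill currently picks its first token with a greedy argmax; the names below (`_sample`, `prefill_last_token`) are placeholders, not the actual jetstream-pytorch code:

```python
import jax
import jax.numpy as jnp

# Hypothetical sketch: route prefill's first-token choice through the same
# sampler used by generate, instead of a hard-coded greedy argmax.
def _sample(logits, rng, temperature=1.0):
    # Shared sampler (placeholder): greedy when temperature == 0, else sample.
    if temperature == 0.0:
        return jnp.argmax(logits, axis=-1)
    return jax.random.categorical(rng, logits / temperature, axis=-1)

def prefill_last_token(prefill_logits, rng, temperature=1.0):
    # prefill_logits: [batch, seq_len, vocab]; only the last position
    # produces the first decoded token.
    last = prefill_logits[:, -1, :]
    return _sample(last, rng, temperature)
```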

I'm receiving an error when attempting to run:
```
ray job submit -- python run_ray_serve_interleave.py --tpu_chips=4 --num_hosts=1 --size=8B --model_name=llama-3 --batch_size=8 --max_cache_length=2048 --tokenizer_path=$tokenizer_path --checkpoint_path=$output_ckpt_dir --quantize_weights=True --quantize_type="int8_per_channel" --quantize_kv_cache=True --sharding_config="default_shardings/llama.yaml"
```
on a...

We have seen a few service breakages and performance degradations in the last two weeks, and it's hard to identify which PR caused an issue when the issue is only found days later. Right...

- Fix the sharding yml file for proper megatron sharding
- Add a weight processing hook to pad blockwise quantized weights so that the sharded dimension is divisible by the number of... (see the padding sketch below)
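A minimal sketch of the padding idea in the second bullet, assuming the hook zero-pads the sharded dimension up to the next multiple of the device count; `pad_for_sharding` and the zero-fill choice are assumptions, not the actual hook:

```python
import numpy as np

def pad_for_sharding(weight: np.ndarray, num_devices: int, shard_dim: int = 0) -> np.ndarray:
    # Pad the sharded dimension with zeros so it is divisible by num_devices.
    size = weight.shape[shard_dim]
    remainder = size % num_devices
    if remainder == 0:
        return weight
    pad = num_devices - remainder
    pad_width = [(0, 0)] * weight.ndim
    pad_width[shard_dim] = (0, pad)
    return np.pad(weight, pad_width)

# Example: a [10, 4096] blockwise-quantized weight sharded over 4 devices
# becomes [12, 4096], so each device holds an equal [3, 4096] slice.
w = np.zeros((10, 4096), dtype=np.int8)
print(pad_for_sharding(w, num_devices=4).shape)  # (12, 4096)
```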

When sending multiple prompts to the server, only the first prompt returns any results; requests after the first one return only an empty response. I've tried 3...

The checkpoint conversion script breaks for https://huggingface.co/meta-llama/Llama-2-7b, because it does not have safetensor files. But when running the script, we set --from_hf=True since the checkpoint is from HF. We could...
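A hedged sketch of one possible fallback, assuming the converter could look for PyTorch `.bin` shards when a Hugging Face checkpoint directory has no safetensors files; `find_hf_weight_files` is a hypothetical helper, not part of the actual script:

```python
from pathlib import Path

def find_hf_weight_files(checkpoint_dir: str):
    # Prefer *.safetensors; fall back to pytorch_model*.bin if none exist.
    ckpt = Path(checkpoint_dir)
    safetensors = sorted(ckpt.glob("*.safetensors"))
    if safetensors:
        return safetensors, "safetensors"
    bins = sorted(ckpt.glob("pytorch_model*.bin"))
    if bins:
        return bins, "pytorch"
    raise FileNotFoundError(f"No safetensors or .bin weights found in {ckpt}")
```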

Right now, the ray engine returns the interleave engine and a tuple separately. In the end, we would like to return a stable Tuple list for both of them.
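A hypothetical sketch of the stable return shape this asks for; the function name and types are illustrative only, not the repo's actual API:

```python
from typing import List, Tuple

# Hypothetical sketch: both the interleave path and the ray path return
# (prefill_engines, generate_engines) as two lists, even when each list
# holds a single engine.
def create_engines(use_ray: bool) -> Tuple[List[object], List[object]]:
    prefill_engines: List[object] = []
    generate_engines: List[object] = []
    # ... build engines for the chosen backend ...
    return prefill_engines, generate_engines
```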

Currently, sampling params such as temperature are set as command-line flags when the server starts. It would be nice for each request to pass in its own sampling params instead.
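As a hedged illustration of the per-request alternative, a JSON payload could carry the sampling params with each request; the field names here are assumptions, not the server's actual schema:

```python
import json

# Illustrative request payload: sampling params travel with each request
# instead of being fixed by server-start flags. Field names are assumptions.
request = {
    "prompt": "Explain TPUs in one sentence.",
    "max_tokens": 64,
    "sampling_params": {
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 40,
    },
}
print(json.dumps(request, indent=2))
```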