
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

132 gpt-fast issues

Hi, I just did a quick implementation of gpt-fast and ran inference on Llama-2 7B. I seem to get around 65 tokens per second on average without quantization....
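
A minimal sketch of how a tokens-per-second figure like this can be measured, assuming a `generate_fn` callable that emits one token per call (a stand-in for gpt-fast's decode step, not its actual API):

```python
import time
import torch

def measure_tokens_per_sec(generate_fn, num_tokens: int = 200) -> float:
    generate_fn()  # warm-up call so compilation/caching doesn't skew timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_tokens):
        generate_fn()  # one decode step -> one new token
    torch.cuda.synchronize()  # flush queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed
```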

Amazing work! How were the benchmark results obtained? Is it just the generation speed measured with the GPU power-limited to 330W?

The existing code defines a variable `precision` ([here](https://github.com/pytorch-labs/gpt-fast/blob/3bcaaaf068d112d534f335ec21a17d7b8b5551bf/generate.py#L266)) which is then used in `_load_model()` [here](https://github.com/pytorch-labs/gpt-fast/blob/3bcaaaf068d112d534f335ec21a17d7b8b5551bf/generate.py#L230) to set the dtype for the model. However, this variable was not getting passed to...

CLA Signed
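
For context, a minimal sketch of the fix being described, assuming the change is simply to thread `precision` through to model loading; the tiny `nn.Linear` stands in for the real Transformer, and this `_load_model` is a simplified illustration, not the actual function in generate.py:

```python
import torch
from torch import nn

def _load_model(checkpoint_path: str, device: str, precision: torch.dtype) -> nn.Module:
    # Simplified stand-in for generate.py's _load_model(); the key change is
    # that `precision` is now applied instead of being left unused.
    model = nn.Linear(16, 16)  # toy model in place of the real Transformer
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    return model.to(device=device, dtype=precision)

torch.save(nn.Linear(16, 16).state_dict(), "toy.pth")  # stand-in checkpoint
precision = torch.bfloat16                        # defined once in the script...
model = _load_model("toy.pth", "cpu", precision)  # ...and now passed through
print(next(model.parameters()).dtype)             # torch.bfloat16
```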

Load the checkpoint directly to the device: in my testing, loading Llama 7B went from 7.83 to 5.78 seconds (about 25% faster). I also noticed that memory usage doubles temporarily, at least...

CLA Signed
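
A rough sketch of what loading directly to device can look like, assuming the slower path deserialized to CPU first; `mmap=True` (PyTorch 2.1+) also speaks to the temporary memory doubling, since it maps the file rather than buffering a full copy in host RAM:

```python
import torch

# Assumed slow path: deserialize into CPU RAM, then copy every tensor over.
# state_dict = torch.load("checkpoints/model.pth", map_location="cpu")
# state_dict = {k: v.to("cuda") for k, v in state_dict.items()}

# Direct-to-device path: map storages onto the target device as they are read.
state_dict = torch.load(
    "checkpoints/model.pth",
    map_location="cuda",
    mmap=True,  # PyTorch 2.1+: memory-map the checkpoint file
)
```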

I want to use interactive mode and ran this command:

```
python generate.py --compile --interactive --draft_checkpoint_path checkpoints/$DRAFT_MODEL_REPO/model_int8.pth --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --speculate_k 3
```

However, I got the following error, could you please...

Great blog post! Is there any documentation on how `inductor` lowers the `ops` in the `fx graph` to actual kernels -- specifically the optimization / tuning that determines the actual kernel...
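
One practical starting point: compile a toy function and dump the generated kernels via PyTorch's standard debug switches (`TORCH_LOGS="output_code"` or `TORCH_COMPILE_DEBUG=1`); inductor's fusion and tuning decisions show up directly in the emitted Triton/C++ output. A small sketch:

```python
import torch

@torch.compile
def f(x):
    # A small pointwise chain: inductor typically fuses this into one kernel.
    return torch.nn.functional.gelu(x) * 2 + 1

# Run as:  TORCH_LOGS="output_code" python inspect_inductor.py
# (or TORCH_COMPILE_DEBUG=1 to dump full debug artifacts to disk)
x = torch.randn(1024, device="cuda" if torch.cuda.is_available() else "cpu")
f(x)
```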

Running the 13B chat model on an L4 GPU with

```
python generate.py --checkpoint_path .../model_int4.g32.pth --compile --compile_prefill
```

an error happens:

```
Traceback (most recent call last):
  File "/home/user/gpt-fast/generate.py", line 407, in...
```

Is there a performance loss for int4 compared with AWQ?