gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
int4 quantization on CPU causes: ``` Traceback (most recent call last): File "/home/user/gpt-fast/quantize.py", line 622, in quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label) File "/home/user/gpt-fast/quantize.py", line 569,...
Implement gpt-fast using the flex_attention HOP. This relies on this PR: https://github.com/pytorch/pytorch/pull/132157
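For context, a minimal sketch of the flex_attention higher-order op from torch.nn.attention.flex_attention (recent PyTorch, 2.5+), not the implementation proposed in this issue; the shapes and the causal mask_mod are illustrative assumptions:
```python
# Sketch only: causal attention via the flex_attention HOP; shapes are made up.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 128, 64
q = torch.randn(B, H, S, D, device="cuda")
k = torch.randn(B, H, S, D, device="cuda")
v = torch.randn(B, H, S, D, device="cuda")

def causal(b, h, q_idx, kv_idx):
    # mask_mod: keep only key positions at or before the query position
    return q_idx >= kv_idx

# B=None / H=None broadcast the mask over batch and heads
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)

# flex_attention is compilable; gpt-fast would presumably wrap this in torch.compile.
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```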
For the Llama model, set enable_gqa=True in the sdpa call to use the built-in grouped-query attention functionality
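The built-in GQA path referred to here is the enable_gqa flag on torch.nn.functional.scaled_dot_product_attention (available in recent PyTorch releases). A minimal sketch with made-up shapes, not the actual gpt-fast attention code:
```python
# Sketch: grouped-query attention via scaled_dot_product_attention(enable_gqa=True).
import torch
import torch.nn.functional as F

bsz, seq_len, head_dim = 2, 16, 64
n_heads, n_kv_heads = 32, 8  # query heads must be a multiple of key/value heads

q = torch.randn(bsz, n_heads, seq_len, head_dim)
k = torch.randn(bsz, n_kv_heads, seq_len, head_dim)
v = torch.randn(bsz, n_kv_heads, seq_len, head_dim)

# With enable_gqa=True the kv heads are shared across query-head groups,
# so there is no need to repeat_interleave k and v up to n_heads first.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
print(out.shape)  # torch.Size([2, 32, 16, 64])
```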
This PR decouples the int4 weight from its serialized format, so that an int4 model checkpoint can be shared across different test machines or ISAs without re-generating it on one specific platform....
I used the converter here: https://github.com/pytorch-labs/gpt-fast/blob/main/scripts/convert_hf_checkpoint.py but I get this error when trying to convert my Hugging Face checkpoint: ``` swarms@dpm4:~/gpt-fast/scripts$ python3 convert_hf_checkpoint.py --checkpoint_dir /home/swarms/checkpoint-4000/ --model_name large-v3 Traceback (most recent...
I fine-tuned an LLM based on the Llama skeleton and used convert_hf_checkpoint and quantize to complete the quantization. However, when generating, the tokenizer.model file is missing. How should I proceed...
I'm glad torch.compile gives such a large speedup. On an A5000 it is about 60% faster, but there is no acceleration on an L4. I'd like to know why this happens?...
I cloned the gpt-fast repo and tried it out with Llama-3. To set up, I ran the following code: ```bash pip install huggingface_hub[hf_transfer] export HF_HUB_ENABLE_HF_TRANSFER=1 python3 -m pip install -r ./requirements.txt...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #180 Status: - Switched to DTensor-based TP in the regular tensor path - The result is correct, but there is a perf gap...
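For readers unfamiliar with the approach, a minimal sketch of DTensor-based tensor parallelism via parallelize_module; the toy FeedForward module, dimensions, and launch setup are assumptions for illustration, not the code from this PR:
```python
# Sketch: DTensor-based TP for a two-layer MLP, launched with
#   torchrun --nproc_per_node=<world_size> this_script.py
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, parallelize_module,
)

class FeedForward(nn.Module):  # hypothetical toy module, not gpt-fast's
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
world_size = int(os.environ["WORLD_SIZE"])
mesh = init_device_mesh("cuda", (world_size,))

torch.manual_seed(0)  # same weights and inputs on every rank
model = FeedForward().cuda()

# Colwise-shard the up projection and rowwise-shard the down projection, so the
# intermediate activation stays sharded and only one all-reduce is needed.
model = parallelize_module(model, mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})

out = model(torch.randn(8, 256, device="cuda"))
```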