
Run PyTorch LLMs locally on servers, desktop and mobile

Results: 143 torchchat issues

Very slow tokens/second in FP32; it feels worse than it should be, but I'm not entirely sure of the best way to debug. $ python3 torchchat.py generate --prompt "hello model" -v llama2...

performance
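One way to start debugging (a sketch, not torchchat's own tooling): compare raw fp32 vs bf16 matmul throughput on the same machine, to see whether the slow FP32 numbers are simply hardware-limited or torchchat-specific. The `bench` helper below is hypothetical.

```python
import time
import torch

# Hypothetical sanity check: measure raw matmul throughput per dtype on the
# target device. If fp32 is proportionally slow here too, the generation
# numbers may just reflect the hardware rather than a torchchat bug.
def bench(dtype, n=2048, iters=20, device="cpu"):
    a = torch.randn(n, n, dtype=dtype, device=device)
    b = torch.randn(n, n, dtype=dtype, device=device)
    torch.matmul(a, b)  # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    elapsed = time.perf_counter() - start
    return 2 * n**3 * iters / elapsed / 1e9  # GFLOP/s

for dt in (torch.float32, torch.bfloat16):
    print(dt, f"{bench(dt):.1f} GFLOP/s")
```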

Blobfile is not installing correctly with pip. ``` (cchat) scroy@scroy-mbp torchchat % which python /opt/miniconda3/envs/cchat/bin/python (cchat) scroy@scroy-mbp torchchat % pip install blobfile Requirement already satisfied: blobfile in /opt/miniconda3/envs/cchat/lib/python3.10/site-packages (2.1.1) Requirement already...

I tried the following four versions on my MacBook Pro M1. (1) - Really slow ``` python3 torchchat.py generate llama3 --prompt "Hello, my name is" --quantize '{"embedding": {"bitwidth": 4, "groupsize":32},...

performance
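For reference, a complete quantize config of the kind being compared above might look like the sketch below. The scheme names ("embedding", "linear:int4") are assumed from torchchat's quantization docs, not taken from this report's exact configs.

```python
import json

# Illustrative only: build one --quantize argument of the kind compared above.
# Scheme names are an assumption based on torchchat's quantization docs.
config = {
    "embedding": {"bitwidth": 4, "groupsize": 32},
    "linear:int4": {"groupsize": 32},
}
print(json.dumps(config))  # paste the output as the --quantize argument
```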

@jerryzh168 Please add consistent padding support in torchao to make models quantizable. @digantdesai, what's the best way to implement this - just round up and ignore part of the result?...
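A minimal sketch of the "round up" idea, assuming the in-features dimension is the one group-wise schemes group over; `pad_to_multiple` is a hypothetical helper, not a torchao API. Zero-padding that dimension on both weights and activations is exact, so nothing even needs to be ignored afterwards (out-features padding would instead require slicing the output).

```python
import torch
import torch.nn.functional as F

# Hypothetical helper: zero-pad the last (in-features) dimension up to a
# multiple of groupsize so group-wise quantization divides it evenly.
def pad_to_multiple(t: torch.Tensor, groupsize: int) -> torch.Tensor:
    pad = (-t.shape[-1]) % groupsize
    return F.pad(t, (0, pad))

w = torch.randn(11, 50)   # in_features=50 is not a multiple of groupsize=32
x = torch.randn(4, 50)
w_p, x_p = pad_to_multiple(w, 32), pad_to_multiple(x, 32)
# Zero columns contribute nothing, so the padded matmul matches exactly.
assert torch.allclose(x_p @ w_p.t(), x @ w.t(), atol=1e-5)
```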

``` (py311) mikekg@mikekg-mbp torchchat % python torchchat.py export --output-dso s.so --quant '{"embedding": {"bitwidth":8, "groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 Using device=cpu Loading model... Time to load model: 0.04 seconds Quantizing...

@ali-khosh user report: I’m being asked “Do you want to enter a system prompt? Enter y for yes and anything else for no.” I'm not sure what this means. When I...

See the run here: https://github.com/pytorch/torchchat/actions/runs/8872459136/job/24356835073 We can always upcast to make things pass, but if there's an easy way to build float16 and bfloat16 flavors (iPad Pro has M-series chip...

In https://github.com/pytorch/torchchat/blob/main/build/gguf_loader.py, we directly convert Q4_0-quantized linear weights with _convert_weight_to_int4pack (our native 4-bit quantization in PyTorch). All other tensors are converted to float. We should be able to directly...
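For context, a sketch of what handling Q4_0 directly involves. The block layout below is assumed from the ggml/GGUF format: 18-byte blocks holding one fp16 scale `d` plus 16 bytes packing 32 4-bit quants, with value = d * (q - 8); the nibble ordering has varied across ggml versions, so treat it as an assumption.

```python
import numpy as np

# Sketch of Q4_0 dequantization under the assumed block layout:
# [2-byte fp16 scale d][16 bytes: 32 packed 4-bit quants], value = d * (q - 8).
def dequantize_q4_0(raw: bytes, n_blocks: int) -> np.ndarray:
    blocks = np.frombuffer(raw, dtype=np.uint8).reshape(n_blocks, 18)
    d = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # (n_blocks, 1)
    qs = blocks[:, 2:]                                  # (n_blocks, 16)
    lo = (qs & 0x0F).astype(np.int8) - 8                # elements 0..15
    hi = (qs >> 4).astype(np.int8) - 8                  # elements 16..31
    return (d * np.concatenate([lo, hi], axis=1)).reshape(-1)
```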

We are reverting #539, which added more dtype tests for runner-aoti + runner-et, because of failures - there's no point in having failing tests. That being said, we should figure...

The assumption right now is that it's only needed when there is not enough GPU memory, but perhaps sometimes it's just faster this way. Right now we only do tokenization on the CPU and...

enhancement