
Run PyTorch LLMs locally on servers, desktop and mobile

Results: 143 torchchat issues

Very slow tokens/second in FP32; it feels worse than it should be, but I'm not entirely sure of the best way to debug. $ python3 torchchat.py generate --prompt "hello model" -v llama2...

performance
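One way to start debugging (a sketch, not torchchat's own tooling): compare raw fp32 vs bf16 matmul throughput on the same machine, to see whether the slow FP32 numbers are simply hardware-limited or torchchat-specific. The `bench` helper below is hypothetical.

```python
import time
import torch

# Hypothetical sanity check: measure raw matmul throughput per dtype on the
# target device. If fp32 is proportionally slow here too, the generation
# numbers may just reflect the hardware rather than a torchchat bug.
def bench(dtype, n=2048, iters=20, device="cpu"):
    a = torch.randn(n, n, dtype=dtype, device=device)
    b = torch.randn(n, n, dtype=dtype, device=device)
    torch.matmul(a, b)  # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    elapsed = time.perf_counter() - start
    return 2 * n**3 * iters / elapsed / 1e9  # GFLOP/s

for dt in (torch.float32, torch.bfloat16):
    print(dt, f"{bench(dt):.1f} GFLOP/s")
```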

Blobfile is not installing correctly with pip. ``` (cchat) scroy@scroy-mbp torchchat % which python /opt/miniconda3/envs/cchat/bin/python (cchat) scroy@scroy-mbp torchchat % pip install blobfile Requirement already satisfied: blobfile in /opt/miniconda3/envs/cchat/lib/python3.10/site-packages (2.1.1) Requirement already...

I tried the following four versions on my MacBook Pro M1. (1) - Really slow ``` python3 torchchat.py generate llama3 --prompt "Hello, my name is" --quantize '{"embedding": {"bitwidth": 4, "groupsize":32},...

performance
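For reference, a complete quantize config of the kind being compared above might look like the sketch below. The scheme names ("embedding", "linear:int4") are assumed from torchchat's quantization docs, not taken from this report's exact configs.

```python
import json

# Illustrative only: build one --quantize argument of the kind compared above.
# Scheme names are an assumption based on torchchat's quantization docs.
config = {
    "embedding": {"bitwidth": 4, "groupsize": 32},
    "linear:int4": {"groupsize": 32},
}
print(json.dumps(config))  # paste the output as the --quantize argument
```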

@jerryzh168 Please add consistent padding support in torchao to make models quantizable. @digantdesai, what's the best way to implement this - just round up and ignore part of the result?...
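A minimal sketch of the "round up" idea, assuming the in-features dimension is the one group-wise schemes group over; `pad_to_multiple` is a hypothetical helper, not a torchao API. Zero-padding that dimension on both weights and activations is exact, so nothing even needs to be ignored afterwards (out-features padding would instead require slicing the output).

```python
import torch
import torch.nn.functional as F

# Hypothetical helper: zero-pad the last (in-features) dimension up to a
# multiple of groupsize so group-wise quantization divides it evenly.
def pad_to_multiple(t: torch.Tensor, groupsize: int) -> torch.Tensor:
    pad = (-t.shape[-1]) % groupsize
    return F.pad(t, (0, pad))

w = torch.randn(11, 50)   # in_features=50 is not a multiple of groupsize=32
x = torch.randn(4, 50)
w_p, x_p = pad_to_multiple(w, 32), pad_to_multiple(x, 32)
# Zero columns contribute nothing, so the padded matmul matches exactly.
assert torch.allclose(x_p @ w_p.t(), x @ w.t(), atol=1e-5)
```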

``` (py311) mikekg@mikekg-mbp torchchat % python torchchat.py export --output-dso s.so --quant '{"embedding": {"bitwidth":8, "groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 Using device=cpu Loading model... Time to load model: 0.04 seconds Quantizing...

@ali-khosh user report: I’m being asked “Do you want to enter a system prompt? Enter y for yes and anything else for no.” I'm not sure what this means. When I...

See the run here: https://github.com/pytorch/torchchat/actions/runs/8872459136/job/24356835073 We can always upcast to make things pass, but if there's an easy way to build float16 and bfloat16 flavors (iPad Pro has M-series chip...

In https://github.com/pytorch/torchchat/blob/main/build/gguf_loader.py, we directly convert Q4_0-quantized linear weights with _convert_weight_to_int4pack (our native 4-bit quantization in PyTorch). All other tensors are converted to float. We should be able to directly...
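For context, a sketch of what handling Q4_0 directly involves. The block layout below is assumed from the ggml/GGUF format: 18-byte blocks holding one fp16 scale `d` plus 16 bytes packing 32 4-bit quants, with value = d * (q - 8); the nibble ordering has varied across ggml versions, so treat it as an assumption.

```python
import numpy as np

# Sketch of Q4_0 dequantization under the assumed block layout:
# [2-byte fp16 scale d][16 bytes: 32 packed 4-bit quants], value = d * (q - 8).
def dequantize_q4_0(raw: bytes, n_blocks: int) -> np.ndarray:
    blocks = np.frombuffer(raw, dtype=np.uint8).reshape(n_blocks, 18)
    d = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # (n_blocks, 1)
    qs = blocks[:, 2:]                                  # (n_blocks, 16)
    lo = (qs & 0x0F).astype(np.int8) - 8                # elements 0..15
    hi = (qs >> 4).astype(np.int8) - 8                  # elements 16..31
    return (d * np.concatenate([lo, hi], axis=1)).reshape(-1)
```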

We are reverting #539, which added more dtype tests for runner-aoti + runner-et, because of failures - there's no point in having failing tests. That being said, we should figure...

The assumption right now is that it's only needed when there is not enough GPU memory, but perhaps sometimes it's just faster this way. Right now we only do tokenization on the CPU and...

enhancement