torchchat issues

x86 CPU: BF16 should improve decoding performance relative to FP32 on x86, even without hardware BF16

3

### 🚀 The feature, motivation and pitch As you might expect given that decoding is memory-bandwidth-bound, bf16 is roughly twice as fast as fp32 on my M1 Mac: (`python torchchat.py...

swolchok

enhancement

performance

actionable

GeneratorArgs.is_torchtune_model is a misnomer

### 🚀 The feature, motivation and pitch `is_torchtune_model` is a misnomer and can result in buggy code. It gates logic for models that have [`tune` suffix](https://github.com/pytorch/torchchat/blob/d0993b3508f802e81a6917b8959907a9abff827a/torchchat/generate.py#L143), but not all torchtune...

Jack-Khuu

Support Granite Code 3B/8B

### 🚀 The feature, motivation and pitch The `torchchat` framework provides an excellent platform for embedding models into many different edge-centric platforms. The [Granite Code models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330), specifically the [3B-128k](https://huggingface.co/ibm-granite/granite-3b-code-instruct-128k) and...

gabe-l-hart

[aoti] Remove need for -l in cmake call

1

Removes the need for the `-l` in the cmake call by storing the tokenizer type during export time in the PT2. Stacked on top of https://github.com/pytorch/torchchat/pull/896

angelayi

CLA Signed

Llama 3.2 MM Multiturn Browser: Second message errors out

### 🐛 Describe the bug Kick off a server (tested on CPU): `python3 torchchat.py server llama3.2-11B` In a separate terminal, open the browser: `streamlit run torchchat/usages/browser.py` First send a...

Jack-Khuu

bug

Browser

Llama 3.2- Multimodal

llama-3.2-11b-vision : size mismatch for encoder.clip.token_pos_embedding.global_token_positional_embedding

4

### 🐛 Describe the bug From a clean install using the current main branch, llama-3.2-11b-vision seems to need some love. The download of the model files from Hugging Face succeeded using...

openconcerto

bug

Llama 3.2- Multimodal

Distributed inference runtime error

1

### 🐛 Describe the bug Running `distributed/run_dist_inference.sh` fails with the error below. [rank0]:[rank0]: model = _load_model(builder_args) [rank0]:[rank0]: File "/scratch/grace/torchchat/torchchat/cli/builder.py", line 473, in _load_model [rank0]:[rank0]: model = _maybe_parellelize_model(model,...

guijuzhang

Distributed

CLI chat mode doesn't work on 11b model

### 🐛 Describe the bug Chat mode does not work with the 11B model, either in the CLI or in the browser. ### Versions N/A

Gasoonjia

Known Gaps

Llama 3.2- Multimodal

A lot of duplicate code between generate.py and the openai_api.py

1

### 🐛 Describe the bug The API uses generate.py. The duplicated code should be consolidated into generate.py and utility functions. ### Versions N/A

byjlw

int4_weight_only in Cuda compile := RuntimeError: _apply(): Couldn't swap Linear.weight

4

### 🐛 Describe the bug When generating multiple samples from a compiled int4 model on CUDA, a runtime error occurs relating to Linear.weight swapping: ``` Traceback (most recent call last):...

Jack-Khuu

bug

Compile / AOTI

Quantization
