fxmarty

Results 316 comments of fxmarty

The kernels from the NVIDIA folks at https://github.com/tlc-pack/cutlass_fpA_intB_gemm are probably interesting in the batched scenario.

Thank you. Is this issue about speed or about logits matching with PyTorch? For speed, I'm quite sure IO Binding would help. By the way ```python # converted model ort_opt_model =...
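For context, IO Binding in ONNX Runtime looks roughly like this (a minimal sketch, assuming a CUDA execution provider; the model path and the `input_ids`/`logits` names are placeholders):

```python
import numpy as np
import onnxruntime as ort

# Assumption: model path, input name and output name are placeholders.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

input_ids = np.random.randint(0, 1000, size=(1, 128), dtype=np.int64)

# Pre-allocate the input on the GPU so ONNX Runtime does not copy it on every run.
input_ortvalue = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)

binding = session.io_binding()
binding.bind_ortvalue_input("input_ids", input_ortvalue)
# Let ONNX Runtime allocate the output directly on the GPU.
binding.bind_output("logits", "cuda")

session.run_with_iobinding(binding)

# Copy back to host only when the result is actually needed.
logits = binding.copy_outputs_to_cpu()[0]
```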

Hi @snowyu, @xenova relies on the ONNX export for transformers.js so it is still to be done!

@uyeongkim I opened a similar issue at: https://github.com/huggingface/huggingface_hub/issues/2281 Related issue for `stream=True`: https://github.com/huggingface/text-generation-inference/issues/1530 Since you use `stream=False`, simply using `requests` instead of huggingface_hub should work for you: ```python import requests...
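For the `stream=False` case, something along these lines should be enough (a minimal sketch, assuming a text-generation-inference endpoint; the URL and generation parameters are placeholders):

```python
import requests

# Assumption: endpoint URL and parameters are placeholders, adapt to your deployment.
url = "http://localhost:8080/generate"
payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```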

`Feature '.m16n8k16' requires .target sm_80 or higher`: in my opinion, AWQ can't run on T4 GPUs (compute capability 7.5). On an A100 you need `TORCH_CUDA_ARCH_LIST="8.0" python setup.py install`
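To check whether a given GPU meets the sm_80 requirement, a quick check with PyTorch:

```python
import torch

# sm_80 corresponds to compute capability (8, 0); a T4 reports (7, 5).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
if (major, minor) < (8, 0):
    print("This GPU does not support the mma .m16n8k16 instruction these kernels rely on.")
```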

They are the same for act_order=False; only the packing of the quantized weights differs. So the AWQ kernels and the exllama/exllamav2 kernels are essentially doing the same thing.
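To illustrate what I mean by packing (an illustration only, not the exact AWQ or GPTQ/exllama bit layouts):

```python
# Pack eight 4-bit values into one 32-bit integer in two different element
# orders: identical quantized values, just a different bit layout.
values = list(range(8))  # pretend these are 4-bit quantized weights

def pack(order):
    packed = 0
    for slot, idx in enumerate(order):
        packed |= values[idx] << (4 * slot)
    return packed

sequential = list(range(8))             # one possible packing order
interleaved = [0, 2, 4, 6, 1, 3, 5, 7]  # a hypothetical alternative order

print(hex(pack(sequential)), hex(pack(interleaved)))  # same values, different bits
```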

@frankxyy As far as I know, the quantization yields a `g_idx` ordering tensor. The best strategy with act_order that I know of is then to: 1. Reorder in advance the weights, scales,...

Oh, 1 and 2 go together. For reference https://github.com/turboderp/exllama/issues/95#issuecomment-1606199301
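A rough sketch of the reordering idea from steps 1 and 2, with toy tensors (this mirrors the exllama approach but is not its actual code):

```python
import torch

# Assumption: toy shapes and tensors, just to show the act_order reordering idea.
in_features, out_features = 8, 4
weight = torch.randn(in_features, out_features)
g_idx = torch.tensor([0, 2, 1, 3, 0, 1, 2, 3])  # quantization group index per input channel

# 1. Reorder the weights (and the matching scales/zeros) in advance so that rows
#    belonging to the same quantization group become contiguous.
perm = torch.argsort(g_idx)
weight_reordered = weight[perm, :]

# 2. At inference time, apply the same permutation to the activation columns
#    so the matmul still matches the original layout.
x = torch.randn(3, in_features)
y = x[:, perm] @ weight_reordered

# Sanity check: identical result to the un-reordered matmul.
assert torch.allclose(y, x @ weight)
```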

From my tests, AWQ has worse latency.