fxmarty

Results 316 comments of fxmarty

The kernels from the NVIDIA folks at https://github.com/tlc-pack/cutlass_fpA_intB_gemm are probably interesting in the batched scenario.

Thank you. Is this issue about speed or about logits matching with PyTorch? For speed, I'm quite sure IO Binding would help. By the way ```python # converted model ort_opt_model =...
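For context, IO Binding in ONNX Runtime looks roughly like this (a minimal sketch, assuming a CUDA execution provider; the model path and the `input_ids`/`logits` names are placeholders):

```python
import numpy as np
import onnxruntime as ort

# Assumption: model path, input name and output name are placeholders.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

input_ids = np.random.randint(0, 1000, size=(1, 128), dtype=np.int64)

# Pre-allocate the input on the GPU so ONNX Runtime does not copy it on every run.
input_ortvalue = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)

binding = session.io_binding()
binding.bind_ortvalue_input("input_ids", input_ortvalue)
# Let ONNX Runtime allocate the output directly on the GPU.
binding.bind_output("logits", "cuda")

session.run_with_iobinding(binding)

# Copy back to host only when the result is actually needed.
logits = binding.copy_outputs_to_cpu()[0]
```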

Hi @snowyu, @xenova relies on the ONNX export for transformers.js so it is still to be done!

@uyeongkim I opened a similar issue at: https://github.com/huggingface/huggingface_hub/issues/2281 Related issue for `stream=True`: https://github.com/huggingface/text-generation-inference/issues/1530 Since you use `stream=False`, simply using `requests` instead of huggingface_hub should work for you: ```python import requests...
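For the `stream=False` case, something along these lines should be enough (a minimal sketch, assuming a text-generation-inference endpoint; the URL and generation parameters are placeholders):

```python
import requests

# Assumption: endpoint URL and parameters are placeholders, adapt to your deployment.
url = "http://localhost:8080/generate"
payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```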

`Feature '.m16n8k16' requires .target sm_80 or higher`: in my opinion, AWQ can't run on T4 GPUs (compute capability 7.5). On an A100 you need `TORCH_CUDA_ARCH_LIST="8.0" python setup.py install`
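To check whether a given GPU meets the sm_80 requirement, a quick check with PyTorch:

```python
import torch

# sm_80 corresponds to compute capability (8, 0); a T4 reports (7, 5).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
if (major, minor) < (8, 0):
    print("This GPU does not support the mma .m16n8k16 instruction these kernels rely on.")
```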

They are the same for act_order=False; only the packing of the quantized weights differs. So the AWQ kernels and the exllama/exllamav2 kernels are essentially doing the same thing.
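To illustrate what I mean by packing (an illustration only, not the exact AWQ or GPTQ/exllama bit layouts):

```python
# Pack eight 4-bit values into one 32-bit integer in two different element
# orders: identical quantized values, just a different bit layout.
values = list(range(8))  # pretend these are 4-bit quantized weights

def pack(order):
    packed = 0
    for slot, idx in enumerate(order):
        packed |= values[idx] << (4 * slot)
    return packed

sequential = list(range(8))             # one possible packing order
interleaved = [0, 2, 4, 6, 1, 3, 5, 7]  # a hypothetical alternative order

print(hex(pack(sequential)), hex(pack(interleaved)))  # same values, different bits
```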

@frankxyy As far as I know, the quantization yields a `g_idx` ordering tensor. The best strategy with act_order that I know of is then to: 1. Reorder in advance the weights, scales,...

Oh, 1 and 2 go together. For reference https://github.com/turboderp/exllama/issues/95#issuecomment-1606199301
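A rough sketch of the reordering idea from steps 1 and 2, with toy tensors (this mirrors the exllama approach but is not its actual code):

```python
import torch

# Assumption: toy shapes and tensors, just to show the act_order reordering idea.
in_features, out_features = 8, 4
weight = torch.randn(in_features, out_features)
g_idx = torch.tensor([0, 2, 1, 3, 0, 1, 2, 3])  # quantization group index per input channel

# 1. Reorder the weights (and the matching scales/zeros) in advance so that rows
#    belonging to the same quantization group become contiguous.
perm = torch.argsort(g_idx)
weight_reordered = weight[perm, :]

# 2. At inference time, apply the same permutation to the activation columns
#    so the matmul still matches the original layout.
x = torch.randn(3, in_features)
y = x[:, perm] @ weight_reordered

# Sanity check: identical result to the un-reordered matmul.
assert torch.allclose(y, x @ weight)
```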

From my tests, AWQ has worse latency.