Yufeng Li comments

Results: 86 comments by Yufeng Li

Incorrect/Garbage Responses for Llama-2-7b-hf with INT4 GPTQ/RTN Asymmetric Quantization

And for the issue in the original post, do you run on Windows or Linux?

Incorrect/Garbage Responses for Llama-2-7b-hf with INT4 GPTQ/RTN Asymmetric Quantization

@VishalX, just FYI. It turns out something is wrong with mmap on Windows. If I turn off mmap, Asymmetric works on Windows. You can try it out with this branch if...

Incorrect/Garbage Responses for Llama-2-7b-hf with INT4 GPTQ/RTN Asymmetric Quantization

> @yufenglee, I tried Asymmetric BlockWise, RTN & GPTQ, with the above fix. Responses for all these include German sentences/words. Do you think this is due to quantization loss only?...

[Performance] nearest neighbor Resize operator is significantly slower than pytorch for 3D tensors

@SimonRelu, could you please profile the model and see if all nodes are running on GPU?
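One way to do that check, as a rough sketch rather than an official recipe: enable profiling on the session, run the model once, and tally the node events in the resulting JSON by execution provider. The helper name `count_nodes_by_provider` is hypothetical, and the sketch assumes the profile marks kernel events with `cat == "Node"`, a name ending in `_kernel_time`, and an `args["provider"]` field, which may differ between onnxruntime versions.

```python
import json
from collections import Counter


def load_profile(path):
    """Load the list of events from an ONNX Runtime profile JSON file."""
    with open(path) as f:
        return json.load(f)


def count_nodes_by_provider(events):
    """Tally kernel events by (execution provider, op type).

    Assumption: kernel events have cat == "Node", a name ending in
    "_kernel_time", and args containing "provider" and "op_name".
    Events without a provider (e.g. fence events) are skipped.
    """
    counts = Counter()
    for event in events:
        if event.get("cat") != "Node":
            continue
        if not str(event.get("name", "")).endswith("_kernel_time"):
            continue
        args = event.get("args", {})
        provider = args.get("provider")
        if provider:
            counts[(provider, args.get("op_name", "?"))] += 1
    return counts


# Producing the profile (requires onnxruntime; "model.onnx" and `inputs`
# are placeholders for your own model and feed dict):
#
#   import onnxruntime as ort
#   opts = ort.SessionOptions()
#   opts.enable_profiling = True
#   sess = ort.InferenceSession(
#       "model.onnx", opts,
#       providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
#   sess.run(None, inputs)
#   profile_path = sess.end_profiling()  # path to the JSON profile
#   for key, n in count_nodes_by_provider(load_profile(profile_path)).items():
#       print(key, n)
```

Any entry whose provider is `CPUExecutionProvider` points at a node that fell back to the CPU, which is a common cause of this kind of slowdown.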

AOT compilation

Nice! Do you have a rough estimate of when it will be done?

AOT compilation

And what will AoT compilation generate, a C/C++ API plus source/.so?

[Feature Request] 4bit and 2bit and 1bit quantization support

> Being able to convert a HF model for 4-bit quantization would be awesome!!

The QLLM tool can convert a 4-bit HF model to ONNX: https://github.com/wejoncy/QLLM. And a tool from...

Importing onnxruntime on AWS Lambdas with ARM64 processor causes crash

You need to include both #10199 and #10334.

Performance Gap between Neural Speed Matmul Operator and Llama.cpp Operator

> Thanks for your report! What's the accuracy level of this model's MatMulNBits? We use the fp32

Performance Gap between Neural Speed Matmul Operator and Llama.cpp Operator

This is the tool to get the benchmark number: https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python