Results: 61 comments of Meng, Hengyu

Hi, thank you for your excellent work! I am quite new to MLIR, so my questions may be naive; please bear with me. I see ```ArgMax``` is supported in onnx-mlir but...

Hi wenjingk, we prefer tile-wise sparsity in INC, which is a balance between accuracy and performance. Tile-wise sparsity means dividing the whole matrix into tiles; if there are non-zero elements...
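To illustrate the idea, here is a minimal NumPy sketch of tile-wise sparsity: split a weight matrix into fixed-size tiles and keep a tile only if it contains any non-zero element. The function name and the 4x4 tile shape are illustrative assumptions, not the actual INC API.

```python
import numpy as np

def tile_keep_mask(weight, tile=(4, 4)):
    """Hypothetical sketch: return a boolean mask per tile.
    A tile is kept (True) if it has any non-zero element,
    and pruned as a whole (False) otherwise."""
    rows, cols = weight.shape
    th, tw = tile
    assert rows % th == 0 and cols % tw == 0, "shape must be tileable"
    # reshape to (n_tile_rows, th, n_tile_cols, tw) so each tile is addressable
    tiles = weight.reshape(rows // th, th, cols // tw, tw)
    # a tile survives if the sum of absolute values inside it is non-zero
    keep = np.abs(tiles).sum(axis=(1, 3)) > 0
    return keep

# toy example: an 8x8 matrix with a single non-zero entry
w = np.zeros((8, 8))
w[0, 0] = 1.0
mask = tile_keep_mask(w, tile=(4, 4))
print(mask.shape)  # (2, 2): one flag per 4x4 tile
print(mask)
```

Pruning whole tiles rather than individual elements is what makes the pattern hardware-friendly: a dense kernel can skip an entire tile in one check.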

hi @Jacoby1218 could you provide some reference data to show the magnitude of the gaps? For example, performance on an RTX 4070 Ti (16 GB), or running entirely on the iGPU/CPU?

I think this may be due to a lack of optimization for multi-batch workloads; it has been recorded in https://github.com/ggerganov/llama.cpp/discussions/5277, please stay tuned!

I think this has been improved with https://github.com/ggerganov/llama.cpp/pull/6217, please give it a try.

Not sure whether this is expected: GPU MAX 1100 passed all tests, but on the MTL iGPU (f16 off, Win11, Intel(R) oneAPI DPC++/C++ Compiler 2024.0.2 (2024.0.2.20231213)): ```powershell C:\Users\gta\Documents\llama.cpp\build>bin\test-backend-ops.exe test -b SYCL0 -o SOFT_MAX...

> > Not sure whether it is expected
> > GPU MAX1100 passed all
> > but on MTL iGPU, f16 off, win11, Intel(R) oneAPI DPC++/C++ Compiler 2024.0.2 (2024.0.2.20231213)
> ...

Have you tried ```GGML_SYCL_DEVICE=3```? This is weird, because the dGPU usually appears as the first device, but in your case they are 3 and 5. Can you try the following...

https://github.com/ggerganov/llama.cpp/pull/5624 @aahouzi I think the "hanging" issue has been solved by the PR above; did you use this commit?

> @NeoZhangJianyu I saw you created a revert PR; is #5901 merged, or is there no change yet?

It was merged by mistake. The author will re-implement it with...