Tianlei Wu
I saw the absolute difference is not large:
```
Greatest absolute difference: 0.00011079013347625732 at index (0, 573) (up to 1e-05 allowed)
```
I suggest using an end-to-end metric (like precision...
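As a sketch of why a tight element-wise tolerance can flag a difference that an end-to-end metric would not, here is a small NumPy illustration. The shapes, noise level, and the top-1-agreement metric are made up for illustration, not taken from the issue:

```python
import numpy as np

# Synthetic stand-ins for logits from two implementations
# (e.g. a reference path vs. an optimized path).
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 1000)).astype(np.float32)
opt = ref + rng.normal(scale=1e-4, size=ref.shape).astype(np.float32)

# Element-wise check: small numeric noise exceeds a 1e-5 tolerance...
max_abs_diff = float(np.max(np.abs(ref - opt)))
print(f"Greatest absolute difference: {max_abs_diff}")

# ...while an end-to-end metric (here, top-1 agreement) is unaffected.
top1_match = float(np.mean(ref.argmax(axis=1) == opt.argmax(axis=1)))
print(f"Top-1 agreement: {top1_match}")
```

With noise around 1e-4, the element-wise check fails a 1e-5 tolerance while the predicted classes still agree.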
@TedThemistokleous, support for an 'optional' bias was added for the T5 model in https://github.com/microsoft/onnxruntime/pull/14928. It is supported by the CUDA provider. However, the CPU provider still requires the bias input.
It's clear in the operator spec. I think the CPU EP needs a slight modification to follow the operator spec, or the error message for the not-implemented feature should be updated to avoid confusing users. Let...
@TedThemistokleous, for LLMs, typical ONNX usage mixes 4-bit and 8-bit quantization. Most weights can be quantized to 4 bits; some layers need more bits, and we normally use 8 bits...
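A toy sketch of the per-layer bit-selection idea. Everything here is illustrative (the quantizer, the layer data, and the threshold are my own stand-ins, not ONNX Runtime's actual quantization logic): layers whose 4-bit error stays small keep 4 bits, outlier-heavy layers fall back to 8 bits.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Toy symmetric uniform quantization to `bits` bits, dequantized back."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def choose_bits(w, threshold=0.5):
    """Pick 4 bits when the 4-bit error is acceptable, else fall back to 8.
    Normalizing by the squared median magnitude keeps a few large outliers
    from hiding the error they cause in the bulk of the weights.
    The threshold is arbitrary; real tools measure end-to-end accuracy."""
    err4 = np.mean((w - quantize_symmetric(w, 4)) ** 2)
    rel = err4 / np.median(np.abs(w)) ** 2
    return 4 if rel < threshold else 8

rng = np.random.default_rng(0)
# Hypothetical layers: a well-behaved one and an outlier-heavy one (synthetic data).
plain = rng.normal(size=(256, 256))
outlier = rng.normal(size=(256, 256)) * np.where(rng.random((256, 256)) < 0.01, 50.0, 1.0)

print("plain:", choose_bits(plain), "bits")      # fits comfortably in 4 bits
print("outlier:", choose_bits(outlier), "bits")  # outliers push this layer to 8 bits
```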
We can add support for FP6 once ONNX adds it.

Feature | NVIDIA CUDA (12.9+) | AMD ROCm (7.0+)
-- | -- | --
FP6 Formats | e2m3, e3m2...
Even though this approach works when you build from source and run on the current machine, the binary might not be able to run on another GPU. If we want to...
> I do understand if this isn't accepted in the ORT codebase because of this, but maybe then we could work together on a better way to do it. It's...
@Numeri, I have no idea why LaunchTritonKernel is causing memory access errors. You can do some debugging, like starting with a tensor of one element, and adding some printf before...
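The "start with a one-element tensor" advice can be sketched as a harness that grows the input until the kernel crashes or diverges from a trusted reference; the first failing size narrows down the bug (often an indexing or boundary issue). The function names below are hypothetical placeholders, not the real LaunchTritonKernel invocation:

```python
import numpy as np

# Hypothetical placeholders: swap in the actual kernel launch and a trusted
# CPU reference implementation.
def kernel_under_test(x):
    return x * 2.0

def reference(x):
    return x * 2.0

# Grow the input from a single element; a crash or mismatch at size n
# but not n-1 localizes the problem.
results = {}
for n in [1, 2, 3, 64, 1024]:
    x = np.arange(n, dtype=np.float32)
    results[n] = bool(np.allclose(kernel_under_test(x), reference(x)))
    print(n, "ok" if results[n] else "MISMATCH")
```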