Wuwei Lin
The per-tensor quantization that was added recently is for fp8. So far we have tested it on Mixtral and Llama, and more work, such as calibration scale, is in progress.
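For reference, a rough sketch of what per-tensor fp8 quantization with a calibration scale means, assuming a symmetric abs-max scheme and the e4m3 range; the helper names are hypothetical and this is not the actual implementation:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in fp8 e4m3

def per_tensor_scale(w: np.ndarray) -> float:
    # One scale for the whole tensor, mapping the largest magnitude
    # onto the fp8 range. A calibration pass would refine this from
    # observed activation statistics rather than plain abs-max.
    return float(np.abs(w).max()) / FP8_E4M3_MAX

def quantize_per_tensor(w: np.ndarray) -> tuple[np.ndarray, float]:
    scale = per_tensor_scale(w)
    q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # The cast to an actual fp8 storage dtype is omitted here;
    # dequantize with q * scale.
    return q, scale

w = np.random.randn(4096, 4096).astype("float32")
q, scale = quantize_per_tensor(w)
```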
It’s supported. Are you requesting a precompiled package?
Do we need to update the cuBLAS codegen or runtime to support the cast?
If you call a pass directly (instead of using `Sequential`), it will bypass the checks for `opt_level`, `required_pass`, etc.
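A minimal sketch of the difference, using a trivial TIR module and `Simplify` purely for illustration:

```python
import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class Mod:
    @T.prim_func
    def main(A: T.Buffer((16,), "float32")):
        for i in range(16):
            A[i] = A[i] * T.float32(2)

# Direct call: the pass runs unconditionally, bypassing the
# opt_level / required_pass checks that Sequential performs.
mod_direct = tvm.tir.transform.Simplify()(Mod)

# Through Sequential: the PassContext is consulted, so a pass whose
# registered opt_level exceeds the context's can be skipped.
seq = tvm.transform.Sequential([tvm.tir.transform.Simplify()])
with tvm.transform.PassContext(opt_level=0):
    mod_seq = seq(Mod)
```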
I’ll send a new PR, but this might be a real issue. In the past I have already rerun CI multiple times.
@Jiawei-Shao In this case, we can do `sch.vectorize(ax1)` to convert the loop to a vectorized one. https://github.com/apache/tvm/blob/main/src/target/spirv/spirv_utils.cc#L123 will rewrite a buffer with vectorized access to `int8x4` as long as both read...
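A minimal TensorIR sketch of that scheduling step (the buffer shapes here are made up for illustration):

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def copy(A: T.Buffer((128, 4), "int8"), B: T.Buffer((128, 4), "int8")):
    for i, j in T.grid(128, 4):
        with T.block("copy"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj]

sch = tvm.tir.Schedule(copy)
ax0, ax1 = sch.get_loops(sch.get_block("copy"))
sch.vectorize(ax1)  # the inner int8 loop of extent 4 becomes a vector access
print(sch.mod)
```

After this, the loads and stores in the inner loop are vectorized accesses that the linked SPIR-V utility can map to `int8x4`.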
We are still using C++17; the error is because C++20 is not enabled.
Did you check out the updated TVM submodule? You also need to recompile the model.
@MasterJH5574 It seems the submodule already contains the fix for the missing function for the TIR KV cache. Is anything missing?
The performance issue might be caused by https://github.com/apache/tvm/pull/17326, though it is not expected to change the original prefill behavior.