Frank Mai
It seems SGLang does not support gfx1101; see https://github.com/ROCm/aiter/blob/17a25514a1c1294b193eef984089c780e0bf53cf/aiter/jit/utils/chip_info.py#L11-L21 and https://github.com/ROCm/aiter/blob/17a25514a1c1294b193eef984089c780e0bf53cf/csrc/cpp_itfs/utils.py#L117-L123. Even after configuring `HSA_OVERRIDE_GFX_VERSION=11.0.0` to mock gfx1100, I still encounter the error below.
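As a quick sanity check, here is a minimal sketch, assuming a ROCm build of PyTorch; the supported set below is an illustrative subset, see the linked `chip_info.py` for the authoritative list:

```python
import torch  # assumes a ROCm build of PyTorch

# Illustrative subset of architectures aiter recognizes; gfx1101 is absent.
SUPPORTED = {"gfx942", "gfx1100"}

# gcnArchName looks like "gfx1101:sramecc+:xnack-"; keep only the arch part.
arch = torch.cuda.get_device_properties(0).gcnArchName.split(":")[0]
print(f"detected {arch}, supported by aiter: {arch in SUPPORTED}")
```

Note that `HSA_OVERRIDE_GFX_VERSION` only changes what the runtime reports; code paths selected for gfx1100 may still misbehave on gfx1101 hardware, which may be why the override is not enough here.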
@travelyoga You should also provide the spec of the running vLLM container via `docker inspect`.
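For reference, something like the following pulls out the GPU-related parts of the container spec (the container name is hypothetical):

```python
import json
import subprocess

CONTAINER = "vllm"  # hypothetical; substitute the actual container name or ID

# `docker inspect` emits a JSON array with one object per container.
spec = json.loads(subprocess.check_output(["docker", "inspect", CONTAINER]))[0]

# The GPU device requests and environment are usually the relevant parts.
print(json.dumps(spec["HostConfig"].get("DeviceRequests"), indent=2))
print([e for e in spec["Config"]["Env"] if e.startswith(("NVIDIA", "CUDA"))])
```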
The running container has already been injected with the correct visible devices. The first RTX A5000 device has been recognized and has loaded the model weights. The root cause is as follows....
I have tested this on [Qwen.ai](https://chat.qwen.ai/) twice, and I found that Qwen's output is stable: the response matches the screenshot in the issue. Recognition processing is usually affected by sampler settings,...
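As an illustration, here is a sketch of pinning the sampler against an OpenAI-compatible endpoint (the base URL and model name are assumptions) to check whether the variance comes from sampling:

```python
from openai import OpenAI

# Hypothetical endpoint and model name; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")

# temperature=0 plus a fixed seed makes generation (near-)deterministic,
# so any remaining variance is not coming from the sampler.
resp = client.chat.completions.create(
    model="qwen2.5-vl-7b-instruct",
    messages=[{"role": "user", "content": "Describe the image."}],
    temperature=0.0,
    top_p=1.0,
    seed=42,
)
print(resp.choices[0].message.content)
```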
~~Fixed by https://github.com/gpustack/runtime/commit/ab869fffc6ae49df3138d3662280b461e828d194; this should be included in a later release.~~ After comparing some information, we found that `910B` is not the Ascend 910B series; it should belong to the Ascend...
I was very confused about why Ollama doesn't use the OCI standard to store its models, so I created an alternative to find more answers: https://github.com/gpustack/gguf-packer-go
@lamhktommy can you test this with v0.0.122?
> [@lamhktommy](https://github.com/lamhktommy) can you test this with v0.0.122?

In a further test, v0.0.122 (built with Ascend 8.0.rc2.alpha003) still crashes with a large context size. We have released v0.0.123 (built with...
The Q8_0 `mul_mat` implementation is limited at present; let's move this out of v0.6.0 and figure it out later.
As a workaround, we suggest using FP16 instead.
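For example, here is a sketch of producing an FP16 GGUF with llama.cpp's converter instead of quantizing to Q8_0 (paths are hypothetical):

```python
import subprocess

# Hypothetical paths; assumes llama.cpp's convert_hf_to_gguf.py is on hand.
subprocess.check_call([
    "python", "convert_hf_to_gguf.py",
    "--outtype", "f16",            # FP16 sidesteps the Q8_0 mul_mat limitation
    "--outfile", "model-f16.gguf",
    "path/to/hf-model",
])
```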