Yiakwy
> --use-legacy-models - why is this option passed? The latest updates use m-core models by default. For the llama2 benchmark test, there is no need to switch to the m-core model and new...
> and I found that using "TOKENIZER_MODEL=meta-llama/Llama-2-7b-hf" in the shell script can convert HF to Megatron successfully. Hi @carlove, **/workspace/models** is the standard location where I keep models in the docker image, you...
@exnx sorry, I don't understand. FP8 has independent groups to keep the reduce accurate. Could it be a problem with your FP8 group and pipeline group settings?
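A minimal sketch of what I mean by keeping the FP8 amax-reduction group separate from the pipeline group (my own illustration, assuming Transformer Engine's `fp8_autocast` and a hypothetical rank layout, not the setup from this thread):

```python
# Sketch: make the FP8 amax-reduction group explicit so it does not silently
# coincide with the pipeline-parallel group. The rank layout below is a placeholder.
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Hypothetical layout: only ranks that hold replicas of the same layer reduce
# amax together; pipeline stages must NOT be mixed into this group.
amax_group = dist.new_group(ranks=list(range(dist.get_world_size())))  # replace with your DP ranks

fp8_recipe = DelayedScaling(margin=0, fp8_format=Format.HYBRID)

layer = te.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe, fp8_group=amax_group):
    y = layer(x)
```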
> Yeah, the kernels are CUDA only (and they don't work with ROCm for now). It'd be exciting if this PR could be merged with the proper dequant kernels, so...
Hi @whchung, do we have a profiling comparison? I am really interested in the choice of the "BLOCK_SIZE_N" parameter between 16 and 64. In the last year we have a paper fully studying...
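To make the question concrete, here is a toy Triton sketch (illustrative only, not the kernel under discussion) that lets the autotuner pick `BLOCK_SIZE_N` from {16, 64} and reports which one wins on the current GPU:

```python
# Toy kernel: autotune BLOCK_SIZE_N between 16 and 64 and print the winner.
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE_N": 16}, num_warps=4),
        triton.Config({"BLOCK_SIZE_N": 64}, num_warps=4),
    ],
    key=["n_elements"],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE_N: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)

x = torch.randn(1 << 20, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE_N"]),)
scale_kernel[grid](x, out, x.numel())
print(scale_kernel.best_config)  # which BLOCK_SIZE_N the autotuner chose
```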
> @mgoin @robertgshaw2-neuralmagic additionally we plan to support https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 soon (we already support https://huggingface.co/amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm but not from this PR), when FBGEMM-FP8 (dynamic per-token activations and per-channel weights) support is ready,...
> @yiakwy-xpu-ml-framework-team ROCm 6.2 supports FP8 natively via hipBLASLt, Triton and CK (not brought into vLLM use yet). The current MI300 FP8 format is somewhat different from the OCP format, we introduced max....
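A quick illustration of why the MI300 FP8 format needs its own max value (my own example, not from the PR): PyTorch exposes both the OCP E4M3 dtype and the fnuz variant used on MI300, and their representable ranges differ.

```python
# Compare the OCP E4M3 format with the fnuz variant used on MI300.
import torch

ocp = torch.finfo(torch.float8_e4m3fn)     # OCP E4M3: max = 448
fnuz = torch.finfo(torch.float8_e4m3fnuz)  # MI300 fnuz variant: max = 240, no negative zero

print("OCP  e4m3 max:", ocp.max)
print("fnuz e4m3 max:", fnuz.max)
```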
Great to hear this! @juney-nvidia, do we have a plan to set up EP partition analytic models? It is generally believed that EP should be evenly distributed to each node...
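As a back-of-the-envelope sketch (my own assumption of what such an analytic model could start from, not an established method): spread `num_experts` evenly over nodes and report the per-node load imbalance, which is the quantity an even EP layout tries to drive to 1.0.

```python
# Sketch: even expert-parallel partition and its load-imbalance ratio.
def ep_partition(num_experts: int, num_nodes: int):
    base, rem = divmod(num_experts, num_nodes)
    experts_per_node = [base + (1 if i < rem else 0) for i in range(num_nodes)]
    imbalance = max(experts_per_node) / min(experts_per_node)
    return experts_per_node, imbalance

print(ep_partition(256, 8))   # perfectly even split, imbalance = 1.0
print(ep_partition(256, 12))  # uneven split, imbalance > 1.0
```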
Note __frcp_rn is not supported in ROCm 6.2. Many customer codebases involve PTX inline asm. We need a table to show how AMD asm differs from PTX and...
## Partition scheme > Currently ONNX doesn't have a way of encoding how a model can be parallelized across multiple devices. Yes, I think parallelizing a model across devices includes two...