Yiakwy
> --use-legacy-models - why is this option passed? The latest updates use m-core models by default. For the llama2 benchmark test, there is no need to switch to the m-core model and new...
> and I found that using "TOKENIZER_MODEL=meta-llama/Llama-2-7b-hf" in the shell script can convert HF to Megatron successfully. Hi @carlove, **/workspace/models** is the standard location where I keep models in the docker image, you...
@exnx sorry, I don't understand. FP8 has independent groups to keep the reduce accurate. Could it be a problem with your FP8 group and pipeline group settings?
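A minimal sketch of what I mean by keeping the FP8 amax-reduction group separate from the pipeline group (my own illustration, assuming Transformer Engine's `fp8_autocast` and a hypothetical rank layout, not the setup from this thread):

```python
# Sketch: make the FP8 amax-reduction group explicit so it does not silently
# coincide with the pipeline-parallel group. The rank layout below is a placeholder.
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Hypothetical layout: only ranks that hold replicas of the same layer reduce
# amax together; pipeline stages must NOT be mixed into this group.
amax_group = dist.new_group(ranks=list(range(dist.get_world_size())))  # replace with your DP ranks

fp8_recipe = DelayedScaling(margin=0, fp8_format=Format.HYBRID)

layer = te.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe, fp8_group=amax_group):
    y = layer(x)
```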
> Yeah, the kernels are CUDA only (and they don't work with ROCm for now). It'd be exciting if this PR could be merged with the proper dequant kernels, so...
Hi @whchung, do we have a profiling comparison? I am really interested in the choice of the "BLOCK_SIZE_N" parameter between 16 and 64. In the last year we have a paper fully studying...
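To make the question concrete, here is a toy Triton sketch (illustrative only, not the kernel under discussion) that lets the autotuner pick `BLOCK_SIZE_N` from {16, 64} and reports which one wins on the current GPU:

```python
# Toy kernel: autotune BLOCK_SIZE_N between 16 and 64 and print the winner.
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE_N": 16}, num_warps=4),
        triton.Config({"BLOCK_SIZE_N": 64}, num_warps=4),
    ],
    key=["n_elements"],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE_N: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)

x = torch.randn(1 << 20, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE_N"]),)
scale_kernel[grid](x, out, x.numel())
print(scale_kernel.best_config)  # which BLOCK_SIZE_N the autotuner chose
```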
> @mgoin @robertgshaw2-neuralmagic additionally we plan to support https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 soon (we already support https://huggingface.co/amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm but not from this PR), when FBGEMM-FP8 (dynamic per-token activations and per-channel weights) support is ready,...
> @yiakwy-xpu-ml-framework-team ROCm 6.2 supports FP8 natively via hipBLASLt, Triton and CK (not brought into vLLM use yet). The current MI300 FP8 format is somewhat different from the OCP format, we introduced max....
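A quick illustration of why the MI300 FP8 format needs its own max value (my own example, not from the PR): PyTorch exposes both the OCP E4M3 dtype and the fnuz variant used on MI300, and their representable ranges differ.

```python
# Compare the OCP E4M3 format with the fnuz variant used on MI300.
import torch

ocp = torch.finfo(torch.float8_e4m3fn)     # OCP E4M3: max = 448
fnuz = torch.finfo(torch.float8_e4m3fnuz)  # MI300 fnuz variant: max = 240, no negative zero

print("OCP  e4m3 max:", ocp.max)
print("fnuz e4m3 max:", fnuz.max)
```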
Great to hear this! @juney-nvidia, do we have a plan to set up EP partition analytic models? It is generally believed that EP should be evenly distributed to each node...
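As a back-of-the-envelope sketch (my own assumption of what such an analytic model could start from, not an established method): spread `num_experts` evenly over nodes and report the per-node load imbalance, which is the quantity an even EP layout tries to drive to 1.0.

```python
# Sketch: even expert-parallel partition and its load-imbalance ratio.
def ep_partition(num_experts: int, num_nodes: int):
    base, rem = divmod(num_experts, num_nodes)
    experts_per_node = [base + (1 if i < rem else 0) for i in range(num_nodes)]
    imbalance = max(experts_per_node) / min(experts_per_node)
    return experts_per_node, imbalance

print(ep_partition(256, 8))   # perfectly even split, imbalance = 1.0
print(ep_partition(256, 12))  # uneven split, imbalance > 1.0
```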
Note __frcp_rn is not supported in ROCm 6.2. Many customer codebases involve PTX inline asm. We need a table to show how AMD asm differs from PTX and...
## Partition scheme > Currently ONNX doesn't have a way of encoding how a model can be parallelized across multiple devices. Yes, I think parallelizing a model across devices includes two...