Llama3-8b FP8 PTQ OOM
Describe the bug
Running FP8 PTQ of Llama3-8b on a single RTX 4090 (24 GB) goes OOM. Is this expected? vLLM FP8 quantization works on the same GPU. What are the minimum requirements to run this quantization?
I have even tried setting the batch size to 1 and it still goes OOM.
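For context, a rough back-of-envelope estimate of my own (assuming ~8B parameters and bf16 weights, not counting activations or quantizer state) suggests the weights alone already take most of the 24 GB:

```python
# Rough estimate (assumption: ~8.03B parameters, bf16 weights only,
# no activation/calibration overhead counted) of the memory footprint.
param_count = 8.03e9          # approximate Llama3-8B parameter count
bytes_per_param_bf16 = 2      # bf16 = 2 bytes per parameter

weights_gib = param_count * bytes_per_param_bf16 / 1024**3
print(f"bf16 weights alone: ~{weights_gib:.1f} GiB")          # ~15.0 GiB

gpu_gib = 24
print(f"Headroom left on a 24 GB 4090: ~{gpu_gib - weights_gib:.1f} GiB")
```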
Steps/Code to reproduce bug
python scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --output_path ./llama3_8b_instruct.nemo --precision bf16
python examples/nlp/language_modeling/megatron_gpt_ptq.py model.restore_from_path=llama3_8b_instruct.nemo quantization.algorithm=fp8 export.decoder_type=llama export.save_path=llama3_8b_instruct_fp8 export.inference_tensor_parallel=1 trainer.num_nodes=1 trainer.devices=1
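A minimal PyTorch-only diagnostic along these lines could help show whether the model restore alone already fills the GPU or whether the OOM happens during calibration (this is just a sketch using standard torch.cuda memory APIs, not part of the NeMo script; the suggested call sites are hypothetical):

```python
import torch

# Minimal diagnostic sketch (assumption: single-GPU run on CUDA device 0)
# to log allocated/reserved/peak memory at interesting points.
def log_gpu_mem(tag: str) -> None:
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    peak = torch.cuda.max_memory_allocated(0) / 1024**3
    print(f"[{tag}] allocated={allocated:.1f} GiB "
          f"reserved={reserved:.1f} GiB peak={peak:.1f} GiB")

# Hypothetical placement inside megatron_gpt_ptq.py:
#   log_gpu_mem("after model restore")
#   log_gpu_mem("after calibration forward passes")
```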
Environment overview (please complete the following information)
- Environment location: Docker
- Method of NeMo install: source
- If method of install is [Docker], provide `docker pull` & `docker run` commands used:
docker run --gpus all -it --rm -v ./NeMo:/NeMo --shm-size=8g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:23.10-py3