Llama3-8b FP8 PTQ OOM
Describe the bug
Running FP8 PTQ of Llama3-8b on a single RTX 4090 (24 GB) goes OOM. Is this expected? vLLM FP8 quantization works on the same GPU. What are the minimum requirements to run this quantization?
I have even tried setting the batch size to 1 and it still goes OOM.
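For context, a rough back-of-envelope estimate of my own (assuming ~8B parameters and bf16 weights, not counting activations or quantizer state) suggests the weights alone already take most of the 24 GB:

```python
# Rough estimate (assumption: ~8.03B parameters, bf16 weights only,
# no activation/calibration overhead counted) of the memory footprint.
param_count = 8.03e9          # approximate Llama3-8B parameter count
bytes_per_param_bf16 = 2      # bf16 = 2 bytes per parameter

weights_gib = param_count * bytes_per_param_bf16 / 1024**3
print(f"bf16 weights alone: ~{weights_gib:.1f} GiB")          # ~15.0 GiB

gpu_gib = 24
print(f"Headroom left on a 24 GB 4090: ~{gpu_gib - weights_gib:.1f} GiB")
```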
Steps/Code to reproduce bug
python scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --output_path ./llama3_8b_instruct.nemo --precision bf16
python examples/nlp/language_modeling/megatron_gpt_ptq.py model.restore_from_path=llama3_8b_instruct.nemo quantization.algorithm=fp8 export.decoder_type=llama export.save_path=llama3_8b_instruct_fp8 export.inference_tensor_parallel=1 trainer.num_nodes=1 trainer.devices=1
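A minimal PyTorch-only diagnostic along these lines could help show whether the model restore alone already fills the GPU or whether the OOM happens during calibration (this is just a sketch using standard torch.cuda memory APIs, not part of the NeMo script; the suggested call sites are hypothetical):

```python
import torch

# Minimal diagnostic sketch (assumption: single-GPU run on CUDA device 0)
# to log allocated/reserved/peak memory at interesting points.
def log_gpu_mem(tag: str) -> None:
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    peak = torch.cuda.max_memory_allocated(0) / 1024**3
    print(f"[{tag}] allocated={allocated:.1f} GiB "
          f"reserved={reserved:.1f} GiB peak={peak:.1f} GiB")

# Hypothetical placement inside megatron_gpt_ptq.py:
#   log_gpu_mem("after model restore")
#   log_gpu_mem("after calibration forward passes")
```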
Environment overview (please complete the following information)
- Environment location: Docker
- Method of NeMo install: source
- If method of install is [Docker], provide `docker pull` & `docker run` commands used:
docker run --gpus all -it --rm -v ./NeMo:/NeMo --shm-size=8g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:23.10-py3