Majidur Rahman
This is the script that I used:

```bash
#!/bin/bash

MASTER_ADDR=localhost
MASTER_PORT=${2-2012}
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=${3-16}

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

# model...
```
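(For context, the truncated part of the script would normally end in a launcher call of roughly the shape below. This is only a sketch, assuming a `torch.distributed.launch`-style launcher, which matches the underscore-style flags above; the entry point `pretrain_llama.py` and the model-parallel flag are placeholders, not taken from my script.)

```bash
# A minimal sketch, assuming a torch.distributed.launch-style launcher;
# "pretrain_llama.py" and "--model-parallel-size" are placeholders.
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_llama.py --model-parallel-size ${MP_SIZE:-4}
```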
Yes, I tried running in a clean environment, but I'm still getting memory errors like the following:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU
```
I tried using MP_SIZE = 4, but I'm still getting the "out of memory" error. Do you have any suggestions in this regard?
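(In case it helps narrow things down, the generic knobs I know of are sketched below. `PYTORCH_CUDA_ALLOC_CONF` is a standard PyTorch allocator setting; the micro-batch variable is just a placeholder for whatever batch-size flag this repo uses.)

```bash
# A sketch of common CUDA OOM mitigations, not specific to this repo.
# PYTORCH_CUDA_ALLOC_CONF can reduce allocator fragmentation;
# lowering the per-GPU micro-batch size is usually the first lever.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
MICRO_BATCH_SIZE=1   # placeholder name for the repo's batch-size setting
```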
Interestingly, I successfully ran the script for "LLaMA2-7B" with MP_SIZE = 4.
I am checking to see whether I am doing this correctly:

1. I have downloaded LLaMA2-13B from Hugging Face (https://huggingface.co/meta-llama/Llama-2-13b-hf); a download sketch follows after this list.
2. I have generated a weight configuration file named "llama2.json,"...
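(For step 1, a minimal download sketch, assuming the `huggingface_hub` CLI is installed and access to the gated meta-llama repo has been approved; the target directory is a placeholder.)

```bash
# A sketch for step 1, assuming the huggingface_hub CLI is installed
# (pip install -U "huggingface_hub[cli]") and gated-repo access is granted.
huggingface-cli login   # paste a read-scoped HF access token
huggingface-cli download meta-llama/Llama-2-13b-hf --local-dir ./llama2-13b-hf
```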