Failed to quantize a model using main.py
Log
(aqlm) root@f9f90a551b02:~/xinglin-data/AQLM# bash train.sh
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.18.2
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
============ Load model... ============
Loading checkpoint shards: 100%|██████████████████████████████| 17/17 [00:01<00:00, 11.35it/s]
Loading pretrained model ...
Model loaded successfully ...
============ Quantizing model... ============
Loading data ...
/root/xinglin-data/AQLM/src/datautils.py:219: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
data = torch.load(name)[:nsamples]
Loaded data from /root/xinglin-data/AQLM/train.pt; len(data)=1024 sequences
Starting AQ quantization ...
catching layer inputs from data
train.sh: line 23: 28722 Killed                 python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 --val_size=0 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=32 --relative_mse_tolerance=0.01 --finetune_batch_size=32 --finetune_max_epochs=10 --finetune_early_stop=3 --finetune_keep_best --local_batch_size=1 --offload_activations --wandb --resume --save $SAVE_PATH
train.sh
export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3
export MODEL_PATH=/root/xinglin-data/model/Qwen/Qwen2.5-32B-Instruct
export DATASET_PATH=/root/xinglin-data/AQLM/train.pt
export SAVE_PATH=/root/xinglin-data/Qwen2
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME
python main.py $MODEL_PATH $DATASET_PATH \
    --nsamples=1024 \
    --val_size=0 \
    --num_codebooks=1 \
    --nbits_per_codebook=16 \
    --in_group_size=32 \
    --relative_mse_tolerance=0.01 \
    --finetune_batch_size=32 \
    --finetune_max_epochs=10 \
    --finetune_early_stop=3 \
    --finetune_keep_best \
    --local_batch_size=1 \
    --offload_activations \
    --wandb \
    --resume \
    --save $SAVE_PATH
Hello, @Jun-Howie!
Most likely you did not have enough RAM.
You are using nsamples=1024, the Qwen2.5-32B-Instruct model, and the --offload_activations flag. With --offload_activations, the inps and outs tensors (each of size [1024, 4096, 5120]) are stored in RAM. Here 1024 is your nsamples value, 4096 is the default model_seqlen, and 5120 is the hidden_size of Qwen2.5-32B-Instruct.
Let's calculate how much RAM you will need for inps and outs. In your case, both tensors use dtype bfloat16, i.e. 2 bytes per element. Hence,
- inps: 1024 * 4096 * 5120 * 2 / 1024 / 1024 = 40960 MB,
- outs: 1024 * 4096 * 5120 * 2 / 1024 / 1024 = 40960 MB.
In total that is 40960 + 40960 = 81920 MB. Because you use the --offload_activations flag, all of this memory is allocated in RAM.
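As a quick sanity check, here is a minimal Python sketch (not part of the AQLM code base; the helper name is made up) that reproduces this arithmetic for arbitrary settings:

# Minimal sketch: RAM taken by one activation buffer of shape
# [nsamples, model_seqlen, hidden_size] stored as bfloat16 (2 bytes per element).
def activation_buffer_mb(nsamples, model_seqlen, hidden_size, bytes_per_element=2):
    return nsamples * model_seqlen * hidden_size * bytes_per_element / 1024 / 1024

inps_mb = activation_buffer_mb(1024, 4096, 5120)  # Qwen2.5-32B-Instruct, nsamples=1024
outs_mb = activation_buffer_mb(1024, 4096, 5120)
print(inps_mb, outs_mb, inps_mb + outs_mb)        # 40960.0 40960.0 81920.0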
You can work around this problem by using a smaller nsamples value (for example, nsamples=512) or by using multiple GPU devices without the --offload_activations flag.
Thank you! Using nsamples=512 works for me.
I have run into a new problem while quantizing Qwen2.5-7B on an NVIDIA 4090.
(aqlm) root@Harvey-4090:/home/harvey/PM/AQLM# bash train.sh
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.18.3
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
============ Load model... ============
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 19.97it/s]
Loading pretrained model ...
Model loaded successfully ...
============ Quantizing model... ============
Loading data ...
/home/harvey/PM/AQLM/src/datautils.py:219: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
data = torch.load(name)[:nsamples]
Loaded data from /home/harvey/PM/data/4096.pth; len(data)=256 sequences
Starting AQ quantization ...
catching layer inputs from data
Traceback (most recent call last):
File "/home/harvey/PM/AQLM/main.py", line 918, in TORCH_USE_CUDA_DSA to enable device-side assertions.
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/harvey/PM/AQLM/wandb/offline-run-20241009_213235-pykhgwzr
wandb: Find logs at: wandb/offline-run-20241009_213235-pykhgwzr/logs
(aqlm) root@Harvey-4090:/home/harvey/PM/AQLM#
train.sh
export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3
export MODEL_PATH=/home/harvey/PM/modelscope/hub/qwen/Qwen2.5-7B-Instruct
export DATASET_PATH=/home/harvey/PM/data/4096.pth
export SAVE_PATH=/home/harvey/PM/
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME
python main.py $MODEL_PATH $DATASET_PATH \
    --nsamples=256 \
    --val_size=32 \
    --num_codebooks=1 \
    --nbits_per_codebook=16 \
    --in_group_size=8 \
    --relative_mse_tolerance=0.01 \
    --finetune_batch_size=32 \
    --finetune_max_epochs=10 \
    --finetune_early_stop=3 \
    --finetune_keep_best \
    --local_batch_size=1 \
    --offload_activations \
    --wandb \
    --resume \
    --save $SAVE_PATH
Hi, @Jun-Howie !
This looks very suspicious. We need more information to understand what the problem is. Could you send a screenshot of nvidia-smi or nvtop taken while you run bash train.sh again?
The nvtop command reports "No GPU to monitor". I have a feeling that WSL2 is not recognizing the GPU correctly, even though nvitop and nvidia-smi can read the GPU status fine. I may need to fix my system environment configuration first. Thanks for your help!
Hi, @Jun-Howie! I see what the problem is.
You're using WSL. You may find these sources helpful:
- the official documentation on WSL limitations
- Does your WSL Linux allow you to pin more than 2GB of memory?
Line 106 of main.py creates an inps tensor of size [256, 4096, 3584] (256 is the nsamples=256 value you specified, 4096 is the model_seqlen value, and 3584 is the hidden_size of Qwen2.5-7B) with dtype bfloat16 and pin_memory=True, because you used the --offload_activations flag. In RAM, this tensor takes 256 * 4096 * 3584 * 2 / 1024 / 1024 = 7168 MB. However, pin_memory=True means the buffer must be allocated as page-locked (pinned) host memory through the CUDA driver, and, as the links above discuss, WSL limits how much memory can be pinned.
You can run a simple test in the same environment, for example:
import torch

# Try to pin the same-sized bfloat16 buffer that main.py would create;
# if WSL caps pinned memory, this allocation should fail.
x = torch.zeros(256, 4096, 3584, dtype=torch.bfloat16, pin_memory=True)
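If it is more convenient, here is a slightly expanded sketch of the same check (the helper name is made up, and the assumption that a pinning failure surfaces as a RuntimeError may vary by setup) that reports the outcome instead of crashing:

import torch

# Hypothetical helper: attempt to pin a buffer of the size discussed above
# and report whether the allocation succeeded.
def can_pin(shape=(256, 4096, 3584), dtype=torch.bfloat16):
    try:
        t = torch.zeros(*shape, dtype=dtype, pin_memory=True)
        print(f"Pinned {t.numel() * t.element_size() / 1024 / 1024:.0f} MB successfully")
        return True
    except RuntimeError as err:  # assumed error type for pinned-memory failures
        print(f"Pinning failed: {err}")
        return False

can_pin()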
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.