RuntimeError: Storage size calculation overflowed with sizes=[1, 4623015400198258675]
System Info
- `Accelerate` version: 0.31.0
- Platform: Linux-3.10.0-1160.83.1.0.1.el7.x86_64-x86_64-with-glibc2.17
- `accelerate` bash location: /data/......./venv/bin/accelerate
- Python version: 3.11.5
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 2015.46 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- debug: True
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
I'm running Accelerate with TRL to train Llama 3 70B. It breaks with the above exception when `PPOTrainer.step()` somehow ends up with a huge max size. Any hints as to what is wrong? Thanks
Main code:
```python
accelerator = Accelerator(
    kwargs_handlers=[
        InitProcessGroupKwargs(timeout=timedelta(minutes=30), backend="nccl")
    ]
)
torch.cuda.empty_cache()
device = accelerator.device

# Load model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    context.model_name,
    peft_config=lora_config,
    attn_implementation="sdpa",
)

# Tokenizer:
tokenizer = AutoTokenizer.from_pretrained(context.model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Refs: llama-recipes: src/llama_recipes/finetuning.py
# If there is a mismatch between tokenizer vocab size and embedding matrix,
# throw a warning and then expand the embedding matrix.
if len(tokenizer) > model.pretrained_model.get_input_embeddings().weight.shape[0]:
    print(
        "WARNING: Resizing the embedding matrix to match the tokenizer vocab size."
    )
    model.pretrained_model.resize_token_embeddings(len(tokenizer))

context.generation_kwargs |= dict(
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model.eval()

with accelerator.main_process_first():
    steg_ds = build_dataset(context, tokenizer=tokenizer)
accelerator.wait_for_everyone()

ppo_trainer = PPOTrainer(
    context.ppo_config,
    model,
    ref_model=None,
    tokenizer=tokenizer,
    dataset=steg_ds["train"],
    data_collator=collator,
)
tokenizer = ppo_trainer.tokenizer  # type: ignore  # It is changed by PPOTrainer.
model = ppo_trainer.model  # type: ignore
accelerator.wait_for_everyone()

output_length_sampler = LengthSampler(
    context.output_min_length, context.output_max_length
)
dataloader: torch.utils.data.DataLoader = ppo_trainer.dataloader  # type: ignore

for epoch in tqdm(range(context.epoch_num)):
    for batch in dataloader:
        question_tensors = batch["input_ids"]
        batch["response"] = []
        response_tensors = []
        for input_ids in question_tensors:  # TODO: make it batched.
            response_ids = ppo_trainer.model.generate(
                input_ids.unsqueeze(0),  # Add batch dimension.
                max_new_tokens=context.output_max_length,
                **context.generation_kwargs,
            )
            # Take only the response:
            response_ids = response_ids[..., input_ids.shape[-1] :]
            decoded = tokenizer.batch_decode(response_ids)
            batch["response"].append(decoded[0])
            response_tensors.append(response_ids[0])

        # Compute reward score:
        rewards, caught_num, decoded_num, success_num = reward_batch(
            batch, ppo_trainer, tokenizer, device, context
        )

        # Run PPO step.
        # Log shapes of question_tensors and response_tensors:
        for inx, q, response, reward in zip(
            range(len(question_tensors)),
            question_tensors,
            response_tensors,
            rewards,
        ):
            pass  # Shape logging omitted in this excerpt.

        stats = ppo_trainer.step(question_tensors, response_tensors, rewards)  # type: ignore
        b_len = len(batch["response"])
        stats["train/decoder_rate"] = decoded_num / b_len
        stats["train/caught_rate"] = caught_num / b_len
        stats["train/success_rate"] = success_num / b_len
        stats["dl/epoch"] = epoch
        ppo_trainer.log_stats(  # type: ignore
            stats,
            batch,
            rewards,
            columns_to_log=["bit", "query", "response"],
        )
```
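Regarding the `# TODO: make it batched` above: the batched variant would look roughly like the snippet below. This is only a sketch that reuses the names from the code above and assumes the installed TRL version's `PPOTrainer.generate` accepts a list of query tensors together with `return_prompt` and `length_sampler`, as in the TRL sentiment example; I have not verified that it changes anything here. (Note that `output_length_sampler` is built above but never used in the per-sample loop, which passes a fixed `max_new_tokens` instead.)
```python
# Hedged sketch: batched generation through TRL's PPOTrainer.generate,
# assuming the installed TRL version supports these arguments.
response_tensors = ppo_trainer.generate(
    question_tensors,                      # list of 1-D query tensors for the whole batch
    return_prompt=False,                   # keep only the generated continuation
    length_sampler=output_length_sampler,  # sample a max length per query
    **context.generation_kwargs,
)
batch["response"] = tokenizer.batch_decode(response_tensors)
```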
Accelerate config:
```yaml
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
gradient_accumulation_steps: 1
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: true
zero3_save_16bit_model: true
zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
The bug:
> 2024-06-18 08:27:17,705::40022__main__:DEBUG Before PPO step
> [rank0]:[E618 08:57:17.122997522 ProcessGroupNCCL.cpp:572] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82098, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800028 milliseconds before timing out.
> [rank0]:[E618 08:57:17.124263345 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 82098, last enqueued NCCL work: 82098, last completed NCCL work: 82097.
> [rank3]:[E618 08:57:17.135739202 ProcessGroupNCCL.cpp:572] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82098, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800040 milliseconds before timing out.
> [rank3]:[E618 08:57:17.136141133 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 82098, last enqueued NCCL work: 82098, last completed NCCL work: 82097.
> [rank2]:[E618 08:57:17.145912074 ProcessGroupNCCL.cpp:572] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82098, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800050 milliseconds before timing out.
> [rank2]:[E618 08:57:17.146319125 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 82098, last enqueued NCCL work: 82098, last completed NCCL work: 82097.
> [rank2]:[E618 08:57:17.513319285 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 82098, last enqueued NCCL work: 82098, last completed NCCL work: 82097.
> [rank2]:[E618 08:57:17.513661404 ProcessGroupNCCL.cpp:586] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
> [rank2]:[E618 08:57:17.513921151 ProcessGroupNCCL.cpp:592] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
> [rank2]:[E618 08:57:17.515218876 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82098, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800050 milliseconds before timing out.
> Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
> frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc8e0788de6 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
> frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc8e1a2f8f2 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
> frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc8e1a35f67 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
> frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc8e1a37d6c in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
> frame #4: <unknown function> + 0xdbbf4 (0x7fc92e6dfbf4 in /data/artyom_karpov/miniconda3/bin/../lib/libstdc++.so.6)
> frame #5: <unknown function> + 0x7ea5 (0x7fc9368bbea5 in /lib64/libpthread.so.0)
> frame #6: clone + 0x6d (0x7fc935edbb2d in /lib64/libc.so.6)
>
> [rank1]:[E618 08:57:17.650442757 ProcessGroupNCCL.cpp:572] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82098, OpType=_ALLGATHER_BASE, NumelIn=2, NumelOut=8, Timeout(ms)=1800000) ran for 1800085 milliseconds before timing out.
> [rank1]:[E618 08:57:17.650866648 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 82098, last enqueued NCCL work: 82098, last completed NCCL work: 82097.
> ^M 0%| | 0/200 [31:11<?, ?it/s]
> [rank1]: Traceback (most recent call last):
> [rank1]: File "/data/artyom_karpov/rl4steg/train.py", line 518, in <module>
> [rank1]: main(context)
> [rank1]: File "/data/artyom_karpov/rl4steg/train.py", line 264, in main
> [rank1]: stats = ppo_trainer.step(question_tensors, response_tensors, rewards) # type: ignore
> [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]: File "/data/artyom_karpov/miniconda3/lib/python3.11/contextlib.py", line 81, in inner
> [rank1]: return func(*args, **kwds)
> [rank1]: ^^^^^^^^^^^^^^^^^^^
> [rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 712, in step
> [rank1]: model_inputs["input_ids"] = self.accelerator.pad_across_processes(
> [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 2473, in pad_across_processes
> [rank1]: return pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
> [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 414, in wrapper
> [rank1]: return function(*args, **kwargs)
> [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 681, in pad_across_processes
> [rank1]: return recursively_apply(
> [rank1]: ^^^^^^^^^^^^^^^^^^
> [rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
> [rank1]: return func(data, *args, **kwargs)
> [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 671, in _pad_across_processes
> [rank1]: new_tensor = tensor.new_zeros(tuple(new_size)) + pad_index
> [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]: RuntimeError: Storage size calculation overflowed with sizes=[1, 4623015400198258675]
Expected behavior
It performs `PPOTrainer.step()` successfully. Alternatively, it fails with a CUDA OOM:
```
[rank1]: Traceback (most recent call last):
[rank1]: File "/data/artyom_karpov/rl4steg/train.py", line 530, in <module>
[rank1]: main(context)
[rank1]: File "/data/artyom_karpov/rl4steg/train.py", line 276, in main
[rank1]: stats = ppo_trainer.step(question_tensors, response_tensors, rewards) # type: ignore
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/miniconda3/lib/python3.11/contextlib.py", line 81, in inner
[rank1]: return func(*args, **kwds)
[rank1]: ^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 712, in step
[rank1]: model_inputs["input_ids"] = self.accelerator.pad_across_processes(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 2482, in pad_across_processes
[rank1]: return pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 414, in wrapper
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 682, in pad_across_processes
[rank1]: return recursively_apply(
[rank1]: ^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
[rank1]: return func(data, *args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 662, in _pad_across_processes
[rank1]: sizes = gather(size).cpu()
[rank1]: ^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 390, in wrapper
[rank1]: output = gather_object([shapes])
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 465, in gather_object
[rank1]: return _gpu_gather_object(object)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 446, in _gpu_gather_object
[rank1]: torch.distributed.all_gather_object(output_objects, object)
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2439, in all_gather_object
[rank1]: input_tensor.resize_(max_object_size)
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.
```
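For context, both tracebacks end in `_pad_across_processes` in `accelerate/utils/operations.py`. As far as I can tell from the stack frames, it roughly does the following (simplified sketch, not the actual Accelerate implementation): every rank all-gathers its tensor shape, takes the per-dimension maximum, and allocates a padded tensor of that size, so a single bogus gathered size is enough to cause either the storage-size overflow or the 1EB allocation. In the first log, rank 1's timed-out collective is tiny (NumelIn=2, NumelOut=8, consistent with gathering a 2-element shape tensor across 4 ranks), while ranks 0, 2 and 3 are waiting in a much larger all-gather, so the ranks seem to have drifted into different collectives.
```python
# Simplified sketch (reconstructed from the traceback, not the real Accelerate code)
# of what pad_across_processes does for a single tensor.
import torch
import torch.distributed as dist


def pad_across_processes_sketch(tensor: torch.Tensor, dim: int = 0, pad_index: int = 0) -> torch.Tensor:
    # 1. Every rank contributes its shape via an all-gather.
    size = torch.tensor(tensor.shape, device=tensor.device)
    sizes = [torch.empty_like(size) for _ in range(dist.get_world_size())]
    dist.all_gather(sizes, size)

    # 2. The padded length is the maximum size along `dim` over all ranks.
    max_size = max(int(s[dim]) for s in sizes)
    if max_size == tensor.shape[dim]:
        return tensor

    # 3. Allocate a tensor of the padded shape and copy the local data in.
    #    If one rank's gathered size is garbage (e.g. 4623015400198258675),
    #    this allocation is where "Storage size calculation overflowed"
    #    or the >1EB allocation comes from.
    new_size = list(tensor.shape)
    new_size[dim] = max_size
    new_tensor = tensor.new_zeros(tuple(new_size)) + pad_index
    new_tensor.narrow(dim, 0, tensor.shape[dim]).copy_(tensor)
    return new_tensor
```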
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Did you ever get an answer? I have the same problem.
@Dai0-2 No, I didn't find a solution. I think I should have tried supplying padded tensors there, i.e. tensors of the same size: pad the queries on the left so every query ends at the right edge, then concatenate the responses after them. That might avoid the cross-device padding that triggers this exception.
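Something along these lines is what I had in mind (untested sketch reusing the names from the reproduction above; `MAX_QUERY_LEN` is a hypothetical fixed limit that all ranks agree on, since padding only to each rank's local maximum would still leave `pad_across_processes` with cross-rank work to do):
```python
# Untested sketch of the workaround: left-pad every query tensor to one fixed
# length shared by all ranks before calling ppo_trainer.step(), so the real
# tokens sit at the right edge and cross-process padding becomes a no-op.
import torch.nn.functional as F

MAX_QUERY_LEN = 512  # hypothetical fixed limit; must be >= every query length

padded_queries = [
    F.pad(q, (MAX_QUERY_LEN - q.shape[-1], 0), value=tokenizer.pad_token_id)
    for q in question_tensors
]
stats = ppo_trainer.step(padded_queries, response_tensors, rewards)
```
The responses would presumably need the same treatment (padded on the right to a fixed length), since `PPOTrainer.step()` also pads the concatenated model inputs across processes.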
I ran into the same error when using Accelerate to evaluate logits.
I ran into the same error when using the accelerator to evaluate logits; I am using RewardTrainer.
Same error when using Seq2SeqTrainer.
This should be reopened
same
same error