ColossalAI
[BUG]: RuntimeError: CUDA error: an illegal memory access was encountered
🐛 Describe the bug
I ran BERT from Hugging Face with ZeRO, but got RuntimeError: CUDA error: an illegal memory access was encountered. The problem seems to be caused by initial_scale in config.py.
Traceback (most recent call last):
File "colossalai/run.py", line 463, in
Environment
No response
Could you share your code?
This usually occurs because of CUDA out-of-memory.
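One way to check the out-of-memory hypothesis is to log GPU memory around each training step and to make CUDA report errors synchronously, so the traceback points at the kernel that actually failed. The following is a minimal sketch, not something taken from this thread: report_cuda_memory is a hypothetical helper name, and it returns None when PyTorch or a CUDA device is unavailable.

```python
import os

# Assumption: this runs before CUDA is initialized in the process.
# It makes kernel launches synchronous, so errors surface at the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def report_cuda_memory(tag):
    """Print and return a one-line summary of current GPU memory use.

    Hypothetical helper; returns None if PyTorch or CUDA is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    mib = 2 ** 20
    allocated = torch.cuda.memory_allocated() / mib      # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / mib        # memory held by the caching allocator
    peak = torch.cuda.max_memory_allocated() / mib       # high-water mark of allocated memory
    line = (f"[{tag}] allocated={allocated:.1f} MiB "
            f"reserved={reserved:.1f} MiB peak={peak:.1f} MiB")
    print(line)
    return line
```

Calling report_cuda_memory(f"step {step}") once per iteration would show whether allocated memory grows steadily before the crash. The environment variable slows training, so it is only meant for debugging runs.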
@ver217 this is my code
def trainer(train_dataloader, args, val_dataloader=None):
    start_epoch = 0
    shard_strategy = TensorShardStrategy()
    with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy,
                         shard_param=True):
        config = BertConfig.from_pretrained(args.model_name_or_path, num_labels=200)
        model = BertForSequenceClassification.from_pretrained(args.model_name_or_path, config=config)
    optimizer = HybridAdam(model.parameters(), weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    # Start ColossalAI initialization
    engine, train_dataloader, val_dataloader, _ = colossalai.initialize(model,
                                                                         optimizer,
                                                                         criterion,
                                                                         train_dataloader,
                                                                         val_dataloader,
                                                                         )
    for epoch in range(start_epoch, args.num_epochs):
        epoch_loss = 0
        train_iter = tqdm(
            train_dataloader, desc=f'Epoch:{epoch + 1}', total=len(train_dataloader))
        engine.train()
        torch.cuda.empty_cache()
        for step, inputs in enumerate(train_iter):
            labels = inputs['labels'].view(-1).to(args.device)
            inputs = {key: inputs[key].to(args.device)
                      for key in inputs.keys() if key not in ['labels']}
            output = engine(inputs['text_input_ids'], attention_mask=inputs['text_mask'])
            loss = engine.criterion(output.logits, labels)
            engine.backward(loss)
            engine.step()
            epoch_loss += loss
            train_iter.set_postfix_str(
                f'loss: {epoch_loss / (step+1):.4f}')
"This usually occurs because of CUDA out-of-memory." Yes, after enabling ZeRO, memory does seem to overflow: usage keeps growing until OOM. Also, with ZeRO turned on, ColossalAI outputs inf.
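The initial_scale setting and the inf values mentioned in this thread are both related to fp16 loss scaling. The config.py is not shown here, so the following is only a generic, pure-Python sketch of dynamic loss scaling, not ColossalAI's implementation. The class name, parameter names, and default values are assumptions chosen for illustration. It shows why a large initial scale produces overflow (inf) on the first steps, each of which is skipped while the scale is halved.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def overflows_fp16(value):
    """Return True if value cannot be represented as a finite float16."""
    return math.isnan(value) or math.isinf(value) or abs(value) > FP16_MAX


class DynamicLossScaler:
    """Hypothetical scaler: halve on overflow, double after a run of good steps."""

    def __init__(self, initial_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=1000):
        self.scale = float(initial_scale)
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            # The optimizer step would be skipped; shrink the scale and retry.
            self.scale *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.growth_interval:
                self.scale *= self.growth_factor
                self.good_steps = 0


def simulate(initial_scale, grad_magnitude, steps):
    """Count skipped steps for a constant (made-up) unscaled gradient magnitude."""
    scaler = DynamicLossScaler(initial_scale=initial_scale)
    skipped = 0
    for _ in range(steps):
        scaled_grad = grad_magnitude * scaler.scale
        overflow = overflows_fp16(scaled_grad)
        if overflow:
            skipped += 1
        scaler.update(overflow)
    return skipped, scaler.scale


if __name__ == "__main__":
    for exponent in (16, 32):
        skipped, final_scale = simulate(2.0 ** exponent, grad_magnitude=4.0, steps=30)
        print(f"initial_scale=2**{exponent}: skipped={skipped}, final_scale={final_scale}")
```

In this toy setting a scale of 2**16 skips 3 steps and a scale of 2**32 skips 19 before settling at 8192. Overflow messages early in training are therefore expected with a large initial scale and do not by themselves explain an illegal memory access.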
I have the same problem, and I'm sure there is enough GPU memory.
This can mean that the CUDA version and the graphics card are not compatible; try changing one of them.