Out of GPU memory with batch_size=1
- Version and environment information
  1) PaddleNLP 2.3, PaddlePaddle 2.3
  2) System environment: Linux, Python 3.7
  3) batch_size=1, max_seq_length=512; 600 training, 200 test, and 200 dev examples
`# Model training
# model, criterion, metric, optimizer, lr_scheduler, tokenizer, evaluate and
# the data loaders are defined elsewhere (not shown in this report).
import os
import time

import paddle
import paddle.nn.functional as F
import pandas as pd

save_dir = "checkpoint/bert-wwm"
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

save_train_result = "./results/bert-wwm.tsv"
train_r_df = pd.DataFrame(data=None, columns=["global_step", "epoch", "step", "loss", "acc", "time"])

pre_accu = 0
accu = 0
global_step = 0
epochs = 10
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        start = time.time()
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 2 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
        # Record the running time of this step
        end = time.time()
        train_r_df = train_r_df.append({"global_step": global_step, "epoch": epoch, "step": step, "loss": loss, "acc": acc, "time": end - start}, ignore_index=True)
    # Evaluate on the dev set at the end of each epoch
    accu = evaluate(model, criterion, metric, dev_data_loader)
    print(accu)
    if accu > pre_accu:
        # Save the model parameters when they beat the previous best epoch
        save_param_path = os.path.join(save_dir, 'model_state.pdparams')  # path for the model parameters
        paddle.save(model.state_dict(), save_param_path)
        pre_accu = accu
        tokenizer.save_pretrained(save_dir)
train_r_df.to_csv(save_train_result, sep="\t", index=False, header=True)`
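One detail in the training loop that may be worth checking: `loss` is appended to the DataFrame as a tensor object rather than a plain number, and `DataFrame.append` copies the whole frame on every step (it was deprecated in pandas 1.4 and removed in 2.0). Whether this contributes to the out-of-memory error is not established in this thread. A minimal sketch of an alternative, which collects plain Python floats in a list and writes the file once. `FakeLoss` is a hypothetical stand-in for a one-element tensor, and `make_row` and `write_rows` are illustrative helpers, not part of PaddleNLP:

```python
import csv
import io


class FakeLoss:
    """Hypothetical stand-in for a one-element tensor that supports float()."""

    def __init__(self, value):
        self._value = value

    def __float__(self):
        return float(self._value)


def make_row(global_step, epoch, step, loss, acc, elapsed):
    # float() turns the tensor-like loss into a plain Python number, so the
    # log keeps no reference to the tensor object itself.
    return {
        "global_step": global_step,
        "epoch": epoch,
        "step": step,
        "loss": float(loss),
        "acc": float(acc),
        "time": elapsed,
    }


def write_rows(rows, fileobj):
    # Write every row in one pass, tab-separated, with a header line.
    fieldnames = ["global_step", "epoch", "step", "loss", "acc", "time"]
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)


rows = [make_row(1, 1, 1, FakeLoss(0.75), 0.5, 0.12)]
buffer = io.StringIO()  # a real run would open save_train_result instead
write_rows(rows, buffer)
print(buffer.getvalue())
```

In the real loop, `rows.append(make_row(...))` would replace the per-step `train_r_df.append(...)` call, with a single write after training finishes.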
- Error message:
SystemError: (Fatal) Operator dropout raises an paddle::memory::allocation::BadAlloc exception. The exception content is :ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 12.000000MB memory on GPU 0, 39.397339GB memory has been allocated and available memory is only 11.562500MB.
Please check whether there is any other process using GPU 0.
- If yes, please stop them, or start PaddlePaddle on another GPU.
- If no, please decrease the batch size of your model. If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is
export FLAGS_use_cuda_managed_memory=false. (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:87) . (at /paddle/paddle/fluid/imperative/tracer.cc:307)
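The figures in the error message can be pulled out and compared directly. The sketch below parses the message text quoted in this report; the regular expression is written against this one message and is an assumption about its format, not a documented PaddlePaddle interface. It also assumes binary units (1 GB = 1024 MB).

```python
import re

# Text copied from the error message in this report.
message = (
    "Out of memory error on GPU 0. Cannot allocate 12.000000MB memory on GPU 0, "
    "39.397339GB memory has been allocated and available memory is only 11.562500MB."
)

# Assumption: the message uses binary units.
UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}

PATTERN = re.compile(
    r"Cannot allocate (?P<req>[\d.]+)(?P<req_u>KB|MB|GB) memory on GPU (?P<gpu>\d+), "
    r"(?P<used>[\d.]+)(?P<used_u>KB|MB|GB) memory has been allocated and "
    r"available memory is only (?P<free>[\d.]+)(?P<free_u>KB|MB|GB)"
)


def parse_oom(text):
    """Return the requested, allocated and free memory in MB, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {
        "gpu": int(match.group("gpu")),
        "requested_mb": float(match.group("req")) * UNIT_TO_MB[match.group("req_u")],
        "allocated_mb": float(match.group("used")) * UNIT_TO_MB[match.group("used_u")],
        "free_mb": float(match.group("free")) * UNIT_TO_MB[match.group("free_u")],
    }


info = parse_oom(message)
print(info)
print("allocated + free (GB):", (info["allocated_mb"] + info["free_mb"]) / 1024)
```

Allocated plus free memory comes to roughly 39.4 GB, so the card appears to have been almost completely full when a 12 MB request failed. The thread does not say which GPU was used or whether another process was sharing it.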
The models used are bert-wwm-chinese, albert-chinese-tiny, and skep_ernie_1.0_large_ch.
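For a sense of scale, here is a rough back-of-the-envelope estimate of fp32 training memory for a BERT-base-sized model at sequence length 512 and batch size 1. Everything in it is an assumption made for illustration: about 110 million parameters (the commonly cited size of BERT-base; the exact counts for the models in this report may differ), 12 layers with 12 attention heads, an Adam-style optimizer that keeps two extra buffers per parameter, and no mixed precision. It ignores most activations and framework workspace, so it is only a coarse lower bound.

```python
GIB = 2 ** 30


def training_state_gib(num_params, bytes_per_value=4, copies=4):
    # copies = weights + gradients + two Adam-style moment buffers (assumed)
    return num_params * bytes_per_value * copies / GIB


def attention_scores_gib(layers, heads, seq_len, batch_size=1, bytes_per_value=4):
    # One seq_len x seq_len score matrix per head, per layer, per sample.
    return layers * heads * seq_len * seq_len * batch_size * bytes_per_value / GIB


# Assumed BERT-base-scale figures, not measured from the models in this report.
state = training_state_gib(110_000_000)
scores = attention_scores_gib(layers=12, heads=12, seq_len=512)

print("weights + gradients + optimizer state (GiB):", round(state, 2))
print("attention score matrices (GiB):", round(scores, 3))
```

Under these assumptions the total stays in the low single-digit GiB range, far below the 39 GB reported in the error message, which suggests looking at other processes on the GPU or at memory that grows over training steps. This is an estimate, not a measurement.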
How much GPU memory do you have? Could you try a different card?
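One way to answer this is to query the driver with `nvidia-smi`. The sketch below wraps the `--query-gpu` option and parses its CSV output; it assumes `nvidia-smi` is on the `PATH`, returns an empty list when it is not, and the sample line passed to the parser is made up for illustration.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_lines(text):
    """Parse 'index, name, total, used' CSV lines (memory values in MiB)."""
    gpus = []
    for line in text.strip().splitlines():
        try:
            index, rest = line.split(",", 1)
            name, total, used = rest.rsplit(",", 2)
            gpus.append({
                "index": int(index),
                "name": name.strip(),
                "total_mib": int(total),
                "used_mib": int(used),
            })
        except ValueError:
            # Skip lines that do not match the expected layout.
            continue
    return gpus


def query_gpus():
    # Return an empty list when nvidia-smi is missing or fails.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_lines(result.stdout)


# Hypothetical sample line, not output from the reporter's machine.
sample = parse_gpu_lines("0, Example GPU, 40960, 40342\n")
print(sample)
print(query_gpus())
```

Running it before starting training shows whether the card is already occupied by another process.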
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
