PaddleNLP: GPU out of memory with batch_size=1

Version and environment information
1) PaddleNLP 2.3, PaddlePaddle 2.3
2) System environment: Linux, Python 3.7
3) batch_size=1, max_seq_length=512; 600 training examples, 200 test examples, 200 dev examples

```python
# Model training
import os
import time

import paddle
import paddle.nn.functional as F
import pandas as pd

# model, criterion, metric, optimizer, lr_scheduler, tokenizer, evaluate and the
# data loaders are assumed to be defined before this snippet; the issue does not show them.

save_dir = "checkpoint/bert-wwm"
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

save_train_result = "./results/bert-wwm.tsv"
train_r_df = pd.DataFrame(data=None, columns=["global_step", "epoch", "step", "loss", "acc", "time"])

pre_accu = 0
accu = 0
global_step = 0
epochs = 10
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        start = time.time()
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 2 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f"
                  % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
        # Record the running time of this step
        end = time.time()
        train_r_df = train_r_df.append(
            {"global_step": global_step, "epoch": epoch, "step": step,
             "loss": loss, "acc": acc, "time": end - start},
            ignore_index=True)
    # Evaluate on the dev set at the end of each epoch
    accu = evaluate(model, criterion, metric, dev_data_loader)
    print(accu)
    if accu > pre_accu:
        # Save the parameters only when they beat the previous best epoch
        save_param_path = os.path.join(save_dir, 'model_state.pdparams')
        paddle.save(model.state_dict(), save_param_path)
        pre_accu = accu
tokenizer.save_pretrained(save_dir)
train_r_df.to_csv(save_train_result, sep="\t", index=False, header=True)
```
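One thing worth checking in a loop like this is what the training log stores. The row written to `train_r_df` contains `loss` itself, which is a tensor, and a tensor that stays referenced from a Python container cannot be released. Whether that explains the error here is not established by the issue. The sketch below shows one way to keep only plain Python numbers in the log. The names `to_scalar`, `make_record` and `FakeTensor` are hypothetical helpers written for this illustration; the only assumption about the tensor is that it exposes an `item()` method, as one-element Paddle tensors do.

```python
def to_scalar(value):
    """Convert a one-element tensor (anything with .item()) or a number to float."""
    item = getattr(value, "item", None)
    return float(item()) if callable(item) else float(value)


def make_record(global_step, epoch, step, loss, acc, elapsed):
    """Build one log row that holds only plain Python numbers."""
    return {
        "global_step": global_step,
        "epoch": epoch,
        "step": step,
        "loss": to_scalar(loss),
        "acc": to_scalar(acc),
        "time": elapsed,
    }


class FakeTensor:
    """Stand-in for a framework tensor, used only to exercise the helpers."""

    def __init__(self, v):
        self._v = v

    def item(self):
        return self._v


# Collect rows in a list; a DataFrame can be built from it once training ends.
records = [make_record(1, 1, 1, FakeTensor(0.6931), 0.5, 0.12)]
print(records[0])
```

In the training loop, the row could be built with `make_record(global_step, epoch, step, loss, acc, end - start)` and appended to a list, then turned into a DataFrame with `pd.DataFrame(records)` before writing the TSV.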

Error message:

SystemError: (Fatal) Operator dropout raises an paddle::memory::allocation::BadAlloc exception. The exception content is :ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 12.000000MB memory on GPU 0, 39.397339GB memory has been allocated and available memory is only 11.562500MB.

Please check whether there is any other process using GPU 0.

If yes, please stop them, or start PaddlePaddle on another GPU.

If no, please decrease the batch size of your model. If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is export FLAGS_use_cuda_managed_memory=false. (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:87) . (at /paddle/paddle/fluid/imperative/tracer.cc:307)
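The numbers in the message can be read off mechanically. The sketch below parses the "Cannot allocate ..." sentence and converts everything to MB, assuming that the message uses 1 GB = 1024 MB (an assumption, not something the log states). `parse_oom` and `OOM_PATTERN` are hypothetical names for this illustration, and the pattern is written against the wording of this particular message only.

```python
import re

# Matches the wording of the message quoted in this issue.
OOM_PATTERN = re.compile(
    r"Cannot allocate (?P<req>[\d.]+)(?P<req_unit>[KMG]B) memory on GPU (?P<gpu>\d+), "
    r"(?P<used>[\d.]+)(?P<used_unit>[KMG]B) memory has been allocated and "
    r"available memory is only (?P<free>[\d.]+)(?P<free_unit>[KMG]B)"
)

# Assumption: binary units, i.e. 1 GB = 1024 MB and 1 MB = 1024 KB.
UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def parse_oom(message):
    """Return requested/allocated/available memory in MB, or None if no match."""
    m = OOM_PATTERN.search(message)
    if m is None:
        return None

    def to_mb(number_group, unit_group):
        return float(m.group(number_group)) * UNIT_TO_MB[m.group(unit_group)]

    return {
        "gpu": int(m.group("gpu")),
        "requested_mb": to_mb("req", "req_unit"),
        "allocated_mb": to_mb("used", "used_unit"),
        "available_mb": to_mb("free", "free_unit"),
    }


message = (
    "Out of memory error on GPU 0. Cannot allocate 12.000000MB memory on GPU 0, "
    "39.397339GB memory has been allocated and available memory is only 11.562500MB."
)
info = parse_oom(message)
shortfall = info["requested_mb"] - info["available_mb"]
print(info)
print("shortfall in MB:", shortfall)
```

Read this way, the failed request is small (12 MB against 11.5625 MB free), while about 39.4 GB is already allocated on the card. That points at what is occupying the card rather than at the size of this single allocation.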

Aug 01 '22 08:08 zoeChen119

The models I am using are bert-wwm-chinese, albert-chinese-tiny, and skep_ernie_1.0_large_ch.

Aug 01 '22 09:08 zoeChen119

How much GPU memory do you have? Could you try a different card?

Aug 01 '22 12:08 LiuChiachi
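A minimal sketch of how that question could be answered from Python, assuming an NVIDIA driver with `nvidia-smi` on the PATH. The helper names (`parse_gpu_memory`, `query_gpu_memory`, `pick_freest_gpu`) are hypothetical, and the sample output is made up for a two-card machine; it is not taken from the reporter's system.

```python
import subprocess

# nvidia-smi query that prints one "index, used, total" line per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines (MiB) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {"index": index, "used_mib": used, "total_mib": total, "free_mib": total - used}
        )
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory; requires an NVIDIA driver."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)


def pick_freest_gpu(gpus):
    """Return the index of the GPU with the most free memory."""
    return max(gpus, key=lambda g: g["free_mib"])["index"]


# Hypothetical nvidia-smi output (made-up numbers), used so the sketch runs anywhere.
sample = "0, 40342, 40536\n1, 3, 40536\n"
gpus = parse_gpu_memory(sample)
best = pick_freest_gpu(gpus)
print(gpus)
print("GPU with the most free memory:", best)
```

On a real machine, `query_gpu_memory()` replaces the sample string. If another card turns out to be free, training can be moved there, for example by setting `CUDA_VISIBLE_DEVICES` before launching or by calling `paddle.set_device("gpu:1")`.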

How much GPU memory do you have? Could you try a different card?

Aug 05 '22 07:08 zoeChen119

This issue is stale because it has been open for 60 days with no activity.

Dec 08 '22 02:12 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

Dec 22 '22 16:12 github-actions[bot]