bert4keras multi-GPU: OOM when using MirroredStrategy, still failing after reducing the batch size

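When asking a question, please provide as much of the following information as possible:

Basic information

Operating system: Linux
Python version: 3.7.3
Tensorflow version: 2.3.1
Keras version: 2.3.1
bert4keras version: 0.8.8
Pure keras or tf.keras: tf.keras
Pretrained model loaded: bert

Core code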

```python
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Everything up to model.compile goes inside the scope, see
    # https://github.com/bojone/bert4keras/issues/154
    # train_model = build_transformer_model_for_pretraining()
    # Load the pretrained model
    bert = build_transformer_model(
        model='bert',
        config_path=config_path,
        checkpoint_path=checkpoint_path,
        with_pool=True,
        return_keras_model=False,
    )

    output = Dropout(rate=0.1)(bert.model.output)
    output = Dense(
        units=2, activation='softmax', kernel_initializer=bert.initializer
    )(output)

    model = keras.models.Model(bert.model.input, output)
    model.summary()
    model.compile(
        loss=custom_loss,
        optimizer=Adam(2e-4),  # use a sufficiently small learning rate
        # optimizer=PiecewiseLinearLearningRate(Adam(5e-5), {10000: 1, 30000: 0.1}),
        metrics=['accuracy'],
    )
```

Output

# Please paste your debug output here

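What I have tried

Whatever the problem, please try to solve it yourself first, and only ask once "every effort" has failed. Please describe what you tried here.

I have a few questions:

1. In other (non-bert4keras) code I have seen Keras's utils.multi_gpu_model being used. I tried adding it to my bert4keras code, but it failed. Can this approach be used?
2. After reading up on the OOM error, I found two suggestions. One is to reduce the batch size (already done, still fails). The other is that my dataset generator also has to be adapted for multi-GPU. Is that necessary?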

Jan 21 '21 03:01 lhbrichard

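I have little experience with multi-GPU training and basically no need for it myself, so I really can't offer any useful advice. As far as I remember, multi-GPU training requires converting the training data to the `tf.data.Dataset` format.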
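
For illustration, a minimal sketch of what such a conversion might look like, assuming the data loader can be written as a plain Python generator that yields `(token_ids, segment_ids)` and a label. The names `sample_generator` and `make_dataset`, the token ids, and the label shape are all hypothetical; the label shape in particular depends on what `custom_loss` expects. It uses the `output_types`/`output_shapes` arguments of `tf.data.Dataset.from_generator`, which is the form available in TF 2.3.

```python
import tensorflow as tf


def sample_generator():
    """Hypothetical stand-in for the real data loader.

    Yields one unpadded sample at a time: ((token_ids, segment_ids), label).
    """
    samples = [
        ([101, 2769, 102], [0, 0, 0], 1),
        ([101, 2769, 4263, 872, 102], [0, 0, 0, 0, 0], 0),
    ]
    for token_ids, segment_ids, label in samples:
        yield (token_ids, segment_ids), [label]


def make_dataset(global_batch_size):
    """Wrap the generator in a tf.data.Dataset and pad each batch."""
    dataset = tf.data.Dataset.from_generator(
        sample_generator,
        output_types=((tf.int32, tf.int32), tf.int32),
        output_shapes=(
            (tf.TensorShape([None]), tf.TensorShape([None])),
            tf.TensorShape([1]),
        ),
    )
    # Without padded_shapes (TF >= 2.2), each batch is padded with zeros
    # to the longest sequence it contains.
    return dataset.padded_batch(global_batch_size)


dataset = make_dataset(global_batch_size=2)
for (token_ids, segment_ids), labels in dataset.take(1):
    print(token_ids.shape, segment_ids.shape, labels.shape)

# The dataset would then be passed to a model that was built and compiled
# inside strategy.scope(), e.g. model.fit(dataset, epochs=...).
```

The batch size set here is the global one; `MirroredStrategy` divides each batch among the available GPUs.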

Jan 22 '21 03:01 bojone

Keep reducing the batch size, for example set it to 1, and check whether the OOM message still appears.
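
One thing that may help when interpreting this test: as far as I understand, with `MirroredStrategy` and `model.fit` the batch size set on the input is the global batch size, which is divided among the GPUs, so what matters for per-GPU memory is roughly the global batch size divided by the number of replicas. A small sketch of that arithmetic (the helper name and the numbers are made up for illustration):

```python
def per_replica_batch_size(global_batch_size, num_replicas):
    """Largest number of samples a single replica receives per step."""
    # Ceiling division: when the batch does not divide evenly,
    # some replicas get one sample more than others.
    return -(-global_batch_size // num_replicas)


# Hypothetical setup: 4 GPUs
print(per_replica_batch_size(32, 4))  # 8 samples per GPU
print(per_replica_batch_size(1, 4))   # 1 sample on one GPU
```

If OOM still occurs at a global batch size of 1, the batch size is probably not the cause. In real code the number of replicas can be read from `strategy.num_replicas_in_sync`.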

Jan 25 '21 07:01 pcibusgood

Please try this script: https://github.com/bojone/bert4keras/blob/master/examples/task_seq2seq_autotitle_multigpu.py

Jan 30 '21 06:01 bojone

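I'm using tf 2.1.0 and training hangs at the start of the epoch and never gets going. Did you manage to solve this? Does the data need to be converted to `tf.data.Dataset`?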

Mar 03 '23 08:03 Chouyuhin