bert4keras
Failed to use horovod for multi-gpu training
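When asking a question, please provide the following information where possible:
Basic information
- Operating system: Ubuntu 20.04
- Python version: 3.8
- TensorFlow version: 2.8.0
- Keras version: 2.8.0
- bert4keras version: 0.11.3
- Pure keras or tf.keras: keras
- Pretrained model loaded:
Core code
Reference: https://horovod.readthedocs.io/en/stable/keras.html
Output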
Epoch 1/100000
/usr/local/lib/python3.8/dist-packages/horovod/_keras/callbacks.py:58: UserWarning: Some callbacks may not have access to the averaged metrics, see https://github.com/horovod/horovod/issues/2440
warnings.warn(
Traceback (most recent call last):
File "train.py", line 205, in <module>
train_model.fit(
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation model_1/Embedding-Token/embedding_lookup: Could not satisfy explicit device specification '' because the node {{colocation_node model_1/Embedding-Token/embedding_lookup}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
What I tried
I tried both TF 1.15 and TF 2.8 and got the same error.
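bert4keras is just a higher-level library on top of keras; in effect it writes some keras code for the user in advance.
Whether horovod can be used, and how to use it, is a keras question, and the developer has not looked into it either, so you are probably asking in the wrong place. You should ask at horovod or at keras instead.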
After replacing the optimizer with keras.optimizers.Adam, the problem went away.
# from bert4keras.optimizers import Adam
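For anyone hitting the same error, the workaround above might look roughly like the sketch below. It is only an illustration under assumptions: the helper name `make_distributed_optimizer` and the parameters `base_lr` and `scale_lr` are made up for this example, and the learning-rate scaling follows the convention in the horovod Keras guide linked above rather than anything specific to bert4keras. The horovod module is passed in as an argument so the helper itself has no hard dependency.

```python
def make_distributed_optimizer(hvd, optimizer_cls, base_lr=1e-5, scale_lr=True):
    """Build a stock Keras optimizer and wrap it for horovod.

    hvd:           an already-initialized horovod Keras module (hvd.init() was called)
    optimizer_cls: a stock optimizer class such as keras.optimizers.Adam,
                   used here in place of bert4keras.optimizers.Adam
    base_lr:       hypothetical single-worker learning rate
    scale_lr:      if True, multiply the learning rate by the number of workers
    """
    lr = base_lr * hvd.size() if scale_lr else base_lr
    optimizer = optimizer_cls(learning_rate=lr)
    # DistributedOptimizer combines gradients across workers before they are applied.
    return hvd.DistributedOptimizer(optimizer)
```

Assuming horovod and keras are installed, usage would be along the lines of `import horovod.keras as hvd`, `hvd.init()`, then `train_model.compile(optimizer=make_distributed_optimizer(hvd, keras.optimizers.Adam), ...)` before calling `train_model.fit(...)`. Whether scaling the learning rate is appropriate for a given fine-tuning run is a judgment call, so `scale_lr=False` keeps the single-worker value.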