DeepLearningExamples
biobert for tensorflow
When executing the commands listed in this report, and specifically this one:
bash scripts/docker/launch.sh
nothing happens apart from this message:
NOTE: MOFED driver for multi-node communication was not detected. Multi-node communication performance may be reduced.
Updating the NVIDIA driver did not get rid of the message.
A more serious issue occurs when running phase 1:
cat /results/tf_bert_bio_1n_phase1_cased_false_fp16_gbs0.221011134537.log
2022-10-11 13:45:37.859851: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod-0.19.1-py3.6-linux-x86_64.egg/horovod/tensorflow/__init__.py:152: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod-0.19.1-py3.6-linux-x86_64.egg/horovod/tensorflow/__init__.py:178: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:
- https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
- https://github.com/tensorflow/addons
- https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
W1011 13:45:39.003334 140380444518208 lazy_loader.py:50] The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:
- https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
- https://github.com/tensorflow/addons
- https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
WARNING:tensorflow:From /workspace/bert/run_pretraining.py:593: The name tf.enable_resource_variables is deprecated. Please use tf.compat.v1.enable_resource_variables instead.
W1011 13:45:39.425518 140380444518208 module_wrapper.py:139] From /workspace/bert/run_pretraining.py:593: The name tf.enable_resource_variables is deprecated. Please use tf.compat.v1.enable_resource_variables instead.
INFO:tensorflow:Using config: {'_model_dir': '/results/biobert_phase_1', '_tf_random_seed': None, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': graph_options {
optimizer_options {
global_jit_level: ON_1
}
rewrite_options {
memory_optimization: NO_MEM_OPT
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fac16593240>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
I1011 13:45:39.426212 140380444518208 estimator.py:212] Using config: {'_model_dir': '/results/biobert_phase_1', '_tf_random_seed': None, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': graph_options {
optimizer_options {
global_jit_level: ON_1
}
rewrite_options {
memory_optimization: NO_MEM_OPT
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fac16593240>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.
data/ops/readers.py:336 __init__
filenames, compression_type, buffer_size, num_parallel_reads)
data/ops/readers.py:296 __init__
filenames = _create_or_validate_filenames_dataset(filenames)
data/ops/readers.py:56 _create_or_validate_filenames_dataset
filenames = ops.convert_to_tensor(filenames, dtype=dtypes.string)
framework/ops.py:1184 convert_to_tensor
return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
framework/ops.py:1242 convert_to_tensor_v2
as_ref=False)
framework/ops.py:1273 internal_convert_to_tensor
(dtype.name, value.dtype.name, value))
ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor 'args_0:0' shape=() dtype=float32>
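
One possible cause, offered as an assumption rather than a confirmed diagnosis: the list of input files handed to the dataset reader may be empty. In TensorFlow, `tf.constant([])` defaults to dtype float32, so an empty file list would reach the reader as a float32 tensor instead of a string tensor, which matches the shape of the error above. The sketch below uses only the Python standard library to check whether a file pattern matches anything before training starts. The function names and the example pattern are hypothetical and are not taken from `run_pretraining.py`.

```python
import glob


def find_input_files(pattern):
    """Return the sorted files matching a comma-separated list of glob patterns."""
    files = []
    for part in pattern.split(","):
        part = part.strip()
        if part:
            files.extend(glob.glob(part))
    return sorted(files)


def require_input_files(pattern):
    """Raise a readable error if no input file matches, instead of failing later.

    Assumption: an empty list would otherwise be converted to a float32 tensor
    (the default dtype of tf.constant([])) and rejected by the filename check.
    """
    files = find_input_files(pattern)
    if not files:
        raise FileNotFoundError(
            "No input files match pattern(s): %r" % pattern
        )
    return files


# Hypothetical usage; the path is a placeholder, not the one used by the scripts:
# files = require_input_files("/path/to/pretraining/data/*.tfrecord")
```

If the check raises, the data preparation step probably did not produce files at the location the training script reads from, or the directory is not mounted into the container.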