DeepLearningExamples

BioBERT for TensorFlow

Open joepareti54 opened this issue 2 years ago • 0 comments

When executing the commands listed in this report, specifically:

bash scripts/docker/launch.sh

nothing happens apart from this message:

NOTE: MOFED driver for multi-node communication was not detected. Multi-node communication performance may be reduced.

Updating the NVIDIA driver did not get rid of the message.

A more serious issue occurs when running phase 1:

cat /results/tf_bert_bio_1n_phase1_cased_false_fp16_gbs0.221011134537.log

2022-10-11 13:45:37.859851: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0

WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod-0.19.1-py3.6-linux-x86_64.egg/horovod/tensorflow/__init__.py:152: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod-0.19.1-py3.6-linux-x86_64.egg/horovod/tensorflow/__init__.py:178: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

  • https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  • https://github.com/tensorflow/addons
  • https://github.com/tensorflow/io (for I/O related ops)

If you depend on functionality not listed there, please file an issue.

W1011 13:45:39.003334 140380444518208 lazy_loader.py:50] The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

  • https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  • https://github.com/tensorflow/addons
  • https://github.com/tensorflow/io (for I/O related ops)

If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /workspace/bert/run_pretraining.py:593: The name tf.enable_resource_variables is deprecated. Please use tf.compat.v1.enable_resource_variables instead.

W1011 13:45:39.425518 140380444518208 module_wrapper.py:139] From /workspace/bert/run_pretraining.py:593: The name tf.enable_resource_variables is deprecated. Please use tf.compat.v1.enable_resource_variables instead.

INFO:tensorflow:Using config: {'_model_dir': '/results/biobert_phase_1', '_tf_random_seed': None, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': graph_options { optimizer_options { global_jit_level: ON_1 } rewrite_options { memory_optimization: NO_MEM_OPT } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fac16593240>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

I1011 13:45:39.426212 140380444518208 estimator.py:212] Using config: {'_model_dir': '/results/biobert_phase_1', '_tf_random_seed': None, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': graph_options { optimizer_options { global_jit_level: ON_1 } rewrite_options { memory_optimization: NO_MEM_OPT } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fac16593240>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7fac165f7158>) includes params argument, but params are not passed to Estimator.

W1011 13:45:39.426949 140380444518208 model_fn.py:630] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7fac165f7158>) includes params argument, but params are not passed to Estimator.

INFO:tensorflow:***** Running training *****

I1011 13:45:39.427634 140380444518208 run_pretraining.py:630] ***** Running training *****

INFO:tensorflow: Batch size = 128

I1011 13:45:39.427782 140380444518208 run_pretraining.py:631] Batch size = 128

Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 723, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/workspace/bert/run_pretraining.py", line 641, in main
    estimator.train(input_fn=train_input_fn, hooks=training_hooks, max_steps=FLAGS.num_train_steps)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
    input_fn, ModeKeys.TRAIN))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1025, in _get_features_and_labels_from_input_fn
    self._call_input_fn(input_fn, mode))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1116, in _call_input_fn
    return input_fn(**kwargs)
  File "/workspace/bert/run_pretraining.py", line 513, in input_fn
    cycle_length=cycle_length))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 1999, in apply
    return DatasetV1Adapter(super(DatasetV1, self).apply(transformation_func))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 1384, in apply
    dataset = transformation_func(self)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/experimental/ops/interleave_ops.py", line 94, in _apply_fn
    buffer_output_elements, prefetch_input_elements)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/readers.py", line 226, in __init__
    map_func, self._transformation_name(), dataset=input_dataset)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2722, in __init__
    self._function = wrapper_fn._get_concrete_function_internal()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1853, in _get_concrete_function_internal
    *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1847, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2147, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2038, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2716, in wrapper_fn
    ret = _wrapper_helper(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2661, in _wrapper_helper
    ret = autograph.tf_convert(func, ag_ctx)(*nested_args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in converted code:
relative to /usr/local/lib/python3.6/dist-packages/tensorflow_core/python:

data/ops/readers.py:336 __init__
    filenames, compression_type, buffer_size, num_parallel_reads)
data/ops/readers.py:296 __init__
    filenames = _create_or_validate_filenames_dataset(filenames)
data/ops/readers.py:56 _create_or_validate_filenames_dataset
    filenames = ops.convert_to_tensor(filenames, dtype=dtypes.string)
framework/ops.py:1184 convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
framework/ops.py:1242 convert_to_tensor_v2
    as_ref=False)
framework/ops.py:1273 internal_convert_to_tensor
    (dtype.name, value.dtype.name, value))

ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor 'args_0:0' shape=() dtype=float32>
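
A note on one possible cause, offered as an assumption rather than a diagnosis: TensorFlow gives an empty Python list the default dtype float32, so an empty list of input files reaching the TFRecord reader would produce exactly this float32-versus-string mismatch. Whether that is what happened here cannot be confirmed from the log alone. The sketch below is a pre-flight check that could be run before training to rule it out. `expand_input_files` is a hypothetical helper, not part of run_pretraining.py, and the comma-separated glob format it accepts is assumed for illustration.

```python
import glob
from typing import List


def expand_input_files(patterns: str) -> List[str]:
    """Expand comma-separated glob patterns into a sorted list of file paths.

    Hypothetical helper: raises FileNotFoundError when nothing matches,
    so an empty file list never reaches the tf.data input pipeline.
    """
    files: List[str] = []
    for pattern in patterns.split(","):
        pattern = pattern.strip()
        if pattern:
            # glob.glob returns [] for patterns that match nothing
            files.extend(sorted(glob.glob(pattern)))
    if not files:
        raise FileNotFoundError(
            f"no input files matched {patterns!r}; "
            "check the data directory mounted into the container"
        )
    return files
```

Calling this on the location that should hold the phase 1 TFRecords, before `estimator.train` is reached, would turn a missing-data problem into an explicit error message instead of the tensor conversion failure above.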

joepareti54 • Oct 10 '22 12:10