BERT-NER
How to use GPU for training
Hello Zhou, how do I train on a GPU? When use_tpu is set to false, training uses the CPU by default.
It can use the GPU automatically!
@yaoyao3 check pip list and make sure you only have tensorflow-gpu in it.
If you have both tensorflow and tensorflow-gpu installed, TF will choose the CPU by default, at least in my case.
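If it helps, here is a rough sketch of that check in Python instead of reading pip list by eye. The helper names find_tf_conflict and installed_package_names are made up for this example and are not part of TensorFlow or this repo, and it assumes Python 3.8+ for importlib.metadata.

```python
# Sketch: report which TensorFlow builds are installed in this environment.
# `find_tf_conflict` and `installed_package_names` are hypothetical helpers.
from importlib import metadata  # standard library, Python 3.8+


def find_tf_conflict(installed_names):
    """Return the sorted TensorFlow builds found among `installed_names`."""
    wanted = {"tensorflow", "tensorflow-gpu"}
    # pip treats case and '-' / '_' as equivalent in package names
    normalized = {name.lower().replace("_", "-") for name in installed_names}
    return sorted(wanted & normalized)


def installed_package_names():
    """Names of all distributions visible to this interpreter."""
    names = [dist.metadata["Name"] for dist in metadata.distributions()]
    return [name for name in names if name]  # skip entries without a name


if __name__ == "__main__":
    found = find_tf_conflict(installed_package_names())
    if len(found) > 1:
        print("Both builds are installed:", found)
    else:
        print("TensorFlow builds found:", found)
```

If both names show up, uninstalling both and reinstalling only tensorflow-gpu is what worked in my case; I can't promise it is the cause on every setup.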
@kyzhouhzau I set --use_tpu=True but it doesn't see the GPU:
I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
INFO:tensorflow:Failed to find TPU: _TPUSystemMetadata(num_cores=0, num_hosts=0, num_of_cores_per_host=0, topology=None, devices=(_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 13701800340323951977), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 9332358750793188078)))
I1106 11:04:27.689568 140625411577664 tpu_system_metadata.py:156] Failed to find TPU: _TPUSystemMetadata(num_cores=0, num_hosts=0, num_of_cores_per_host=0, topology=None, devices=(_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 13701800340323951977), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 9332358750793188078)))
ERROR:tensorflow:Error recorded from training_loop: Cannot find any TPU cores in the system. Please double check Tensorflow master address and TPU worker(s). Available devices are (_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 13701800340323951977), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 9332358750793188078)).
E1106 11:04:27.690066 140625411577664 error_handling.py:70] Error recorded from training_loop: Cannot find any TPU cores in the system. Please double check Tensorflow master address and TPU worker(s). Available devices are (_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 13701800340323951977), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 9332358750793188078)).
INFO:tensorflow:training_loop marked as finished
I1106 11:04:27.690162 140625411577664 error_handling.py:96] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W1106 11:04:27.690225 140625411577664 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
File "BERT_NER.py", line 697, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "BERT_NER.py", line 631, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 364, in train
hooks.extend(self._convert_train_steps_to_hooks(steps, max_steps))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2746, in _convert_train_steps_to_hooks
if ctx.is_running_on_cpu():
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 442, in is_running_on_cpu
self._validate_tpu_configuration()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 613, in _validate_tpu_configuration
'are {}.'.format(tpu_system_metadata.devices))
RuntimeError: Cannot find any TPU cores in the system. Please double check Tensorflow master address and TPU worker(s). Available devices are (_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 13701800340323951977), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 9332358750793188078)).
processed 40610 tokens with 4671 phrases; found: 4557 phrases; correct: 4104.
@paulthemagno I think a TPU is different from a GPU. Leave that option set to false.
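The "Available devices" part of the RuntimeError above is worth reading closely: in that run it lists only CPU and XLA_CPU, so TensorFlow did not report a GPU at all. A small sketch for pulling the device types out of such a message (device_types is a made-up helper name, and the regex only assumes the /device:TYPE:N naming seen in the log):

```python
import re

# Matches the device type in names such as ".../device:XLA_CPU:0"
DEVICE_RE = re.compile(r"/device:([A-Z_]+):\d+")


def device_types(log_text):
    """Return the sorted, de-duplicated device types named in `log_text`."""
    return sorted(set(DEVICE_RE.findall(log_text)))


# Device list copied from the RuntimeError in the traceback above
message = (
    "Available devices are (_DeviceAttributes(/job:localhost/replica:0/"
    "task:0/device:CPU:0, CPU, 268435456, 13701800340323951977), "
    "_DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, "
    "XLA_CPU, 17179869184, 9332358750793188078))."
)

types = device_types(message)
print(types)                        # ['CPU', 'XLA_CPU']
print("GPU visible:", "GPU" in types)  # GPU visible: False
```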
@kyzhouhzau I tried setting --use_tpu=False and it sees the GPU, but it only allocates a fixed amount of memory on it (about 100 MiB). All the computation is still executed on the CPU.