gpt-neo
Cannot Connect To Local TPU-VM
Describe the bug
When I try to connect to the TPU to finetune, it gives me this error:
Traceback (most recent call last):
File "main.py", line 257, in <module>
main(args)
File "main.py", line 251, in main
estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=params["train_steps"])
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3110, in train
rendezvous.raise_errors()
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
six.reraise(typ, value, traceback)
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/six.py", line 703, in reraise
raise value
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3100, in train
return super(TPUEstimator, self).train(
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 346, in train
hooks.extend(self._convert_train_steps_to_hooks(steps, max_steps))
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2973, in _convert_train_steps_to_hooks
if ctx.is_running_on_cpu():
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 531, in is_running_on_cpu
self._validate_tpu_configuration()
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 699, in _validate_tpu_configuration
num_cores = self.num_cores
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 429, in num_cores
metadata = self._get_tpu_system_metadata()
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 333, in _get_tpu_system_metadata
tpu_system_metadata_lib._query_tpu_system_metadata(
File "/home/nikhilnayak/.local/lib/python3.8/site-packages/tensorflow/python/tpu/tpu_system_metadata.py", line 135, in _query_tpu_system_metadata
raise RuntimeError(
RuntimeError: Cannot find any TPU cores in the system (master address ). This usually means the master address is incorrect or the TPU worker has some problems. Available devices: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, -3188567715276368833)]
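For reference, the empty master address in the error suggests TensorFlow never resolved the local TPU. A quick way to check whether TensorFlow can see the TPU at all on a TPU VM is a resolver check like the one below. This is a minimal diagnostic sketch, assuming the TF 2.x that ships with the v2-alpha runtime; it is not part of the gpt-neo code, and tpu="local" is simply how TPU VMs are usually addressed since the TPU is attached directly to the VM rather than reached over gRPC.

import tensorflow as tf

# Minimal TPU visibility check on a TPU VM (assumes TF 2.x from the
# v2-alpha runtime). On TPU VMs the resolver is typically created with
# tpu="local" instead of a remote gRPC master address.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("TPU devices:", tf.config.list_logical_devices("TPU"))

If this prints no TPU devices, the problem is with the runtime or the TPU VM itself rather than the finetuning script.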
To Reproduce
Steps to reproduce the behavior: I followed the instructions for finetuning on this GitHub page.
Expected behavior
The finetuning program should finetune on my dataset without errors.
Proposed solution
N/A
Environment (please complete the following information):
- TPU Version: v2-alpha
- TPU Type: v3-8
- Architecture: TPU-VM
This codebase is not actively maintained and was created before TPU VMs existed. You’ll probably have to figure it out yourself, unfortunately.
You may want to check out Mesh Transformer JAX, which is a more actively maintained JAX-based TPU framework.
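For what it's worth, on a TPU VM you can sanity-check that JAX sees the TPU cores with a couple of lines like the following. This assumes a TPU-enabled JAX install (e.g. the jax[tpu] wheel); the exact install command depends on the libtpu release and is not specified here.

import jax

# Sanity check that JAX can see the local TPU on the TPU VM
# (assumes a TPU-enabled JAX install).
print(jax.devices())       # expect eight TpuDevice entries on a v3-8
print(jax.device_count())  # expect 8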