Multihost training intermittently crashes when loading the next batch
Hi,
I was testing multi-host training on a v4-16 TPU VM. Training normally runs smoothly, but it occasionally crashes in load_next_batch with the following error from process 0:
completed step: 80041, seconds: 0.448, TFLOP/s/device: 62.454, loss: 3.111
completed step: 80042, seconds: 0.628, TFLOP/s/device: 44.624, loss: 3.115
completed step: 80043, seconds: 0.271, TFLOP/s/device: 103.424, loss: 3.052
completed step: 80044, seconds: 0.447, TFLOP/s/device: 62.600, loss: 3.087
completed step: 80045, seconds: 0.448, TFLOP/s/device: 62.527, loss: 3.099
completed step: 80046, seconds: 0.448, TFLOP/s/device: 62.530, loss: 3.087
completed step: 80047, seconds: 0.448, TFLOP/s/device: 62.492, loss: 3.088
completed step: 80048, seconds: 0.454, TFLOP/s/device: 61.738, loss: 3.092
completed step: 80049, seconds: 0.443, TFLOP/s/device: 63.173, loss: 3.093
completed step: 80050, seconds: 0.448, TFLOP/s/device: 62.510, loss: 3.041
To see full metrics 'tensorboard --logdir=gs://maxtext_multihost_job/gpt2_steps12w/tensorboard/'
I0718 10:27:47.821592 139629892142656 grain_pool.py:398] Grain pool is exiting.
I0718 10:27:47.821762 139629892142656 grain_pool.py:403] Shutting down multiprocessing system.
I0718 10:27:50.149074 139629892142656 grain_pool.py:403] Shutting down multiprocessing system.
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "/home/yfan/maxtext/MaxText/train.py", line 669, in <module>
app.run(main)
File "/home/yfan/.local/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/yfan/.local/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/yfan/maxtext/MaxText/train.py", line 665, in main
train_loop(config)
File "/home/yfan/maxtext/MaxText/train.py", line 561, in train_loop
example_batch = load_next_batch(data_iterator, example_batch, config)
File "/home/yfan/maxtext/MaxText/train.py", line 94, in load_next_batch
return next(train_iter)
File "/home/yfan/maxtext/MaxText/multihost_dataloading.py", line 119, in __next__
return get_next_batch_sharded(self.local_iterator, self.global_mesh)
File "/home/yfan/maxtext/MaxText/multihost_dataloading.py", line 78, in get_next_batch_sharded
local_data = next(local_iterator)
File "/home/yfan/.local/lib/python3.10/site-packages/grain/_src/python/data_loader.py", line 416, in __next__
result_record = next(self._iterator)
File "/home/yfan/.local/lib/python3.10/site-packages/grain/_src/python/data_loader.py", line 348, in _iterator_with_context
yield from it
File "/home/yfan/.local/lib/python3.10/site-packages/grain/_src/python/grain_pool.py", line 634, in __next__
result = multiprocessing_common.get_async_result(
File "/home/yfan/.local/lib/python3.10/site-packages/grain/_src/python/multiprocessing_common.py", line 81, in get_async_result
return async_result.get(timeout=_ASYNC_RESULT_WAIT_TIMEOUT_SECONDS)
File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/yfan/.local/lib/python3.10/site-packages/grain/_src/python/grain_pool.py", line 529, in _open_shared_memory_for_structure
structure.data = tree.map_structure(
File "/home/yfan/.local/lib/python3.10/site-packages/jax/_src/tree_util.py", line 343, in tree_map
return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
File "/home/yfan/.local/lib/python3.10/site-packages/jax/_src/tree_util.py", line 343, in <genexpr>
return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
File "/home/yfan/.local/lib/python3.10/site-packages/grain/_src/python/grain_pool.py", line 522, in _open_shared_memory_for_leaf
element = shared_memory_array.SharedMemoryArray.from_metadata(element)
File "/home/yfan/.local/lib/python3.10/site-packages/grain/_src/python/shared_memory_array.py", line 99, in from_metadata
shm = shared_memory.SharedMemory(metadata.name)
File "/usr/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
self._fd = _posixshmem.shm_open(
FileNotFoundError: [Errno 2] No such file or directory: '/psm_dcc9e254'
2024-07-18 10:27:50.743530: I external/xla/xla/pjrt/distributed/client.cc:141] Distributed task shutdown initiated.
E0718 10:32:50.744105 128013 coordination_service_agent.cc:514] Failed to disconnect from coordination service with status: DEADLINE_EXCEEDED: Deadline Exceeded
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/ShutdownTask:
:{"created":"@1721298770.744007696","description":"Error received from peer ipv4:10.130.0.10:8476","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}
Proceeding with agent shutdown anyway. This is usually caused by an earlier error during execution. Check the logs (this task or the leader) for an earlier error to debug further.
2024-07-18 10:32:50.744385: I external/xla/xla/pjrt/distributed/client.cc:143] Distributed task shutdown result: DEADLINE_EXCEEDED: Deadline Exceeded
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/ShutdownTask:
:{"created":"@1721298770.744007696","description":"Error received from peer ipv4:10.130.0.10:8476","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}
2024-07-18 10:32:50.744433: I external/xla/xla/tsl/distributed_runtime/preemption/preemption_sync_manager.cc:168] Cancelled call to retrieve preemption notice. This is expected upon program shutdown.
Exception ignored in atexit callback: <function shutdown at 0x7f0499b6c820>
Traceback (most recent call last):
File "/home/yfan/.local/lib/python3.10/site-packages/jax/_src/distributed.py", line 208, in shutdown
global_state.shutdown()
File "/home/yfan/.local/lib/python3.10/site-packages/jax/_src/distributed.py", line 110, in shutdown
self.client.shutdown()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Deadline Exceeded
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/ShutdownTask:
:{"created":"@1721298770.744007696","description":"Error received from peer ipv4:10.130.0.10:8476","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}
Exception ignored in: <function GCSRecordWriter.__del__ at 0x7f03a6ab4af0>
Traceback (most recent call last):
File "/home/yfan/.local/lib/python3.10/site-packages/tensorboardX/record_writer.py", line 134, in __del__
File "/home/yfan/.local/lib/python3.10/site-packages/tensorboardX/record_writer.py", line 158, in close
File "/home/yfan/.local/lib/python3.10/site-packages/tensorboardX/record_writer.py", line 149, in flush
File "/usr/lib/python3.10/copy.py", line 92, in copy
ImportError: sys.meta_path is None, Python is likely shutting down
2024-07-18 10:32:52.026393: I external/xla/xla/tsl/distributed_runtime/preemption/preemption_sync_manager.cc:141] Preemption sync protocol cancelled by notifier: CANCELLED: Preemption notifier is being deleted.. This is expected during program shutdown.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 14 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-c21cvcq7': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-0364s4ot': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-242c58wv': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-9neplrq_': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-kathvzxa': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-2gpbgbge': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-w2u838hn': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-cjwehx__': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-f9iq947x': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-z3u4n_51': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-n30yaud_': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-1ypluvf6': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-hndkyutu': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/mp-089mqlp9': [Errno 2] No such file or directory
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 25 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_378e39a0': [Errno 2] No such file or directory: '/psm_378e39a0'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_41013556': [Errno 2] No such file or directory: '/psm_41013556'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_c9096684': [Errno 2] No such file or directory: '/psm_c9096684'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_7c888814': [Errno 2] No such file or directory: '/psm_7c888814'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_ae258c84': [Errno 2] No such file or directory: '/psm_ae258c84'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_34e3d518': [Errno 2] No such file or directory: '/psm_34e3d518'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_bebfe91c': [Errno 2] No such file or directory: '/psm_bebfe91c'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_3047051e': [Errno 2] No such file or directory: '/psm_3047051e'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_34c56457': [Errno 2] No such file or directory: '/psm_34c56457'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_f2c937a3': [Errno 2] No such file or directory: '/psm_f2c937a3'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_0984c489': [Errno 2] No such file or directory: '/psm_0984c489'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_dcc9e254': [Errno 2] No such file or directory: '/psm_dcc9e254'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_eb74c162': [Errno 2] No such file or directory: '/psm_eb74c162'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_b1671147': [Errno 2] No such file or directory: '/psm_b1671147'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_d52ab94e': [Errno 2] No such file or directory: '/psm_d52ab94e'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_2d943663': [Errno 2] No such file or directory: '/psm_2d943663'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_eda31725': [Errno 2] No such file or directory: '/psm_eda31725'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_7d97f3a7': [Errno 2] No such file or directory: '/psm_7d97f3a7'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_6d256567': [Errno 2] No such file or directory: '/psm_6d256567'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/usr/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/psm_e0c0e854': [Errno 2] No such file or directory: '/psm_e0c0e854'
warnings.warn('resource_tracker: %r: %s' % (name, e))
The command for running the job is:
python3 MaxText/train.py MaxText/configs/gpt2.yml run_name=gpt2 base_output_directory=gs://maxtext_multihost_job steps=120000 dataset_type=hf hf_path=YUE-FAN/openwebtext_gcp hf_data_dir=data tokenizer_path=EleutherAI/gpt-neox-20b eval_interval=4000 hf_eval_split=validation enable_checkpointing=True eval_batch_num=558 per_device_batch_size=32 eval_per_device_batch_size=32 checkpoint_period=10000 logits_via_embedding=True normalize_embedding_logits=True
I have very limited knowledge of Python multiprocessing, but this seems to be a problem with reading shared memory: the main traceback ends with a FileNotFoundError for the segment '/psm_dcc9e254'. The problem is intermittent; it does not occur on every run. Any assistance would be appreciated. Thanks!
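For reference, here is a minimal sketch of how this class of error can arise with the standard library alone. It is not Grain's code and does not claim to identify the root cause here; it only shows that attaching by name to a POSIX shared memory segment that has already been unlinked raises the same FileNotFoundError seen at the end of the traceback. The function name attach_after_unlink is hypothetical, and the behaviour shown assumes Linux and Python 3.8 or later.

```python
from multiprocessing import shared_memory


def attach_after_unlink() -> bool:
    """Return True if attaching to an already-unlinked segment fails.

    Hypothetical helper for illustration only; assumes a POSIX system.
    """
    # Create a small named segment, as a producer process would.
    seg = shared_memory.SharedMemory(create=True, size=16)
    name = seg.name

    # Release the mapping and remove the name before any consumer attaches.
    seg.close()
    seg.unlink()

    try:
        # A consumer that only knows the name now tries to attach.
        late = shared_memory.SharedMemory(name=name)
    except FileNotFoundError:
        # Same error class as the shm_open failure in the traceback.
        return True

    # Not expected on Linux; clean up if the attach somehow succeeded.
    late.close()
    return False


if __name__ == "__main__":
    print("attach failed after unlink:", attach_after_unlink())
```

If something similar is happening in the data loader, the open question would be what removes the segment before the reader attaches to it.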