Dataset does not work after stopping training
I am trying to use StreamingDataset and StreamingDataLoader, but they almost never work after a crashed or interrupted training run.
A typical scenario:
- Start training on multiple GPUs.
- Cancel training with `Ctrl+C`. Make sure `nvidia-smi` is clean.
- Start training again.
It fails with:

```
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/dataset.py", line 529, in __init__
[rank7]:     self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
[rank7]:     shm = SharedMemory(name, True, len(data))
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
[rank7]:     shm = BuiltinSharedMemory(name, create, size)
[rank7]:   File "/opt/conda/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
[rank7]:     self._fd = _posixshmem.shm_open(
[rank7]: FileExistsError: [Errno 17] File exists: '/000010_locals'
[rank: 7] Child process with PID 37032 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
(the same resource_tracker warning is repeated several more times)
```
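The `FileExistsError` is just a POSIX shared-memory name collision: a segment left over from the killed run blocks the new one. It can be reproduced standalone with Python's own `multiprocessing.shared_memory` (the `demo_locals` name here is made up, standing in for Streaming's `/000010_locals`):

```python
from multiprocessing import shared_memory

# Hypothetical segment name, standing in for Streaming's '/000010_locals'.
name = "demo_locals"

# First creation succeeds, like the first training run.
shm = shared_memory.SharedMemory(name=name, create=True, size=16)

# A second create with the same name fails exactly like the traceback above,
# because the segment still exists under /dev/shm.
try:
    shared_memory.SharedMemory(name=name, create=True, size=16)
except FileExistsError as e:
    print("stale segment blocks creation, errno:", e.errno)

# unlink() removes the segment so the next run can create it fresh.
shm.close()
shm.unlink()
```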
At this point I am not able to do anything. I then tried calling `streaming.base.util.clean_stale_shared_memory()` at the beginning of my script, but that leads to a different error:
```
  File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 61, in <module>
    main()
  File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 55, in main
    streaming.base.util.clean_stale_shared_memory()
  File "/opt/conda/lib/python3.10/site-packages/streaming/base/util.py", line 176, in clean_stale_shared_memory
    destroy_dist = maybe_init_dist()
  File "/opt/conda/lib/python3.10/site-packages/streaming/base/distributed.py", line 128, in maybe_init_dist
    dist.init_process_group(backend=backend, rank=get_rank(), world_size=get_world_size())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 258, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
[rank: 1] Child process with PID 47744 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
```
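In the meantime, the manual workaround I've been using is to delete the leaked segments from `/dev/shm` directly before relaunching, instead of letting `clean_stale_shared_memory()` spin up its own process group. This is my own assumption, not an official Streaming API; the `*_locals` glob pattern is guessed from the error message, so check what `ls /dev/shm` actually shows on your machine. A minimal sketch:

```python
import glob
import os
import tempfile


def clean_streaming_shm(shm_dir: str = "/dev/shm") -> list[str]:
    """Remove leaked segments matching the '*_locals' names from the
    traceback. The pattern is an assumption based on the error message,
    not something documented by Streaming."""
    removed = []
    for path in glob.glob(os.path.join(shm_dir, "*_locals")):
        os.unlink(path)
        removed.append(os.path.basename(path))
    return sorted(removed)


# Demo against a temporary directory standing in for /dev/shm.
demo = tempfile.mkdtemp()
open(os.path.join(demo, "000010_locals"), "w").close()
print(clean_streaming_shm(demo))  # ['000010_locals']
```

Running it once from a single launcher process (before Lightning spawns the ranks) avoids touching `torch.distributed` entirely.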
We've been counting heavily on being able to migrate to Mosaic, but we just can't use it productively. Any ideas what could be wrong?
Environment
- OS: Debian 11
- Hardware: H100 GCP
- Driver Version: 550.90.07
- CUDA Version: 12.4