
Dataset does not work after stopping training

Open · gluonfield opened this issue 5 months ago · 1 comment

I am trying to use StreamingDataset and StreamingDataLoader, but they almost never work after a crashed or stopped training run.
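
For reference, the dataset/loader setup is roughly the following sketch (the bucket path, cache directory, and batch size are placeholders, not my exact config):

from streaming import StreamingDataset, StreamingDataLoader

# placeholder paths; shuffle/batch_size mirror a typical multi-GPU run
dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',
    local='/tmp/streaming_cache',
    shuffle=True,
    batch_size=32,
)
loader = StreamingDataLoader(dataset, batch_size=32, num_workers=8)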

A typical scenario:

  1. Start training on multiple GPUs
  2. Cancel training with Ctrl+C and confirm that nvidia-smi shows no leftover processes
  3. Start training again

It fails with:

[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/dataset.py", line 529, in __init__
[rank7]:     self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
[rank7]:     shm = SharedMemory(name, True, len(data))
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
[rank7]:     shm = BuiltinSharedMemory(name, create, size)
[rank7]:   File "/opt/conda/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
[rank7]:     self._fd = _posixshmem.shm_open(
[rank7]: FileExistsError: [Errno 17] File exists: '/000010_locals'
[rank: 7] Child process with PID 37032 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
(... the warning above repeats for each remaining rank; output truncated)
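
On Linux the stale segments show up as plain files under /dev/shm, so between runs I can at least clear them by hand. A rough sketch (the NNNNNN_* naming is inferred from the traceback above, and this is only safe while no training processes are alive):

import glob
import os

# remove leftover Streaming shared-memory segments, e.g. /dev/shm/000010_locals
for path in glob.glob('/dev/shm/[0-9]*_*'):
    print(f'removing stale segment: {path}')
    os.unlink(path)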

Manual cleanup aside, at this point I cannot make any progress. I then tried adding

streaming.base.util.clean_stale_shared_memory()

at the beginning of my script; however, this leads to a different error:

  File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 61, in <module>
    main()
  File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 55, in main
    streaming.base.util.clean_stale_shared_memory()
  File "/opt/conda/lib/python3.10/site-packages/streaming/base/util.py", line 176, in clean_stale_shared_memory
    destroy_dist = maybe_init_dist()
  File "/opt/conda/lib/python3.10/site-packages/streaming/base/distributed.py", line 128, in maybe_init_dist
    dist.init_process_group(backend=backend, rank=get_rank(), world_size=get_world_size())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 258, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
[rank: 1] Child process with PID 47744 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
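
The only other workaround I can think of is to run the cleanup as a one-off, single-process step before the multi-GPU launch, so that Streaming's maybe_init_dist() never attempts the TCPStore rendezvous. A sketch (the WORLD_SIZE handling is my assumption based on the traceback above):

# clean_shm.py -- run once, in a single process, before launching training
import os

# with WORLD_SIZE unset, maybe_init_dist() should take the non-distributed
# path and never try to bind the master port
os.environ.pop('WORLD_SIZE', None)

from streaming.base.util import clean_stale_shared_memory

clean_stale_shared_memory()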

We've been relying heavily on being able to migrate to Mosaic, but we just can't use it productively. Any ideas what could be wrong?

Environment

  • OS: Debian 11
  • Hardware: H100 GCP
  • Driver Version: 550.90.07
  • CUDA Version: 12.4

gluonfield · Sep 15 '24 21:09