
'File exists: "/00000_locals"' when integrated with deepspeed training scripts

Clement25 opened this issue on Jul 8, 2024 · 4 comments

Environment

  • OS: Ubuntu 22.04.2 LTS
  • Hardware (GPU, or instance type): A800

To reproduce

Steps to reproduce the behavior:

  1. pip install deepspeed
  2. deepspeed train.py ... (training arguments are omitted)

Actual behavior

[2024-07-08 15:29:47]   File "/mnt/data/weihan/projects/cepe/data.py", line 226, in load_streams
[2024-07-08 15:29:47]     self.encoder_decoder_dataset = StreamingDataset(streams=streams, epoch_size=self.epoch_size, allow_unsafe_types=True)
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/dataset.py", line 513, in __init__
[2024-07-08 15:29:47]     self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
[2024-07-08 15:29:47]     shm = SharedMemory(name, True, len(data))
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
[2024-07-08 15:29:47]     shm = BuiltinSharedMemory(name, create, size)
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
[2024-07-08 15:29:47]     self._fd = _posixshmem.shm_open(
[2024-07-08 15:29:47] FileExistsError: [Errno 17] File exists: '/000000_locals'

Additional context

Clement25 · Jul 8, 2024

Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?
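
For reference, a minimal sketch of where that call could go, modeled on the load_streams method from the traceback above (streams and epoch_size here are placeholders standing in for the actual arguments in data.py):

import streaming.base.util as util
from streaming import StreamingDataset

def load_streams(streams, epoch_size):
    # Remove shared-memory segments (e.g. /000000_locals) left behind by a
    # previous or crashed run. Must run before any StreamingDataset is built.
    util.clean_stale_shared_memory()
    return StreamingDataset(streams=streams,
                            epoch_size=epoch_size,
                            allow_unsafe_types=True)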

snarayan21 · Jul 9, 2024

Hey @snarayan21! :) I've tried this to no avail. I also downgraded the mosaicml and deepspeed versions. Let me know if you have any other suggestions. I'm using A100s.

sukritipaul5 · Jul 9, 2024

> Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?

I tried that, but it didn't work.

Clement25 · Jul 10, 2024

> Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?

I solved it by setting the environment variable LOCAL_WORLD_SIZE=$NUM_GPU (the number of GPUs per node) when launching.
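
For example, a minimal sketch of the same fix applied inside the training script instead of on the command line, assuming it runs before any StreamingDataset is constructed (the per-node GPU count from torch stands in for $NUM_GPU):

import os
import torch

# LOCAL_WORLD_SIZE tells streaming how many ranks share this node; the deepspeed
# launcher may not export it, so set it explicitly before building the dataset.
os.environ.setdefault("LOCAL_WORLD_SIZE", str(torch.cuda.device_count()))

Setting it on the launch command itself, e.g. LOCAL_WORLD_SIZE=$NUM_GPU deepspeed train.py ..., works the same way.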

Clement25 · Jul 19, 2024