'File exists: "/000000_locals"' when integrated with DeepSpeed training scripts
Environment
- OS: Ubuntu 22.04.2 LTS
- Hardware (GPU, or instance type): A800
To reproduce
Steps to reproduce the behavior:
- pip install deepspeed
- deepspeed train.py ... (training arguments are omitted; a minimal sketch of such a script follows)
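For context, a minimal train.py along these lines reproduces the setup shown in the traceback below; the Stream locations and epoch size are placeholders I made up for illustration, not values from the original report.

from streaming import Stream, StreamingDataset

def load_streams():
    # Placeholder shard locations; the original report uses a project-specific data.py.
    streams = [Stream(remote='s3://my-bucket/shards', local='/tmp/cache/shards')]
    # Mirrors the call in the traceback: a StreamingDataset built from a list of Streams.
    return StreamingDataset(streams=streams, epoch_size=10000, allow_unsafe_types=True)

if __name__ == '__main__':
    dataset = load_streams()
    print(f'dataset size: {len(dataset)}')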
Expected behavior
[2024-07-08 15:29:47]   File "/mnt/data/weihan/projects/cepe/data.py", line 226, in load_streams
[2024-07-08 15:29:47]     self.encoder_decoder_dataset = StreamingDataset(streams=streams, epoch_size=self.epoch_size, allow_unsafe_types=True)
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/dataset.py", line 513, in __init__
[2024-07-08 15:29:47]     self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
[2024-07-08 15:29:47]     shm = SharedMemory(name, True, len(data))
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
[2024-07-08 15:29:47]     shm = BuiltinSharedMemory(name, create, size)
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
[2024-07-08 15:29:47]     self._fd = _posixshmem.shm_open(
[2024-07-08 15:29:47] FileExistsError: [Errno 17] File exists: '/000000_locals'
Additional context
Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?
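A minimal sketch of where the suggested call would go; the surrounding script structure and the dataset arguments are assumptions for illustration, not taken from the original report.

from streaming import StreamingDataset
from streaming.base.util import clean_stale_shared_memory

def main():
    # Clear any shared-memory blocks (such as /000000_locals) left behind by a
    # previous run that crashed or was killed, before any StreamingDataset is built.
    clean_stale_shared_memory()

    # Hypothetical dataset arguments, standing in for the reporter's data.py setup.
    dataset = StreamingDataset(local='/tmp/cache', remote='s3://my-bucket/data',
                               allow_unsafe_types=True)
    # ... build dataloaders and run training as usual ...

if __name__ == '__main__':
    main()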
Hey @snarayan21! :) I've tried this to no avail. I also downgraded the mosaicml and deepspeed versions. Let me know if you have any other suggestions. I'm using A100s.
I tried calling streaming.base.util.clean_stale_shared_memory() at the start of the training job as suggested, but it didn't work.
I solved it by setting the environment variable LOCAL_WORLD_SIZE=$NUM_GPU when launching the job; a minimal sketch follows.
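The fix as described is just an environment variable on the launch command, e.g. LOCAL_WORLD_SIZE=$NUM_GPU deepspeed train.py ... The Python sketch below is an assumed in-script equivalent, on the understanding that Streaming's rank/world detection reads LOCAL_WORLD_SIZE and that it must be visible before the StreamingDataset is constructed.

import os
import torch

# Assumed equivalent of exporting LOCAL_WORLD_SIZE=$NUM_GPU: tell Streaming how
# many training processes run on this node, so that only local rank 0 creates
# the shared-memory registry and the other ranks attach to it instead of hitting
# FileExistsError. Must run before StreamingDataset is instantiated. Using the
# visible CUDA device count as the per-node GPU count is an assumption; adjust
# it if your processes-per-node differs.
os.environ.setdefault('LOCAL_WORLD_SIZE', str(torch.cuda.device_count()))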