Make caching location optional.

Open PengWenChen opened this issue 1 year ago • 9 comments

Thanks for your great work! Could the cache path be made configurable instead of always writing to /tmp/streaming? https://github.com/mosaicml/streaming/blob/cb8e872359643fa84782c4e95c496dc66e495c44/streaming/base/dataset.py#L515

PengWenChen avatar Dec 22 '23 09:12 PengWenChen

Ah, you have a point there.

Would it be enough to use the official temp root of your operating system (say, os.path.join(tempfile.gettempdir(), 'streaming') IIRC?)

If not, what's your use case so we may better understand?

knighton avatar Dec 24 '23 07:12 knighton
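The temp-root suggestion above can be sketched in a few lines. This is only an illustration using the standard library, not code from Streaming itself; the helper name `default_streaming_root` is made up for this example.

```python
import os
import tempfile


def default_streaming_root(subdir: str = "streaming") -> str:
    """Return <OS temp dir>/<subdir> instead of a hard-coded /tmp path.

    `default_streaming_root` is a hypothetical helper, not part of Streaming.
    """
    return os.path.join(tempfile.gettempdir(), subdir)


root = default_streaming_root()
os.makedirs(root, exist_ok=True)  # create the directory if it is missing
print(root)
```

Because `tempfile.gettempdir()` consults the `TMPDIR`, `TEMP`, and `TMP` environment variables before falling back to platform defaults, a user without write access to /tmp could point it at a directory they own.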

Hi @knighton. Thanks for your reply! I do not have root access in my working environment. If one of my colleagues runs your package first and it generates cache files such as /tmp/streaming/000000_barrier_filelock or /000000_locals, he or she cannot change the read and write permissions of those files. Other users who want to run the same scripts then cannot access the files written under the root directory, and a PermissionError is raised here. Also, writing files under the root directory is not allowed in my workspace :(.

I tried changing self._filelock_root to a local directory under my account. However, the process then gets stuck at the start of training. So I would like to ask whether I can modify self._filelock_root and, if so, what else I should modify. Thank you.

PengWenChen avatar Dec 25 '23 01:12 PengWenChen

Hi @knighton. I found another root path here: https://github.com/mosaicml/streaming/blob/main/streaming/base/stream.py#L166

After modifying both self._filelock_root in dataset.py and root in stream.py, the scripts execute successfully! But I would still like to confirm with you that this is correct. Is there anything else I need to change? Thank you.

PengWenChen avatar Dec 25 '23 02:12 PengWenChen

Hi there! The shared memory does not seem to be cleaned up properly by the first user (or by other users), and then this error occurs: Permission denied: '/00000_locals'.

I also found a closed issue, https://github.com/mosaicml/streaming/issues/429, which describes the same problem I hit here. It does not seem to have been solved.

PengWenChen avatar Dec 26 '23 02:12 PengWenChen

> After modifying both self._filelock_root in dataset.py and root in stream.py, the scripts execute successfully! But I would still like to confirm with you that this is correct. Is there anything else I need to change? Thank you.

I think you've fully gotten it (which is further supported by it running successfully), and also you are perfectly safe in making those changes you mentioned. Probably wise to have asked, as StreamingDataset has grown a bit complicated...

As you probably noticed, the temp root path in stream.py only comes into play when you do not provide the local argument, in which case it randomly generates a local directory for you.

> The shared memory does not seem to be cleaned up properly by the first user (or by other users), and then this error occurs: Permission denied: '/00000_locals'.

Leftover shared memory between runs that you have to clean up yourself is unfortunately something that happens when Python processes using shared memory die badly. We have gradually improved on this front over time, but it's far from solved completely. In the meantime, you can manually clear any stale shared memory objects by calling this method: https://github.com/mosaicml/streaming/blob/main/streaming/base/util.py#L169
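To illustrate what clearing a stale shared memory object involves, here is a minimal sketch using only Python's standard `multiprocessing.shared_memory` module. It is not the Streaming method linked above, and `unlink_if_exists` is a hypothetical helper name chosen for this example.

```python
from multiprocessing import shared_memory


def unlink_if_exists(name: str) -> bool:
    """Unlink the named shared memory segment; return True if one was removed.

    Hypothetical helper for illustration, not part of Streaming.
    """
    try:
        # Attach to an existing segment (create defaults to False).
        # A PermissionError propagates from here if another user owns it.
        shm = shared_memory.SharedMemory(name=name)
    except FileNotFoundError:
        return False  # nothing stale under this name
    shm.close()   # drop this process's mapping
    shm.unlink()  # remove the segment from the system
    return True
```

A call such as `unlink_if_exists("000000_locals")` would only succeed for segments the current user is allowed to open, which is the same cross-user limitation discussed in this thread.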

If you meant you are already calling that method and still hitting that permissions problem, that all relates to the fact that Streaming was originally built to be run on ephemeral training jobs as root. On startup, StreamingDataset replicas do some registration and safety checks to prevent different runs from clobbering the same dirs. Unfortunately, this was built for a world where everyone can write files to shared memory and these files can be read back by anyone.

As the original author of that difficult piece of nonsense, let me tell you that combined with patching filelock root, I think if you simply disable the checks and are just very careful about concurrent training jobs (and zombie processes thereof), you will be able to run this as non-root just fine.

knighton avatar Dec 26 '23 07:12 knighton

Specifically, you could remove this bit of code: https://github.com/mosaicml/streaming/blob/cb8e872359643fa84782c4e95c496dc66e495c44/streaming/base/dataset.py#L509-L514

And replace it with something like

self._shm_prefix_int = 8  # I'm feeling lucky

If on Mac OSX, shmem paths need to be quite short. I believe Linux and the like are not so limited.

knighton avatar Dec 26 '23 07:12 knighton

Hi @Skylion007. Thanks for your reply. I haven't yet tried this method to force-destroy the leftover shared memory. If I encounter the same problem in the future, I will give it a try. https://github.com/mosaicml/streaming/blob/main/streaming/base/util.py#L169

For now I can run without the shared memory permission-denied issue mentioned above by changing the constant names directly (for example, adding an ID to all the constants, like localsPW). It is a very aggressive approach, but it works... If there are any risks with this approach, please let me know! https://github.com/mosaicml/streaming/blob/main/streaming/base/constant.py

And thanks for the advice to remove that code. I will try it.

PengWenChen avatar Dec 26 '23 08:12 PengWenChen

Is your /tmp or equivalent directory world-readable and writeable? I'm thinking if we switched to files for registration and ensured 777 perms, it should be fine cross-user?

> If there are any risks with this approach, please let me know!

You are clear.

knighton avatar Dec 26 '23 08:12 knighton
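The idea of file-based registration with open permissions, floated above, might look roughly like the following. This is a sketch under assumptions: the helper names and the `registry` directory are invented for illustration, and Streaming does not necessarily work this way. Plain files get mode 0o666 here, since the execute bit is only needed on the directory.

```python
import os
import stat
import tempfile


def make_shared_dir(path: str) -> None:
    """Create a directory any user can read, write, and enter (mode 0o777)."""
    os.makedirs(path, exist_ok=True)
    os.chmod(path, 0o777)  # explicit chmod, because makedirs is subject to umask


def touch_shared_file(path: str) -> None:
    """Create an empty file any user can read and write (mode 0o666)."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
    os.close(fd)
    os.chmod(path, 0o666)  # umask may have stripped bits at creation time


# Demo in a throwaway directory; the names below are hypothetical.
demo_dir = os.path.join(tempfile.mkdtemp(), "registry")
make_shared_dir(demo_dir)
demo_file = os.path.join(demo_dir, "000000_locals")
touch_shared_file(demo_file)
print(oct(stat.S_IMODE(os.stat(demo_file).st_mode)))
```

Note that `os.chmod` only succeeds for the file's owner (or root), so the permissions have to be set by whichever user creates the file first.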

Regarding the path modifications: yes. My path is readable and writable, and I can also change its permissions with chmod. https://github.com/mosaicml/streaming/issues/546#issuecomment-1868672842

As for the /000000_locals issue, I could not change the directory where the shared memory (created through Python multiprocessing) is stored, so I changed the names instead: I added an ID that refers to me to all the constant names.

PengWenChen avatar Dec 26 '23 09:12 PengWenChen
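The renaming workaround described above amounts to making every shared resource name unique per user. A minimal sketch of that idea follows, assuming names of the form seen in this thread (`000000_locals`); the constant and the helper `per_user_name` are hypothetical stand-ins, not Streaming's actual code.

```python
import os

LOCALS = "locals"  # hypothetical stand-in for a name constant


def per_user_name(base: str, prefix_int: int = 0) -> str:
    """Build a shared resource name that embeds the current user's numeric ID.

    Hypothetical helper; os.getuid() is available on Unix only.
    """
    uid = os.getuid()
    # Keep the result short: macOS limits shared memory name length.
    return f"{prefix_int:06}_{base}_{uid}"


print(per_user_name(LOCALS))
```

Deriving the suffix from the user ID rather than hard-coding initials keeps two users from colliding without each of them having to edit the constants by hand. It does not protect one user's concurrent runs from each other, so the caution about concurrent jobs raised earlier in the thread still applies.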