distributed Putting `dask-worker-space` in `/tmp` is unfriendly on shared systems

Since d59500ea97c02753eac9d42951d9c4a5d4f17685 (#6658), launching workers can fail on shared systems if someone else happens to be running at the same time (since they will have created /tmp/dask-worker-space and we won't have write permissions).

A friendlier approach would probably be to use tempfile.mkdtemp to create the directory. This has the disadvantage that if a cluster fails and doesn't clean up, a new run would not clean out the old lock directory.
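
A minimal sketch of the `tempfile.mkdtemp` idea, assuming a hypothetical helper named `make_worker_space` (this is not the code in distributed). `mkdtemp` picks a unique name and creates the directory so that only the creating user can access it, which means two users on the same machine cannot collide:

```python
import os
import tempfile


def make_worker_space(root=None):
    """Create a fresh, private scratch directory and return its path.

    Hypothetical helper for illustration; not part of distributed.
    """
    # Fall back to the platform's temporary directory (often /tmp).
    root = root or tempfile.gettempdir()
    # mkdtemp chooses a unique name and creates the directory readable,
    # writable and searchable only by the creating user.
    return tempfile.mkdtemp(prefix="dask-worker-space-", dir=root)
```

Because every call gets a new name, a later run would not find (or clean up) the directory left behind by a crashed cluster, which is the disadvantage noted above.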

Jul 20 '22 15:07 wence-

It is worth noting that most Linux filesystems provide /tmp via tmpfs, which effectively provides a RAM disk. As a result data that is spilled from memory to tmpfs doesn't actually leave memory. This could cause a lot of churn by Workers trying to spill and free memory without having a meaningful impact on memory pressure.

Jul 20 '22 17:07 jakirkham

@crusaderky any thoughts here? Pinging you as it looks like you pushed https://github.com/dask/distributed/pull/6658 over the finish line

Aug 17 '22 18:08 jrbourbeau

If this causes problems, we can also revert https://github.com/dask/distributed/pull/6658

I still believe CWD is not a good choice for these things but it might be the lesser evil

Aug 18 '22 08:08 fjetter

Yeah was trying to think of what a better default would be, but agree it is tricky. Sometimes the home directory is shared via NFS, which makes it a bad scratch space location.

Some of this is also a documentation issue (https://github.com/dask/distributed/issues/4000). Currently we have this, but it shows how to configure a single worker, whereas users would likely benefit more from knowing about setting "temporary-directory" in dask.config or using the DASK_TEMPORARY_DIRECTORY environment variable to set this as part of shell or container startup. Maybe these options could be listed here?
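
As a rough illustration of the environment-variable route, here is a sketch that prepares an environment for worker processes. The helper name `worker_environment` and the path `/scratch/my-job` are placeholders chosen for this example; only the variable name comes from the discussion above.

```python
import os


def worker_environment(scratch_dir, base_env=None):
    """Return a copy of the environment with Dask's temporary directory set.

    `worker_environment` is a hypothetical helper name; the variable
    DASK_TEMPORARY_DIRECTORY is the one mentioned in this thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DASK_TEMPORARY_DIRECTORY"] = scratch_dir
    return env


# Pass the result as `env=` when starting worker processes, for example
# with subprocess.Popen. The in-process equivalent would be
# dask.config.set({"temporary-directory": scratch_dir}).
env = worker_environment("/scratch/my-job", base_env={"PATH": "/usr/bin"})
```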

Aug 18 '22 09:08 jakirkham

Yes, I think the OP about permissions in a shared FS could be handled by better documentation. I think the shared system is a sufficiently rare case that we could point to the documentation and ask users to configure temporary-directory in their dask/distributed.yaml.

What worries me is your comment about the memory mapping since this obviously defeats the purpose of spilling and something like this should not be the default.

Is there a pythonic way to infer whether or not a directory is memory mapped? If so we could try to use /tmp first and fall back to cwd if that's the case.
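
One possible approach on Linux, sketched under assumptions: read /proc/mounts, find the mount point that contains the directory, and look at its filesystem type. The function names below are made up for this example, the check does not resolve symlinks, and it says nothing about whether swap is configured.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts (Linux only). Mount
    points containing spaces are escaped there and are not handled here.
    """
    path = os.path.abspath(path)  # note: does not resolve symlinks
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or (path + "/").startswith(prefix)
        # Prefer the longest matching mount point; on ties, later
        # entries win because they shadow earlier mounts.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_memory_backed(path):
    """Best-effort check for tmpfs/ramfs; reads /proc/mounts."""
    with open("/proc/mounts") as f:
        return filesystem_type(path, f.read()) in {"tmpfs", "ramfs"}
```

A fallback along the lines suggested here could then call `is_memory_backed("/tmp")` and choose the working directory when it returns true.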

Aug 18 '22 10:08 fjetter

launching workers can fail on shared systems if someone else happens to be running at the same time (since they will have created /tmp/dask-worker-space and we won't have write permissions).

Right; if writing to /tmp it should be /tmp/dask-worker-space-$USER.
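
For illustration only, a per-user path could be built like this. `per_user_worker_space` is a hypothetical name, and `getpass.getuser()` stands in for `$USER`:

```python
import getpass
import os


def per_user_worker_space(root="/tmp", user=None):
    """Build a per-user scratch path such as /tmp/dask-worker-space-alice.

    Hypothetical helper; `user` defaults to the current login name.
    """
    user = user or getpass.getuser()
    return os.path.join(root, f"dask-worker-space-{user}")
```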

A friendlier approach would probably be to use tempfile.mkdtemp to create the directory.

We're already doing that for the directories inside dask-worker-space.

It is worth noting that most Linux filesystems provide /tmp via tmpfs, which effectively provides a RAM disk.

Not quite - it's a ramdisk backed by the swap file.

Sometimes the home directory is shared via NFS

Or on cloud services it can be an EBS or equivalent. NFS also causes a lot of problems with locks - the test suite used to fail a lot before #6658 if your workspace was on NFS.

I still believe CWD is not a good choice for these things but it might be the lesser evil

I don't think we should revert #6658, due to the test suite. If we do want to go back to CWD, we should

leave Nanny and Worker as they are (default to /tmp)
change dask-worker to default to CWD, e.g. always pass the local_directory parameter
change all tests that start dask-worker to temporarily move CWD to /tmp through a fixture
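
The last step could look roughly like the sketch below. It is written as a plain context manager with a made-up name, `working_directory`, so it runs without any test framework; in a pytest suite the same effect is available through `monkeypatch.chdir`.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def working_directory(path):
    """Temporarily change the current working directory to `path`."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield path
    finally:
        # Always restore the original directory, even if the body raises.
        os.chdir(previous)


# Example: run code with CWD set to the system temporary directory.
with working_directory(tempfile.gettempdir()):
    inside = os.getcwd()
```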

Aug 18 '22 10:08 crusaderky

I don't think we should revert https://github.com/dask/distributed/pull/6658, due to the test suite. If we do want to go back to CWD, we should

leave Nanny and Worker as they are (default to /tmp)
change dask-worker to default to CWD, e.g. always pass the local_directory parameter
change all tests that start dask-worker to temporarily move CWD to /tmp through a fixture

I like this proposal but I don't know how this impacts the various deployment tools cc @jacobtomlinson

Aug 18 '22 11:08 fjetter

It is worth noting that most Linux filesystems provide /tmp via tmpfs, which effectively provides a RAM disk.

Not quite - it's a ramdisk backed by the swap file.

That means it would actually swap to disk if memory pressure is high? That would be fine, I guess.

Aug 18 '22 11:08 fjetter

It is worth noting that most Linux filesystems provide /tmp via tmpfs, which effectively provides a RAM disk.

Not quite - it's a ramdisk backed by the swap file.

That means it would actually swap to disk if memory pressure is high? That would be fine, I guess.

Assuming there is a swap file configured.

Aug 18 '22 12:08 wence-

Right; if writing to /tmp it should be /tmp/dask-worker-space-$USER.

This seems like a reasonable default (even in the single-user case) which would avoid the initial problem certainly. What are the downsides compared to sticking with dask-worker-space?

Aug 18 '22 12:08 wence-

Not quite - it's a ramdisk backed by the swap file.

Assuming there is a swap file configured.

If you mounted tmpfs on /tmp but didn't mount a swap partition, your OS is poorly set up and it is not an application problem.

Aug 18 '22 13:08 crusaderky

leave Nanny and Worker as they are (default to /tmp)
change dask-worker to default to CWD, e.g. always pass the local_directory parameter

@crusaderky I'm not excited about having different defaults here. Some deployment tools use dask-worker and some use dask-spec which invokes the Nanny, so I wouldn't feel great about the inconsistency.

/tmp/dask-worker-space-$USER

@crusaderky This seems reasonable, although it may break down on containerized systems where $USER resolves to the same thing like root or jovyan for everyone.

I think the shared system is a sufficiently rare case

@fjetter Is it? This seems very common on HPC.

Aug 18 '22 13:08 jacobtomlinson

If you mounted tmpfs on /tmp but didn't mount a swap partition, your OS is poorly set up and it is not an application problem.

While a niche use case for dask (I suspect), this is unfortunately often the default in HPC systems where you don't have control over the OS setup.

Aug 18 '22 13:08 wence-

@crusaderky This seems reasonable, although it may break down on containerized systems where $USER resolves to the same thing like root or jovyan for everyone.

If it's inside the container things should be fine, no?

Aug 18 '22 13:08 wence-

While a niche use case for dask (I suspect), this is unfortunately often the default in HPC systems where you don't have control over the OS setup.

I wouldn't call HPC a niche Dask case.

If it's inside the container things should be fine, no?

Often /tmp is mounted in.

Aug 18 '22 13:08 jacobtomlinson

FWIW, the user survey from last year shows that the 2nd (1st is ssh) most popular way of launching dask is with HPC tooling:

https://blog.dask.org/images/2021_survey/2021_survey_31_0.png
https://blog.dask.org/2021/09/15/user-survey

@jrbourbeau just opened https://github.com/dask/community/issues/269 to get stats for this year

Aug 18 '22 14:08 quasiben

If it's inside the container things should be fine, no?

Often /tmp is mounted in.

Holy loss of segregation, Batman! Jokes aside, won't all the containers share the same user on the host VM?

Aug 18 '22 15:08 crusaderky

It depends on the HPC system. If it is using Singularity or something similar, the usernames will be correct (and therefore unique). But if the container runtime does user namespacing like Docker does, then all users will have the same username.

So if they all have the same username, and the same /tmp then /tmp/dask-worker-space-$USER won't help.

Aug 18 '22 15:08 jacobtomlinson

/tmp/dask-worker-space-$UUID then.

Back to tmpfs vs. cwd: I'm very concerned about jupyter notebooks stored on NFS. Such notebooks will spill to NFS when they start a LocalCluster. I think that for the use case of the devbox tmpfs is a better default than cwd. On the flip side, I think it's reasonable to ask users that are doing production deployments to think their parameters through.

Aug 18 '22 15:08 crusaderky

With user namespacing the UUID will also be the same for all users so I'm afraid that doesn't help.

I agree with the concerns about it being CWD, but the concerns around tmpfs are also valid. Neither is a particularly good option for certain (large) groups of our users. I commonly see HPC users set this option to somewhere like /scratch, I rarely see cloud/kubernetes users set this, so maybe /tmp and better documentation is the lesser of two evils here.

An alternative would be to always warn if the user doesn't explicitly set this and try and encourage this to always be set. Or maybe just warn if there are existing dask worker space folders in the default location?
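
To make the warning idea concrete, here is a sketch under assumptions: `choose_local_directory` is a hypothetical function, not something in distributed, and the message wording is invented.

```python
import glob
import os
import warnings


def choose_local_directory(configured, default_root="/tmp"):
    """Return the directory to use, warning when falling back to a default."""
    if configured is not None:
        return configured
    # Look for worker space folders that already exist in the default location.
    leftovers = glob.glob(os.path.join(default_root, "dask-worker-space*"))
    message = f"temporary-directory is not set; using {default_root!r}."
    if leftovers:
        message += f" Found {len(leftovers)} existing worker space(s) there."
    warnings.warn(message, UserWarning, stacklevel=2)
    return default_root
```

This variant always warns when nothing is configured and adds detail when leftovers exist; warning only in the second case would be a one-line change.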

Aug 18 '22 16:08 jacobtomlinson

With user namespacing the UUID will also be the same for all users so I'm afraid that doesn't help.

I meant uuid.uuid4(), not $UID. Sorry for the confusion.
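
A small sketch of what that naming could look like; `unique_worker_space` is a hypothetical name used only here:

```python
import os
import uuid


def unique_worker_space(root="/tmp"):
    """Build a path with a random suffix; the directory is not created."""
    return os.path.join(root, f"dask-worker-space-{uuid.uuid4().hex}")
```

Since the suffix is random rather than derived from the user, it stays unique even when every container reports the same username.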

Aug 18 '22 17:08 crusaderky

@fjetter Is it? This seems very common on HPC.

Could probably be caught by fixing this once in dask-jobqueue. I doubt that there are many HPC users with a custom solution. I was mostly thinking about users that just spin up a cluster using the default settings, i.e. without any configuration or any wrapper, e.g. LocalCluster on a big VM or a laptop, etc.

Just to be clear, I don't have a strong preference here. I introduced this change because I was annoyed about CWD but wasn't aware of any other implications. I'm cool with either solution

Aug 18 '22 17:08 fjetter

HPC users commonly use dask-jobqueue, dask-ssh, dask-mpi and dask-gateway. So if we wanted to handle it downstream those would be the places to do it.

Aug 19 '22 10:08 jacobtomlinson
