ompi icon indicating copy to clipboard operation
ompi copied to clipboard

shmem/posix: change backing filename structure to reduce collisions

Open benmenadue opened this issue 5 years ago • 14 comments

The POSIX shmem component currently uses a filename of the form /open_mpi.nnnn, where nnnn is the "try" count, running from 0 to OPAL_SHMEM_POSIX_MAX_ATTEMPTS - 1. OPAL_SHMEM_POSIX_MAX_ATTEMPTS is currently 128. It tries each of these possibilities in turn until it finds an unused one or runs out (at which point the component disqualifies itself).

Our new HPC system has 48 cores per node, and so this small number is easily exhausted; you just need two all-core runs to not clean up after themselves and you've run out of potential filenames. Or, in the case of our login nodes, a couple of users to accidentally launch there.

Moreover, even on a clean system, at least one process would need to try at least 48 times to find an available filename.

This patch changes the structure of the filenames to include the user's uid and the process's pid, which significantly reduces the chance of a collision, both between the ranks of this MPI job and with any other MPI job that might be running / have run on the same node.

benmenadue avatar Nov 29 '19 04:11 benmenadue

Can one of the admins verify this patch?

ompiteam-bot avatar Nov 29 '19 04:11 ompiteam-bot

@bosilca I'm in two minds about that. Yes, delegating up might make it cleaner, but it's really a component-specific property (e.g. other shmem components don't need a filename, and the properties of the filename it uses are unique to this component). And if you don't tag with the PID of the process itself, only with the jobid / daemon pid, you will still get collisions between processes within that job.

benmenadue avatar Dec 02 '19 00:12 benmenadue

ok to test

awlauria avatar Sep 17 '20 17:09 awlauria

@benmenadue @bosilca What's the fate for this PR? it's quite old.

gpaulsen avatar Jun 30 '21 14:06 gpaulsen

Not perfect, but definitively better than before. Good to go from my perspective.

bosilca avatar Jun 30 '21 14:06 bosilca

This PR has conflicts that need to be resolved.

jsquyres avatar Nov 02 '21 15:11 jsquyres

I've rebased the changes onto current master.

benmenadue avatar Nov 04 '21 23:11 benmenadue

@bosilca @jsquyres Can we merge this now? Looks ready to go.

gpaulsen avatar Jan 14 '22 16:01 gpaulsen

PIDs aren't unique when you consider the full uptime of a system - they will eventually be reused (e.g. the login nodes of our cluster fairly regularly rollover the pid counter). The problem we were having is that there were left over files in /dev/shmem from previous, failed runs (there's no opportunity to clean them up in the event of a hard crash).

It's probably unlikely, but certainly not impossible, for there to be a collision here (especially when you consider that an MPI job will typically comprise a number of processes); also including the uid makes it that much more unlikely.

It's also possible for two distinct, concurrent processes to see the same pid when you consider pid namespaces, such as in containers. While it's likely that each would have its own /dev/shm mount, that's not a requirement on using namespaces. And actually, in this case even including the uid might not be enough - consider the case where each process in an MPI job is in its own pid namespace; each process might have the same uid and pid.

I do agree that there's probably a better way of managing this than just minimising the chance of collision, but I haven't thought what that might be yet.

benmenadue avatar Jan 15 '22 03:01 benmenadue

It was not clear from the comments what you were trying to achieve. Now that I understand better, it seems possible, but extremely unlikely to hit the scenario you describe. However, I don't think the correct approach is to build the filename from well-defined entities.

After looking at the code, I think the real problem is in the way shmem support is implemented, it generates a new file name for the segment while completely ignoring the file name provided by the upper layer (filename that has been carefully crafted to provide unicity). Thus, I suggest to replace the approach implemented here with something similar to what the mmap module is doing, either using the filename provided by the caller, or using get_uniq_file_name to generate one that is unique (we can move the get_uniq_file_name function in a common place, aka base).

bosilca avatar Jan 17 '22 20:01 bosilca

Yes, that would be a much better approach. I didn't realise there was already a unique name being passed down from the upper layers. Just need to keep in mind that apparently some platforms have a limit on how long that name can be. I'll have a look around the code and see what I can come up with.

Collisions with the original code are much more likely, since there's a fixed set of 128 potential names. It's all-but-guaranteed when you consider the high core counts of newer machines (each process needs its own unique name). So did you want to merge this for now as an interim measure until we come up with a better solution? Otherwise, I'm happy to just keep carrying it as a local patch to help for our system in particular.

benmenadue avatar Jan 17 '22 20:01 benmenadue

Yes, you need to manipulate the filename string a little, because it might contain separators and/or might be too long. But is can give you a nice base for uniqueness.

Moving get_uniq_file_name out of the mmap component and using it in the shmem component should be a quick patch, a solution that not only addresses your issue but also improve the consistency of the code. I would hold on this PR until then.

bosilca avatar Jan 17 '22 20:01 bosilca

@benmenadue, any interest in resurrecting this PR? looks pretty close to done, though some changes still requested.

gpaulsen avatar Aug 31 '22 20:08 gpaulsen

@gpaulsen I completely forgot about making those changes, sorry. I had a look at the code and can definitely see how to improve on this based on those suggestions, and it shouldn't be too difficult at all. I'll do that tomorrow and update the PR.

benmenadue avatar Sep 06 '22 06:09 benmenadue