ompi
ompi copied to clipboard
shmem/posix: change backing filename structure to reduce collisions
The POSIX shmem component currently uses a filename of the form /open_mpi.nnnn
, where nnnn
is the "try" count, running from 0
to OPAL_SHMEM_POSIX_MAX_ATTEMPTS - 1
. OPAL_SHMEM_POSIX_MAX_ATTEMPTS
is currently 128. It tries each of these possibilities in turn until it finds an unused one or runs out (at which point the component disqualifies itself).
Our new HPC system has 48 cores per node, and so this small number is easily exhausted; you just need two all-core runs to not clean up after themselves and you've run out of potential filenames. Or, in the case of our login nodes, a couple of users to accidentally launch there.
Moreover, even on a clean system, at least one process would need to try at least 48 times to find an available filename.
This patch changes the structure of the filenames to include the user's uid and the process's pid, which significantly reduces the chance of a collision, both between the ranks of this MPI job and with any other MPI job that might be running / have run on the same node.
Can one of the admins verify this patch?
@bosilca I'm in two minds about that. Yes, delegating up might make it cleaner, but it's really a component-specific property (e.g. other shmem components don't need a filename, and the properties of the filename it uses are unique to this component). And if you don't tag with the PID of the process itself, only with the jobid / daemon pid, you will still get collisions between processes within that job.
ok to test
@benmenadue @bosilca What's the fate for this PR? it's quite old.
Not perfect, but definitively better than before. Good to go from my perspective.
This PR has conflicts that need to be resolved.
I've rebased the changes onto current master.
@bosilca @jsquyres Can we merge this now? Looks ready to go.
PIDs aren't unique when you consider the full uptime of a system - they will eventually be reused (e.g. the login nodes of our cluster fairly regularly rollover the pid counter). The problem we were having is that there were left over files in /dev/shmem
from previous, failed runs (there's no opportunity to clean them up in the event of a hard crash).
It's probably unlikely, but certainly not impossible, for there to be a collision here (especially when you consider that an MPI job will typically comprise a number of processes); also including the uid makes it that much more unlikely.
It's also possible for two distinct, concurrent processes to see the same pid when you consider pid namespaces, such as in containers. While it's likely that each would have its own /dev/shm
mount, that's not a requirement on using namespaces. And actually, in this case even including the uid might not be enough - consider the case where each process in an MPI job is in its own pid namespace; each process might have the same uid and pid.
I do agree that there's probably a better way of managing this than just minimising the chance of collision, but I haven't thought what that might be yet.
It was not clear from the comments what you were trying to achieve. Now that I understand better, it seems possible, but extremely unlikely to hit the scenario you describe. However, I don't think the correct approach is to build the filename from well-defined entities.
After looking at the code, I think the real problem is in the way shmem support is implemented, it generates a new file name for the segment while completely ignoring the file name provided by the upper layer (filename that has been carefully crafted to provide unicity). Thus, I suggest to replace the approach implemented here with something similar to what the mmap module is doing, either using the filename provided by the caller, or using get_uniq_file_name to generate one that is unique (we can move the get_uniq_file_name function in a common place, aka base).
Yes, that would be a much better approach. I didn't realise there was already a unique name being passed down from the upper layers. Just need to keep in mind that apparently some platforms have a limit on how long that name can be. I'll have a look around the code and see what I can come up with.
Collisions with the original code are much more likely, since there's a fixed set of 128 potential names. It's all-but-guaranteed when you consider the high core counts of newer machines (each process needs its own unique name). So did you want to merge this for now as an interim measure until we come up with a better solution? Otherwise, I'm happy to just keep carrying it as a local patch to help for our system in particular.
Yes, you need to manipulate the filename string a little, because it might contain separators and/or might be too long. But is can give you a nice base for uniqueness.
Moving get_uniq_file_name out of the mmap component and using it in the shmem component should be a quick patch, a solution that not only addresses your issue but also improve the consistency of the code. I would hold on this PR until then.
@benmenadue, any interest in resurrecting this PR? looks pretty close to done, though some changes still requested.
@gpaulsen I completely forgot about making those changes, sorry. I had a look at the code and can definitely see how to improve on this based on those suggestions, and it shouldn't be too difficult at all. I'll do that tomorrow and update the PR.