shared-cache - enable shared-cache support in SCR
SCR can direct the application to write dataset files to subdirectories within a cache directory. SCR also stores its redundancy data in these subdirectories.
Question: Should it be considered an error to configure redundancy schemes when the cache is shared?
To construct the full path of a cache directory, SCR combines a cache base directory name (SCR_CACHE_BASE) with the user name and the allocation id associated with the resource allocation.
The cache directory name is currently derived from the concatenation of the cache base directory (SCR_CACHE_BASE), the name of the user running the application, and the job scheduler resource allocation id. This presents a name collision problem when the cache is on a shared file system.
This ticket proposes that the cache directory name should also have the MPI rank number appended to the name above.
Question: Should we just append this in general, or only when the cache is on a shared file system? The latter raises the question of how SCR can determine whether the file system is shared. My vote is to simply append the rank number as a general rule after the session id.
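A minimal sketch of the proposed naming scheme, for illustration only. The helper `build_cache_dir`, the `scr.<id>` and `rank.<n>` path components, and the exact layout are assumptions made for this example, not SCR's actual implementation or API:

```c
#include <stdio.h>

/* Hypothetical helper, not part of SCR: builds
 *   <base>/<user>/scr.<jobid>/rank.<rank>
 * into buf. The layout is an assumed illustration of "base + user +
 * allocation id + rank". Returns 0 on success, or -1 if formatting
 * failed or the result did not fit in buf. */
int build_cache_dir(char* buf, size_t size, const char* base,
                    const char* user, const char* jobid, int rank)
{
  int n = snprintf(buf, size, "%s/%s/scr.%s/rank.%d", base, user, jobid, rank);
  if (n < 0 || (size_t)n >= size) {
    return -1; /* encoding error or truncated path */
  }
  return 0;
}
```

Because the rank is always the last component, two ranks that share the same base directory, user, and allocation id still get distinct directories, which is why appending it unconditionally would work for both shared and node-local caches.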
@adammoody, I think that this change is required in order for SCR to support a shared cache. Do you agree?
If so, should a shared cache be a mode that SCR is configured in? Or, should we simply change the naming scheme in general so that it works in both a shared and non-shared cache?
To start with, let's only claim to support SINGLE when using a shared cache. We'll assume that the shared cache is reliable enough that redundancy is not necessary. Also, I think it'll be too complicated for us (and maybe not possible) to try to implement a redundancy scheme that could actually tolerate failures of the file system, e.g. in the case that a Lustre server drops out.
I don't know whether we can easily enforce that one only uses SINGLE, since we can't easily determine whether a storage location is node-local or global. Having said that, I think Dong had created something we may be able to use (https://computing.llnl.gov/projects/fast-global-file-status). But to keep things simple, let's pretend that we can't for now.
Instead, we can document that it is on the user to mark any shared storage as GLOBAL by defining a proper storage descriptor. We can then enforce that only the SINGLE redundancy scheme is valid to use with a GLOBAL storage descriptor.
- It's on the user to define a store descriptor to specify a shared cache as GLOBAL. If they don't, SCR is allowed to blow up in some bad way.
- SCR enforces that only SINGLE can be used with a GLOBAL store descriptor.
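
The enforcement rule in the second bullet could look roughly like the following. This is only a sketch: the `store_kind` enum, the `store_desc` struct, and `scheme_allowed` are hypothetical names invented for illustration, and SCR's real descriptor types are different.

```c
#include <string.h>

/* Hypothetical types for illustration; not SCR's actual descriptors. */
typedef enum { STORE_LOCAL, STORE_GLOBAL } store_kind;

typedef struct {
  const char* path; /* directory the store descriptor points at */
  store_kind  kind; /* STORE_GLOBAL when the user marked it as shared */
} store_desc;

/* Returns 1 if the named redundancy scheme may be used with the store,
 * 0 otherwise. A GLOBAL store accepts only "SINGLE"; any other store
 * accepts every scheme. */
int scheme_allowed(const store_desc* store, const char* scheme)
{
  if (store->kind == STORE_GLOBAL) {
    return strcmp(scheme, "SINGLE") == 0;
  }
  return 1;
}
```

Note that the check relies entirely on the user having marked the store correctly. A shared cache that is not marked GLOBAL passes through unchecked, which matches the first bullet.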