shared-cache - enable shared-cache support in SCR

Open mcfadden8 opened this issue 2 years ago • 2 comments

SCR can direct the application to write dataset files to subdirectories within a cache directory. SCR also stores its redundancy data in these subdirectories.

Question: Should it be considered an error to configure redundancy schemes when the cache is shared?

To construct the full path of a cache directory, SCR currently concatenates the cache base directory (SCR_CACHE_BASE), the name of the user running the application, and the job scheduler's resource allocation id. This presents a name collision problem when the cache is on a shared file system.

This ticket proposes that the cache directory name should also have the MPI rank number appended to the name above.

Question: Should we append this in general, or only when the cache is on a shared file system? The latter raises the question of how SCR can determine whether the file system is shared. My vote is to simply append the rank number as a general rule after the session id.
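To make the proposal concrete, here is a minimal sketch of the naming scheme with and without the rank. It is not SCR's actual code: the function name `cache_dir`, the example paths, and the choice to append the rank as a further path component are all assumptions made for illustration.

```python
import posixpath


def cache_dir(base, user, alloc_id, rank=None):
    """Build a cache directory path (hypothetical helper, not SCR's API).

    base     -- cache base directory, e.g. the value of SCR_CACHE_BASE
    user     -- name of the user running the application
    alloc_id -- job scheduler resource allocation id
    rank     -- MPI rank; appended as a last component only when given
    """
    parts = [base, user, str(alloc_id)]
    if rank is not None:
        # Assumed layout: the rank becomes one more path component.
        parts.append(str(rank))
    return posixpath.join(*parts)


# Current scheme: every rank of the job computes the same path.
same = {cache_dir("/shared/cache", "alice", 12345) for _ in range(4)}

# Proposed scheme: each rank gets its own directory.
distinct = {cache_dir("/shared/cache", "alice", 12345, rank=r) for r in range(4)}

print(len(same), len(distinct))
```

On node-local storage the shared name is harmless because each node has its own copy of the directory. On a shared file system all ranks land in one directory, which is the collision described above.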

mcfadden8 avatar Feb 23 '22 18:02 mcfadden8

@adammoody, I think that this change is required in order for SCR to support a shared cache. Do you agree?

If so, should a shared cache be a mode that SCR is configured in? Or, should we simply change the naming scheme in general so that it works in both a shared and non-shared cache?

mcfadden8 avatar Jul 27 '22 18:07 mcfadden8

To start with, let's only claim to support SINGLE when using a shared cache. We'll assume that the shared cache is reliable enough that redundancy is not necessary. Also, I think it'll be too complicated for us (and maybe not possible) to try to implement a redundancy scheme that could actually tolerate failures of the file system, e.g. in the case that a Lustre server drops out.

I don't know whether we can easily enforce that one only uses SINGLE, since we can't easily determine whether a storage location is node-local or global. Having said that, I think Dong had created something we may be able to use (https://computing.llnl.gov/projects/fast-global-file-status). But to keep things simple, let's pretend that we can't for now.

Instead, we can document that it is on the user to mark any shared storage as GLOBAL by defining a proper storage descriptor. We can then enforce that only the SINGLE redundancy scheme is valid to use with a GLOBAL storage descriptor.

  • It's on the user to define a store descriptor to specify a shared cache as GLOBAL. If they don't, SCR is allowed to blow up in some bad way.
  • SCR enforces that only SINGLE can be used with a GLOBAL store descriptor.
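The enforcement rule in the two bullets could look roughly like the sketch below. This is only an illustration in Python, not SCR's implementation: `StoreDescriptor`, its `is_global` field, and `validate_scheme` are hypothetical names, and how a store is actually marked GLOBAL in an SCR configuration is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreDescriptor:
    """Hypothetical stand-in for a store descriptor."""
    path: str
    is_global: bool = False  # set by the user for shared storage


def validate_scheme(store, scheme):
    """Reject any redundancy scheme other than SINGLE on a GLOBAL store."""
    scheme = scheme.upper()
    if store.is_global and scheme != "SINGLE":
        raise ValueError(
            f"store {store.path} is GLOBAL; only SINGLE is allowed, got {scheme}"
        )
    return scheme


local = StoreDescriptor("/dev/shm")
shared = StoreDescriptor("/shared/cache", is_global=True)

validate_scheme(local, "XOR")      # accepted: node-local store
validate_scheme(shared, "SINGLE")  # accepted: SINGLE on a GLOBAL store

try:
    validate_scheme(shared, "XOR")
except ValueError as err:
    print("rejected:", err)
```

Note that the check relies entirely on the user having marked the store correctly. A shared cache left unmarked passes validation, which matches the first bullet: in that case SCR makes no guarantees.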

adammoody avatar Jul 27 '22 20:07 adammoody