scr Redefine meaning of SCR

Originally, SCR_ENABLE was a way to turn off the SCR API from having any effect. It was most useful in the scripts, but in the API it essentially turned all calls into NOPs. The one change it would do was to set the output flag=0 in Need_checkpoint, but otherwise it would return immediately with SCR_FAILURE from all calls. The idea was that users would not call SCR APIs if they had disabled it.

One user request is to change this so that SCR still does something when SCR_ENABLE=0, but to have it behave differently than when SCR_ENABLE=1. This is a useful change, especially if it simplifies the code path for the user so they can just always call the SCR API in either case. However, it requires considering each API call individually and defining what its behavior should be when SCR_ENABLE=0, e.g., what input variables does it require to be valid, what output variables should it change, what should its return code be.

The API docs should also be updated to document the behavior.

Jan 30 '20 18:01 adammoody

There is currently odd behaviour. In v1.2.2 for example SCR_Start_checkpoint reads

int SCR_Start_checkpoint()
{
  /* manage state transition */
  if (scr_state != SCR_STATE_IDLE) {
    scr_state_transition_error(scr_state, "SCR_Start_checkpoint()", __FILE__, __LINE__);
  }
  scr_state = SCR_STATE_CHECKPOINT;

  /* if not enabled, bail with an error */
  if (! scr_enabled) {
    return SCR_FAILURE;
  }

ie it will allow a state transition even if scr_enbaled is false. However SCR_Route_file has code

int SCR_Route_file(const char* file, char* newfile)
{
  /* manage state transition */
  if (scr_state != SCR_STATE_RESTART    &&
      scr_state != SCR_STATE_CHECKPOINT &&
      scr_state != SCR_STATE_OUTPUT)
  {
    // [...]
    return SCR_SUCCESS;
  }

  /* if not enabled, bail with an error */
  if (! scr_enabled) {
    return SCR_FAILURE;
  }

ie will return without setting newfile (in particular it does not strdup file to newfile if in scr_state == SCR_STATE_CHECKPOINT.

which is neither a no-op (the state changes) nor a useful operation (newfile is not set to a copy of file) in the SCR_Start_checkpoint / SCR_Route_file pair.

Jan 30 '20 19:01 rhaas80

Yeah, other oddness is that Need_checkpoint sets the output flag to 0 when SCR_ENABLE=0, so it's not quite a NOP either, even disregarding the state tracking. Given that we want to redefine SCR_ENABLE=0 to still do something, there are at least four logical choices I can think of for this output parameter:

always set to 0 to mean a checkpoint is never needed
always set to 1 to mean a checkpoint is always needed
still execute the checkpoint decision logic to determine whether a checkpoint should be taken
do not change the variable at all, which would mean the user could essentially define the behavior for this function depending on the value they use to initialize the flag parameter before they call the function

Jan 30 '20 19:01 adammoody

We need to re-evaluate what each function of the SCR API does in the case of SCR_ENABLE=0 since that is currently undefined. This is not trivial and probably needs some discussion (during a team meeting) and then documentation, particularly around which arguments are valid if NULL and what the return value should be (currently we're always returning SCR_FAILURE).

I think our goal should be:

with SCR_ENABLE=1, we do the magic we've always done.
with SCR_ENABLE=0, we act as transparently as possible... as if SCR weren't in the code. (which requires some guessing at what our users do by default).

This would allow applications to always compile SCR into the code, but SCR would not be activated until users take action to turn it on with SCR_ENABLE=1.

Jan 30 '20 19:01 gonsie

To be complete, the run time scripts (scr_srun/prerun/postrun and friends) should also be considered to see what should and should not be done in those two cases. I'm not sure that's well defined right now either.

Jan 30 '20 20:01 adammoody

To get the discussion started, and based on the oral discussion on this week's (2020-02-18) call, here's a suggestion for behaviour of SCR if SCR_ENABLE=0:

As a general rule, a code that used to checkpoint without SCR, then had SCR added, and is now run with SCR_ENABLE=0 should behave in the same way as before SCR was added. In detail this means:

no SCR API function must return SCR_FAILURE (solely) because SCR_ENABLE=0
SCR_Need_checkpoint sets its flag to 0 (no checkpoint requested)
SCR_Have_restart sets its flag to 0 (no recovery possible)
SCR_Route_file copies the passed in string to the output buffer
SCR_Should_exit sets its flag to 0 (no exit requested)
SCR_Start_XXX and SCR_Complete_XXX return SCR_SUCCESS without doing anything
SCR_Get_version behaves the same way no matter what value SCR_ENABLE has
no copying of files happens in the utilities
no watchdog is started by scr_run
scripts that only query the nodes behave as usual (eg scr_node_file) except they always assume all nodes are up
no logs are written by scr_log_XXX functions

Feb 21 '20 00:02 rhaas80

Thanks for the thorough analysis @rhaas80 !

I agree with every point, except point 2. I think we should be on the safe side and always say that an application should take a checkpoint (better safe than sorry). If an application is relying solely on a call to SCR_Need_checkpoint to determine if it should checkpoint, then always returning 0 means never checkpointing... which is most likely not the expected default behavior of an application. With can then advertise that with SCR, applications can checkpoint less frequently, rather than all the time.

@adammoody suggested a way for users pre-initialize the variable so that users could define their own preferred behavior... but I think my preference is for always returning 1. Maybe we can check with @kathrynmohror at the next meeting and see if she has a different sense of what applications folks expect by default.

Feb 22 '20 18:02 gonsie

Another question is whether someone wants to "disable" SCR or whether they want it to function normally except that it only uses the parallel file system. I don't know, but we could bounce that question off the users who are requesting this.

The latter case (https://github.com/LLNL/scr/issues/54) would greatly increase the number of codes that can use SCR. The idea here is that we assume there is no cache space for a checkpoint. Need_checkpoint would work as normal. Start/Complete_checkpoint would still track the files that are part of the checkpoint, with Route_file directing files to the parallel file system, and write the dataset summary files to the parallel file system like it does after a flush. Have_restart and Start/Complete_restart would still work as normal, but again Route_file would point to the parallel file system. SCR_Init would still internally identify the most recent checkpoint available on the file system, but it would not fetch or rebuild user data (but maybe still SCR metadata). Should_exit would still work as normal. The scavenge script would not need to do anything.

Feb 23 '20 01:02 adammoody

@adammoody my gut says these are two different use cases that should be thought through independently:

SCR enable off
pass through (no cache situation you describe for #54

One reason is that (as you note) SCR_ENABLE=0 behavior is not well-defined in SCR so we need to address it. In this case, we are not managing checkpoints at all so we should really not do anything, including routing to the parallel file system. For #54 (pass through) we'll still be managing checkpoints in that case but just not using in-system cache. This will require more thought and work since we'll be touching config files, metadata, etc throughout the SCR code base.

Obviously, I don't think we should forget about #54. I just think this change is independent of that. Let me know if I misunderstood your point.

@gonsie I think your point on SCR_Need_checkpoint is right, that we should always return yes when SCR is not enabled.

Feb 23 '20 12:02 kathrynmohror

My thought is how do we define the API to support applications that have fully integrated the SCR API. For context, such an app might look like the following:

mpi_init
scr_init

scr_have_restart(flag)
if flag
  scr_start_restart
  scr_route_file
  read checkpoint file
  scr_complete_restart

while (loop simulation)
  do work step

  scr_need_checkpoint(flag)
  if flag
    scr_start_checkpoint
    scr_route_file
    write checkpoint file
    scr_complete_checkpoint

  scr_should_exit(flag)
  if flag
    break

scr_finalize
mpi_finalize

The definitions as currently proposed will require changes to this program. For one, the restart logic will need to know whether SCR is enabled or not. If scr_have_restart always returns 0, the restart logic here will never read back a checkpoint file. I suspect the authors would also want to avoid calling scr_need_checkpoint when scr is not enabled, since they probably do not want to checkpoint on every loop iteration. They could not rely on scr_should_exit to tell them when to bail out, so they'd need to add logic there to do something different.

It would be nice to find definitions so that a program like the one above still works without requiring the developers to write up two code paths.

Having said that, I think the proposed APIs would support apps that only use the checkpoint/output APIs of SCR, and that's the first step people take when integrating.

Feb 23 '20 20:02 adammoody

Right, making sure the API is well-defined in the case that SCR_ENABLE=0 is a clear need, and yes supporting the full PFS-only mode will be a lot of work.

As far as I know, we've only had one user so far attempt to use SCR_ENABLE=0, and I believe his expectation was that this setting would change the library to behave in a "PFS-only" mode so that he didn't need to write two different code paths in his checkpoint logic. Extrapolating that expectation to the rest of the API is how I got to the above view.

On the other hand, if someone is willing to write two different code paths depending on whether scr is enabled, then it seems like they could write two code paths for all portions of their code and avoid calling any scr function when it is not enabled. It feels sort of half-done if we define SCR_ENABLE=0 behavior to only support a subset of scr calls unless we believe that covers a large number of users (which it might).

Feb 23 '20 21:02 adammoody

scr
scr copied to clipboard

Redefine meaning of SCR_ENABLE=0

scr scr copied to clipboard

Redefine meaning of SCR_ENABLE=0

scr
scr copied to clipboard