scr
scr copied to clipboard
Redefine meaning of SCR_ENABLE=0
Originally, SCR_ENABLE was a way to turn off the SCR API from having any effect. It was most useful in the scripts, but in the API it essentially turned all calls into NOPs. The one change it would do was to set the output flag=0 in Need_checkpoint, but otherwise it would return immediately with SCR_FAILURE from all calls. The idea was that users would not call SCR APIs if they had disabled it.
One user request is to change this so that SCR still does something when SCR_ENABLE=0, but to have it behave differently than when SCR_ENABLE=1. This is a useful change, especially if it simplifies the code path for the user so they can just always call the SCR API in either case. However, it requires considering each API call individually and defining what its behavior should be when SCR_ENABLE=0, e.g., what input variables does it require to be valid, what output variables should it change, what should its return code be.
The API docs should also be updated to document the behavior.
There is currently odd behaviour. In v1.2.2 for example SCR_Start_checkpoint
reads
int SCR_Start_checkpoint()
{
/* manage state transition */
if (scr_state != SCR_STATE_IDLE) {
scr_state_transition_error(scr_state, "SCR_Start_checkpoint()", __FILE__, __LINE__);
}
scr_state = SCR_STATE_CHECKPOINT;
/* if not enabled, bail with an error */
if (! scr_enabled) {
return SCR_FAILURE;
}
ie it will allow a state transition even if scr_enbaled
is false. However SCR_Route_file
has code
int SCR_Route_file(const char* file, char* newfile)
{
/* manage state transition */
if (scr_state != SCR_STATE_RESTART &&
scr_state != SCR_STATE_CHECKPOINT &&
scr_state != SCR_STATE_OUTPUT)
{
// [...]
return SCR_SUCCESS;
}
/* if not enabled, bail with an error */
if (! scr_enabled) {
return SCR_FAILURE;
}
ie will return without setting newfile
(in particular it does not strdup file
to newfile
if in scr_state == SCR_STATE_CHECKPOINT
.
which is neither a no-op (the state changes) nor a useful operation (newfile
is not set to a copy of file
) in the SCR_Start_checkpoint
/ SCR_Route_file
pair.
Yeah, other oddness is that Need_checkpoint sets the output flag to 0 when SCR_ENABLE=0, so it's not quite a NOP either, even disregarding the state tracking. Given that we want to redefine SCR_ENABLE=0 to still do something, there are at least four logical choices I can think of for this output parameter:
- always set to 0 to mean a checkpoint is never needed
- always set to 1 to mean a checkpoint is always needed
- still execute the checkpoint decision logic to determine whether a checkpoint should be taken
- do not change the variable at all, which would mean the user could essentially define the behavior for this function depending on the value they use to initialize the flag parameter before they call the function
We need to re-evaluate what each function of the SCR API does in the case of SCR_ENABLE=0 since that is currently undefined. This is not trivial and probably needs some discussion (during a team meeting) and then documentation, particularly around which arguments are valid if NULL and what the return value should be (currently we're always returning SCR_FAILURE).
I think our goal should be:
- with SCR_ENABLE=1, we do the magic we've always done.
- with SCR_ENABLE=0, we act as transparently as possible... as if SCR weren't in the code. (which requires some guessing at what our users do by default).
This would allow applications to always compile SCR into the code, but SCR would not be activated until users take action to turn it on with SCR_ENABLE=1.
To be complete, the run time scripts (scr_srun/prerun/postrun and friends) should also be considered to see what should and should not be done in those two cases. I'm not sure that's well defined right now either.
To get the discussion started, and based on the oral discussion on this week's (2020-02-18) call, here's a suggestion for behaviour of SCR if SCR_ENABLE=0
:
As a general rule, a code that used to checkpoint without SCR, then had SCR added, and is now run with SCR_ENABLE=0 should behave in the same way as before SCR was added. In detail this means:
- no SCR API function must return
SCR_FAILURE
(solely) becauseSCR_ENABLE=0
-
SCR_Need_checkpoint
sets its flag to 0 (no checkpoint requested) -
SCR_Have_restart
sets its flag to 0 (no recovery possible) -
SCR_Route_file
copies the passed in string to the output buffer -
SCR_Should_exit
sets its flag to 0 (no exit requested) -
SCR_Start_XXX
andSCR_Complete_XXX
returnSCR_SUCCESS
without doing anything -
SCR_Get_version
behaves the same way no matter what valueSCR_ENABLE
has - no copying of files happens in the utilities
- no watchdog is started by scr_run
- scripts that only query the nodes behave as usual (eg
scr_node_file
) except they always assume all nodes are up - no logs are written by
scr_log_XXX
functions
Thanks for the thorough analysis @rhaas80 !
I agree with every point, except point 2. I think we should be on the safe side and always say that an application should take a checkpoint (better safe than sorry). If an application is relying solely on a call to SCR_Need_checkpoint
to determine if it should checkpoint, then always returning 0 means never checkpointing... which is most likely not the expected default behavior of an application. With can then advertise that with SCR, applications can checkpoint less frequently, rather than all the time.
@adammoody suggested a way for users pre-initialize the variable so that users could define their own preferred behavior... but I think my preference is for always returning 1. Maybe we can check with @kathrynmohror at the next meeting and see if she has a different sense of what applications folks expect by default.
Another question is whether someone wants to "disable" SCR or whether they want it to function normally except that it only uses the parallel file system. I don't know, but we could bounce that question off the users who are requesting this.
The latter case (https://github.com/LLNL/scr/issues/54) would greatly increase the number of codes that can use SCR. The idea here is that we assume there is no cache space for a checkpoint. Need_checkpoint would work as normal. Start/Complete_checkpoint would still track the files that are part of the checkpoint, with Route_file directing files to the parallel file system, and write the dataset summary files to the parallel file system like it does after a flush. Have_restart and Start/Complete_restart would still work as normal, but again Route_file would point to the parallel file system. SCR_Init would still internally identify the most recent checkpoint available on the file system, but it would not fetch or rebuild user data (but maybe still SCR metadata). Should_exit would still work as normal. The scavenge script would not need to do anything.
@adammoody my gut says these are two different use cases that should be thought through independently:
- SCR enable off
- pass through (no cache situation you describe for #54
One reason is that (as you note) SCR_ENABLE=0 behavior is not well-defined in SCR so we need to address it. In this case, we are not managing checkpoints at all so we should really not do anything, including routing to the parallel file system. For #54 (pass through) we'll still be managing checkpoints in that case but just not using in-system cache. This will require more thought and work since we'll be touching config files, metadata, etc throughout the SCR code base.
Obviously, I don't think we should forget about #54. I just think this change is independent of that. Let me know if I misunderstood your point.
@gonsie I think your point on SCR_Need_checkpoint is right, that we should always return yes when SCR is not enabled.
My thought is how do we define the API to support applications that have fully integrated the SCR API. For context, such an app might look like the following:
mpi_init
scr_init
scr_have_restart(flag)
if flag
scr_start_restart
scr_route_file
read checkpoint file
scr_complete_restart
while (loop simulation)
do work step
scr_need_checkpoint(flag)
if flag
scr_start_checkpoint
scr_route_file
write checkpoint file
scr_complete_checkpoint
scr_should_exit(flag)
if flag
break
scr_finalize
mpi_finalize
The definitions as currently proposed will require changes to this program. For one, the restart logic will need to know whether SCR is enabled or not. If scr_have_restart always returns 0, the restart logic here will never read back a checkpoint file. I suspect the authors would also want to avoid calling scr_need_checkpoint when scr is not enabled, since they probably do not want to checkpoint on every loop iteration. They could not rely on scr_should_exit to tell them when to bail out, so they'd need to add logic there to do something different.
It would be nice to find definitions so that a program like the one above still works without requiring the developers to write up two code paths.
Having said that, I think the proposed APIs would support apps that only use the checkpoint/output APIs of SCR, and that's the first step people take when integrating.
Right, making sure the API is well-defined in the case that SCR_ENABLE=0 is a clear need, and yes supporting the full PFS-only mode will be a lot of work.
As far as I know, we've only had one user so far attempt to use SCR_ENABLE=0, and I believe his expectation was that this setting would change the library to behave in a "PFS-only" mode so that he didn't need to write two different code paths in his checkpoint logic. Extrapolating that expectation to the rest of the API is how I got to the above view.
On the other hand, if someone is willing to write two different code paths depending on whether scr is enabled, then it seems like they could write two code paths for all portions of their code and avoid calling any scr function when it is not enabled. It feels sort of half-done if we define SCR_ENABLE=0 behavior to only support a subset of scr calls unless we believe that covers a large number of users (which it might).