Add checkpointing

Open feldergast opened this issue 1 year ago • 5 comments

Add the ability to checkpoint and restart SST runs.

feldergast avatar Sep 28 '23 15:09 feldergast

Use cases:

An HPC provider (a National Lab) limits HPC jobs to 24 hours. Therefore checkpointing would be useful to run longer jobs.

A separate motivation: if a job takes 24.5 hours (longer than expected), it is killed. The researcher then loses days of productivity, because the job had already spent time waiting in the queue.

bhpayne avatar Sep 29 '23 12:09 bhpayne

Work is underway to make critical structures in SST-Core serializable in preparation for checkpoint generation. We've identified that many structures can be serialized without visible API changes. Handlers, however (e.g., clock/event handlers), will need a visible change in definition, as follows:

    handler_old = new Handler<Class>(this, &Class::callback_function);  // Current way
    handler_new = new Handler2<Class, &Class::callback_function>(this); // New way

For backwards compatibility, the old definition would still work, but components using it would not be checkpoint-able.
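One way to see why moving the callback into the template argument list helps is the sketch below. It assumes nothing about the real SST-Core classes: `HandlerBase`, `RuntimeHandler`, `StaticHandler`, and `Counter` are hypothetical stand-ins. In the old style the member-function pointer is run-time data that a checkpoint would somehow have to store; in the new style it is part of the type, so the only state left in the handler is the object pointer.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only, not the SST-Core classes.
struct HandlerBase {
    virtual ~HandlerBase() = default;
    virtual void operator()(int arg) = 0;
};

// Old style: the member-function pointer is stored as run-time data.
// Its representation is implementation-defined, so writing it to a
// checkpoint and reading it back in another process is not portable.
template <typename ClassT>
class RuntimeHandler : public HandlerBase {
public:
    using Callback = void (ClassT::*)(int);
    RuntimeHandler(ClassT* obj, Callback cb) : obj_(obj), cb_(cb) {}
    void operator()(int arg) override { (obj_->*cb_)(arg); }

private:
    ClassT*  obj_;
    Callback cb_;  // extra state a checkpoint would need to capture
};

// New style: the callback is a non-type template parameter, so it is
// fixed by the handler's type and never needs to be written out.
template <typename ClassT, void (ClassT::*Callback)(int)>
class StaticHandler : public HandlerBase {
public:
    explicit StaticHandler(ClassT* obj) : obj_(obj) {}
    void operator()(int arg) override { (obj_->*Callback)(arg); }

private:
    ClassT* obj_;  // the only remaining state
};

struct Counter {
    int total = 0;
    void add(int n) { total += n; }
};

// Both styles invoke the same member function on the same object.
int run_both_handlers() {
    Counter c;
    RuntimeHandler<Counter> old_style(&c, &Counter::add);
    StaticHandler<Counter, &Counter::add> new_style(&c);
    old_style(2);
    new_style(3);
    assert(c.total == 5);
    return c.total;
}
```

Both handlers behave identically at call time; the difference only matters when the handler has to be written out and rebuilt.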

Other implementation notes:

  • First pass will not support re-partitioning at restart, but this is a longer-term goal. Additional information needs to be kept or reconstructed during a checkpoint to enable it.
  • BaseComponent needs an API to allow elements to checkpoint themselves. This API will evolve as we extend checkpoint support to element libraries. Some libraries may not be checkpoint-able without significant changes.
  • Statistics infrastructure has not yet been evaluated.
  • Profile points are not easily checkpointed given their shared state, and cannot be checkpointed once we support re-partitioning. Instead of checkpointing, they would be regenerated on restart.
  • Checkpoint would be supported in SST's run loop. It would not be supported during construction/init/setup.

gvoskuilen avatar Apr 05 '24 15:04 gvoskuilen

A few more details on serialization changes needed for checkpointing:

Checkpointing was implemented for "base" objects by implementing a template, called "serialize", for the given type. As a convenience, operator& was also overloaded as a template to allow simpler syntax in the serialize_order functions. To support pointer tracking, this structure has changed somewhat: operator& still calls the serialize template, which now does all the pointer tracking, and the original serialize templates have been renamed to serialize_impl. A new function call on the serializer turns pointer tracking on; once on, the serializer keeps track of all pointers. The data is serialized with the first instance of a pointer, and all subsequent instances store only a tag referring to the first instance. On deserialization, the object is recreated at the first instance, and all other instances are given the pointer to the new object.
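The pointer-tracking scheme described above can be sketched with a toy packer and unpacker. This is only an illustration under stated assumptions: `Payload`, `ToyPacker`, and `ToyUnpacker` are hypothetical names, the old pointer value is assumed to serve as the tag, and null pointers and polymorphic types are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <set>
#include <vector>

// Hypothetical tracked object type.
struct Payload {
    std::uint64_t value;
};

// Packs pointers: the first sighting writes tag + data, later ones only the tag.
class ToyPacker {
public:
    void pack(const Payload* p) {
        const auto tag =
            static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
        buffer_.push_back(tag);
        if (seen_.insert(tag).second) {
            buffer_.push_back(p->value);  // data travels with the first instance
        }
    }
    const std::vector<std::uint64_t>& buffer() const { return buffer_; }

private:
    std::set<std::uint64_t>    seen_;
    std::vector<std::uint64_t> buffer_;
};

// Unpacks pointers: the first sighting of a tag recreates the object,
// later sightings hand back the pointer to that same new object.
class ToyUnpacker {
public:
    explicit ToyUnpacker(std::vector<std::uint64_t> buffer)
        : buffer_(std::move(buffer)) {}

    Payload* unpack() {
        const std::uint64_t tag = buffer_[pos_++];
        auto it = restored_.find(tag);
        if (it != restored_.end()) return it->second.get();
        std::unique_ptr<Payload> fresh(new Payload{buffer_[pos_++]});
        Payload* raw = fresh.get();
        restored_[tag] = std::move(fresh);  // the unpacker owns restored objects
        return raw;
    }

private:
    std::vector<std::uint64_t> buffer_;
    std::size_t pos_ = 0;
    std::map<std::uint64_t, std::unique_ptr<Payload>> restored_;
};

// Two references to one object must come back as two references to one object.
bool shared_pointer_survives_round_trip() {
    Payload shared{42};
    ToyPacker packer;
    packer.pack(&shared);
    packer.pack(&shared);
    assert(packer.buffer().size() == 3);  // tag, value, tag

    ToyUnpacker unpacker(packer.buffer());
    Payload* a = unpacker.unpack();
    Payload* b = unpacker.unpack();
    return a == b && a != &shared && a->value == 42;
}
```

The buffer holds the payload once, and both restored pointers refer to the same newly created object rather than to two copies.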

Added a new template operator| (operator or). This is only used for the very specific case of treating a non-pointer as a pointer, where the data is stored directly in the object (for example in a map or set) but other objects hold pointers to the data. This is needed for the ComponentInfo objects of SubComponents, where the ComponentInfo object is stored in the parent in a std::map<ComponentId_t, ComponentInfo> and the SubComponent has a pointer to the data in its parent. A limitation of this function is that the non-pointer data must be serialized first.

Made a serialize_impl template instantiation that will handle non-polymorphic classes. This allows a non-polymorphic class to serialize with only a serialize_order function and no need to inherit from serializable.
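As a rough picture of what "only a serialize_order function" means, the sketch below uses a hypothetical pack-only `ToyPackSerializer` (not SST's serializer) whose operator& dispatches on the type: arithmetic fields are written directly, and any class type is asked for its `serialize_order`. `Point2D` and `Segment` are made-up types with no base class and no virtual functions.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical pack-only serializer. For simplicity every arithmetic
// field is stored as a double.
class ToyPackSerializer {
public:
    // Arithmetic types are written directly.
    template <typename T>
    typename std::enable_if<std::is_arithmetic<T>::value, ToyPackSerializer&>::type
    operator&(T& value) {
        fields_.push_back(static_cast<double>(value));
        return *this;
    }

    // Class types only need a serialize_order member; no base class required.
    template <typename T>
    typename std::enable_if<std::is_class<T>::value, ToyPackSerializer&>::type
    operator&(T& object) {
        object.serialize_order(*this);
        return *this;
    }

    const std::vector<double>& fields() const { return fields_; }

private:
    std::vector<double> fields_;
};

// Non-polymorphic: no inheritance, no virtual functions.
struct Point2D {
    int x;
    int y;
    void serialize_order(ToyPackSerializer& ser) { ser & x & y; }
};

// Nested non-polymorphic members are handled by the same dispatch.
struct Segment {
    Point2D from;
    Point2D to;
    double  weight;
    void serialize_order(ToyPackSerializer& ser) { ser & from & to & weight; }
};

std::vector<double> pack_segment() {
    Segment s{{1, 2}, {3, 4}, 0.5};
    ToyPackSerializer ser;
    ser & s;
    assert(ser.fields().size() == 5);
    return ser.fields();
}
```

The fields come out in the order the `serialize_order` functions list them, which is what lets a matching unpack pass read them back.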

We are considering adding an implementation of serialize_impl that will handle classes that return true for std::is_trivially_copyable. In this case, there would be no need for a serialize_order function and it would just use memcpy to serialize the event. We still need to evaluate whether is_trivially_copyable will return true for a class with a pointer as one of its data members. If it does, we won't be able to make this work, as the pointer would not be pointing to the correct data when deserialized.
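The open question about pointer members can be checked with the standard trait directly. Raw pointers are scalar types and therefore trivially copyable, so a plain struct containing one still satisfies std::is_trivially_copyable. The standalone check below uses made-up types (`PlainData`, `WithPointer`) and is not SST code.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Made-up example types.
struct PlainData {
    int    id;
    double weight;
};

struct WithPointer {
    int  id;
    int* target;  // raw pointer member
};

// Both pass: a raw pointer member does not make a class non-trivially-copyable.
static_assert(std::is_trivially_copyable<PlainData>::value,
              "PlainData should be trivially copyable");
static_assert(std::is_trivially_copyable<WithPointer>::value,
              "WithPointer is trivially copyable despite the pointer member");

// A memcpy round trip reproduces the pointer's bit pattern, i.e. the old address.
bool pointer_value_is_copied_verbatim() {
    int local = 7;
    WithPointer original{1, &local};

    unsigned char bytes[sizeof(WithPointer)];
    std::memcpy(bytes, &original, sizeof original);

    WithPointer restored;
    std::memcpy(&restored, bytes, sizeof restored);

    assert(restored.id == 1);
    return restored.target == &local;
}
```

Within one process the copied address is still valid, but after a restart it would refer to memory from the old run, so the trait alone would not be enough to decide when memcpy serialization is safe.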

feldergast avatar Apr 15 '24 16:04 feldergast

Update on TimeVortex checkpointing:

Ultimately, we plan to have events serialized with the Components they are targeting. This is to enable different pre- and post- checkpoint/restart partitioning. For the initial implementation, the TimeVortex will be serialized in-place and the Event::Handlers will be "fixed up" after restarting to point to the correct post-restart handler. This is done by having the Links report their handlers (tag and new pointer) to the Simulation_impl object so that it can exchange the old pointer (used as the tag in the checkpoint) with the new pointer.
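The fix-up step can be pictured as a lookup table keyed by the old pointer value. The sketch below is an assumption-laden illustration, not the Simulation_impl code: `Handler`, `PendingEvent`, and `FixupTable` are hypothetical names, and the tags are arbitrary example addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical handler; link_id just makes instances distinguishable.
struct Handler {
    int link_id;
};

// Hypothetical event as restored from a checkpoint: it carries the old
// handler pointer value as a tag and no usable pointer yet.
struct PendingEvent {
    std::uintptr_t handler_tag;
    Handler*       handler;
};

class FixupTable {
public:
    // Called once per link after restart: old pointer value -> new handler.
    void report(std::uintptr_t old_tag, Handler* fresh) { table_[old_tag] = fresh; }

    // Rebinds every event; returns how many tags had no reported handler.
    std::size_t apply(std::vector<PendingEvent>& events) const {
        std::size_t unresolved = 0;
        for (auto& ev : events) {
            auto it = table_.find(ev.handler_tag);
            if (it == table_.end()) {
                ++unresolved;
                continue;
            }
            ev.handler = it->second;
        }
        return unresolved;
    }

private:
    std::unordered_map<std::uintptr_t, Handler*> table_;
};

bool events_are_rebound() {
    Handler new_a{1};
    Handler new_b{2};

    FixupTable table;
    table.report(0x1000, &new_a);  // 0x1000 and 0x2000 are made-up old addresses
    table.report(0x2000, &new_b);

    std::vector<PendingEvent> events{
        {0x2000, nullptr}, {0x1000, nullptr}, {0x2000, nullptr}};

    const std::size_t unresolved = table.apply(events);
    assert(unresolved == 0);
    return events[0].handler == &new_b && events[1].handler == &new_a &&
           events[2].handler == &new_b;
}
```

Counting unresolved tags gives a simple way to detect a link that failed to report its handler before the run loop resumes.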

feldergast avatar Apr 15 '24 16:04 feldergast

Update on statistics checkpoint: Support for checkpointing statistics was merged in PR #1098.

gvoskuilen avatar Jul 16 '24 16:07 gvoskuilen