axlearn
Upgrades version of Orbax/TF/TFIO to support HNS atomic folder rename
Summary
This PR upgrades the Orbax, TensorFlow, and TensorFlow I/O (TFIO) dependencies to versions that support the new HNS-native RenameFolder API.
Changes
To leverage this feature for HNS buckets, you'd need to configure ocp.CheckpointManagerOptions with todelete_full_path="_trash".
- Impact: When max_to_keep is exceeded, old checkpoints are now atomically moved to a _trash subdirectory instead of being deleted.
Context & Motivation
- TensorFlow Support: TensorFlow has added support for the HNS RenameFolder API, allowing for recursive, atomic directory moves.
- Orbax Integration: Orbax now exposes a todelete_full_path option in CheckpointManagerOptions. When enabled, Orbax delegates to tf.io.gfile.rename to move old checkpoints to a trash directory rather than performing a slow, object-by-object deletion.
- Performance: On HNS buckets, renaming a folder is significantly faster than standard deletion.
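The rename-instead-of-delete rotation described above can be illustrated with a minimal sketch on a local filesystem. This is not Orbax's implementation: `rotate_by_rename`, `demo`, and the `step_<number>` directory naming are hypothetical, and `os.rename` stands in for the folder rename that tf.io.gfile.rename would perform on an HNS bucket.

```python
import os
import tempfile
from typing import List, Tuple


def rotate_by_rename(root: str, max_to_keep: int, trash_dir: str = "_trash") -> List[str]:
    """Moves the oldest step directories under `root` into `root/trash_dir`.

    Hypothetical helper for illustration only; returns the names that were moved.
    """
    # Assumption: checkpoint directories are named "step_<number>".
    steps = sorted(
        (d for d in os.listdir(root) if d.startswith("step_")),
        key=lambda d: int(d.split("_", 1)[1]),
    )
    # Everything except the newest `max_to_keep` steps is excess.
    excess = steps[:-max_to_keep] if max_to_keep > 0 else steps
    trash = os.path.join(root, trash_dir)
    os.makedirs(trash, exist_ok=True)
    for name in excess:
        # One rename per checkpoint directory, however many files it contains.
        os.rename(os.path.join(root, name), os.path.join(trash, name))
    return excess


def demo() -> Tuple[List[str], List[str], List[str]]:
    """Creates four fake checkpoints, keeps the newest two, and reports the result."""
    with tempfile.TemporaryDirectory() as root:
        for step in (100, 200, 300, 400):
            step_dir = os.path.join(root, f"step_{step}")
            os.makedirs(step_dir)
            for shard in range(3):
                with open(os.path.join(step_dir, f"shard_{shard}"), "w") as f:
                    f.write("x")
        moved = rotate_by_rename(root, max_to_keep=2)
        kept = sorted(d for d in os.listdir(root) if d.startswith("step_"))
        trashed = sorted(os.listdir(os.path.join(root, "_trash")))
        return moved, kept, trashed
```

In this sketch the moved directories simply remain under `_trash`; nothing here deletes them afterwards.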
Validation
Scale testing was conducted on AXLearn workloads using this configuration. Results confirmed that the rename operations were significantly faster than the previous deletion mechanism, reducing overhead during checkpoint rotation.
Configuration Snippet
options=ocp.CheckpointManagerOptions(
    create=True,
    max_to_keep=cfg.keep_last_n,
    enable_async_checkpointing=True,
    step_name_format=self._name_format,
    should_save_fn=save_fn_with_summaries,
    enable_background_delete=True,
    async_options=ocp.options.AsyncOptions(timeout_secs=cfg.async_timeout_secs),
    # New HNS optimization:
    todelete_full_path="_trash",
)
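Since the option is aimed at HNS buckets, a caller might prefer to set it conditionally. The following is a minimal sketch under that assumption: `checkpoint_manager_kwargs` and its `use_hns_trash` flag are hypothetical and not part of this PR or of AXLearn's config. The function only builds a keyword dict and does not import Orbax.

```python
from typing import Any, Dict


def checkpoint_manager_kwargs(keep_last_n: int, use_hns_trash: bool) -> Dict[str, Any]:
    """Builds a subset of keyword arguments for ocp.CheckpointManagerOptions.

    Hypothetical helper: `use_hns_trash` is an illustrative flag, not an existing setting.
    """
    kwargs: Dict[str, Any] = {
        "create": True,
        "max_to_keep": keep_last_n,
        "enable_background_delete": True,
    }
    if use_hns_trash:
        # Only request the rename-to-trash behavior when the bucket is HNS-enabled.
        kwargs["todelete_full_path"] = "_trash"
    return kwargs
```

The result would then be passed as `ocp.CheckpointManagerOptions(**checkpoint_manager_kwargs(...))`, alongside the remaining options from the snippet above.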