
If you copy checkpoints from HOME to GCS, they can get deleted

Open sshleifer opened this issue 2 years ago • 3 comments

This happens because of this line: https://github.com/google/orbax/blob/c7a7fd48ff094ac4167d3fb1c10f8b6c8de32b3a/checkpoint/orbax/checkpoint/utils.py#L669

The copied checkpoints don't have a success file but are in GCS, so orbax thinks they are temporary and cleans them up.
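To make the failure mode concrete, here is a minimal sketch of the kind of check described above. It is not orbax's actual code: `looks_finalized`, `on_gcs`, `marker_name`, and `tmp_token` are hypothetical names, and the non-GCS rule (a directory counts as finalized unless its name still carries a temporary token) is an assumption for illustration.

```python
from pathlib import Path


def looks_finalized(ckpt_dir: Path, on_gcs: bool, marker_name: str,
                    tmp_token: str = ".tmp") -> bool:
    """Illustrative check only; names and rules here are assumptions.

    On GCS, a checkpoint directory counts as finalized only if it contains
    the success marker. Elsewhere, it counts as finalized as long as its
    name does not contain the temporary token.
    """
    if on_gcs:
        return (ckpt_dir / marker_name).exists()
    return tmp_token not in ckpt_dir.name
```

Under this sketch, a directory saved locally passes the non-GCS check without any marker file, but the same directory fails the GCS check after being copied, so a cleanup pass could treat it as leftover temporary data.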

I would suggest either always or never saving the COMMIT_SUCCESS file.

This is not blocking me (it was easy to just write extra commit success files once I found this), but it felt like I should report it because the behavior was very unexpected and moving checkpoints around is super common.
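A rough sketch of that workaround, run on the local copy before uploading, might look like the following. `add_commit_markers` is a hypothetical helper, not part of orbax. The marker file name is deliberately left as a required argument because it should be taken from the orbax version in use, and the assumption that each immediate subdirectory of `root` is one checkpoint that needs a marker may not match every layout.

```python
from pathlib import Path


def add_commit_markers(root: Path, marker_name: str) -> list[Path]:
    """Create an empty marker file in each immediate subdirectory of root.

    Assumption: every immediate subdirectory is a complete checkpoint.
    Directories that already contain the marker are left untouched.
    Returns the list of marker files that were created.
    """
    created = []
    for step_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        marker = step_dir / marker_name
        if not marker.exists():
            marker.touch()
            created.append(marker)
    return created
```

Only run something like this on checkpoints that are known to be complete, since the marker tells the reader that the save finished successfully.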

sshleifer, Jul 08 '23 02:07

Thanks for the report. We currently have different behavior for ensuring atomicity on GCS vs. other filesystems. This was sort of a practice that we inherited from earlier code. I will look into standardizing this.

cpgaffney1, Jul 10 '23 17:07

+1 on this. I was having a lot of issues trying to load a checkpoint that was saved locally and copied to GCS, and orbax kept telling me that the checkpoint was incomplete because of the missing _COMMIT_SUCCESS_FILE file.

young-geng, Jun 18 '24 10:06

Update: our previous intention was to switch to the same logic everywhere, i.e. relying on atomic rename. It is not possible to rely on this for all filesystems, though, so we're instead intending to make it configurable, while defaulting to atomic rename for GCS and internal. This has a higher priority now, to better support cloud users. Hopefully we will get to it within a month.

cpgaffney1, Jun 18 '24 20:06