ModelCheckpoint improvement from "remove & write" to "write & remove"

devrimcavusoglu opened this issue 3 years ago · 14 comments

Lately I experienced an issue with model checkpointing, so I wanted to move it to a discussion, although I am unsure whether this is a "bug", and thus I opened it as a "question". To sum it up, when using model checkpointing with a certain configuration, where n_saved=1, there is a potential risk of losing the checkpoint due to the "remove first, and then write" logic.

Problem: You can configure model checkpointing to save a checkpoint at specific points during training. However, there is a potential risk of losing the checkpoint due to write errors, interruption, machine shut-down, outage, etc. The DiskSaver used by ModelCheckpoint first checks whether the checkpoint count exceeds n_saved, then removes the old/older checkpoints, and only then writes the new checkpoint. When n_saved=1 this turns into a plain "remove existing & write", and if the write process is corrupted for any reason, the older checkpoint is already deleted, so you simply waste your training resources and time. The option n_saved > 1 can leave many dangling and redundant model checkpoints, and with a large number of experiments, especially when experimenting with huge models on the cloud, these unnecessary files claim a large amount of storage space.
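
For concreteness, here is a minimal sketch of the kind of configuration I mean (the model, optimizer, data and directory below are just placeholders):

```python
import torch
import torch.nn as nn
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, DiskSaver

# Placeholder model, optimizer and a no-op training step, just to make the sketch runnable.
model = nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
trainer = Engine(lambda engine, batch: None)

# n_saved=1 keeps a single checkpoint on disk. With the current "remove & write"
# order, the old file is deleted before the new one is fully written, so a failure
# during the write can leave no usable checkpoint at all.
to_save = {"model": model, "optimizer": optimizer}
checkpoint_handler = Checkpoint(
    to_save,
    DiskSaver("./checkpoints", create_dir=True, require_empty=False),
    n_saved=1,
)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler)

trainer.run([0, 1, 2], max_epochs=2)
```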

Ideas: The trivial idea is to set n_saved > 1, but this has negative consequences that people generally want to avoid. The second idea is to replace the "remove first, and then write" logic with "write first, and then remove" logic. Are there any other practical ideas?

NOTE: This issue is opened to discuss the situation, not meant to imply any feature request.

devrimcavusoglu avatar May 04 '21 14:05 devrimcavusoglu

@devrimcavusoglu thanks for the question and for pointing that out! In the very beginning we had the "write and remove" way of doing things, and we changed it due to https://github.com/pytorch/ignite/pull/1117 and the related issue https://github.com/pytorch/ignite/issues/1056 (https://github.com/pytorch/ignite/issues/1056#issuecomment-639895281)

I understand your point and that it can be critical to lose the single model checkpoint because of that. DiskSaver also has an atomic option, where the "checkpoint is serialized to a temporary file, and then moved to final destination" (https://pytorch.org/ignite/handlers.html#ignite.handlers.DiskSaver). We should think about whether something could be done in this space. Otherwise, the user gets more checkpoints than they want (which is better than 0 :) ).
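
Roughly, atomic=True does something like the following (a simplified sketch of the idea, not the exact DiskSaver code):

```python
import os
import tempfile

import torch


def atomic_save(checkpoint: dict, path: str) -> None:
    # Serialize to a temporary file in the same directory, then move it onto the
    # final destination, so the target path never holds a partially written file.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(checkpoint, f)
        os.replace(tmp_path, path)  # atomic when source and destination are on the same filesystem
    except BaseException:
        os.remove(tmp_path)
        raise
```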

Maybe n_saved=2 could be a tradeoff between space and the potential risk of losing everything.

vfdev-5 avatar May 04 '21 16:05 vfdev-5

@vfdev-5 Thanks for the comment. I've looked over #1056 (and this comment specifically). I think the order of write/remove fell out of the scope of #1056. I feel that was an optional change.

With n_saved=1 and the "remove & write" order, the decision reduces to whether you end up with a corrupted file or no file at all in case of any corruption. Basically, with n_saved=1, the atomic option, and the "remove & write" way, if any corruption occurs you have no checkpoint file at output_dir, and without the atomic option you have a corrupted file at output_dir. In the former case you cannot use a checkpoint because you do not have any, and in the latter case you cannot use a checkpoint because you have a corrupted file. The end result is the same either way.

However, using the atomic option with the "write & remove" order really helps: before the earlier (non-corrupted) file is removed, the temp file does not even make it into the output dir, so you preserve a non-corrupted checkpoint from one step earlier.

It seems like with the current order (remove & write), the best you can do is use n_saved > 1 to avoid any issues with corrupted files.
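
In pseudo-code, the difference I am talking about is just the ordering (a rough sketch, not the actual Checkpoint/DiskSaver code):

```python
def remove_then_write(save, remove, old_path, new_path, checkpoint):
    # Current behaviour: the old checkpoint is gone before the new one exists.
    remove(old_path)
    save(checkpoint, new_path)  # if this fails, no checkpoint is left on disk


def write_then_remove(save, remove, old_path, new_path, checkpoint):
    # Proposed behaviour: the old checkpoint is removed only after the new one
    # has been written successfully (and with atomic=True the new file only
    # appears in the output dir once the write is complete).
    save(checkpoint, new_path)  # if this fails, the old checkpoint is still intact
    remove(old_path)
```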

To wrap up, with n_saved=1 the main disadvantage of

  • write & remove: is temporarily claiming more disk space (space for 2 checkpoints instead of 1).
  • remove & write: is losing all work in case of corruption.

n_saved > 1 may have some disadvantages as well, like loads of dangling/redundant checkpoints across many runs. I am just thinking out loud, so what do you think? @vfdev-5 @bmartinn

devrimcavusoglu avatar May 06 '21 10:05 devrimcavusoglu

Thanks for summarizing the consequences of each approach.

I think the order of write/remove fell out of the scope of #1056. I feel that was an optional change.

@devrimcavusoglu , I'd say that if we had not done that, clearml users would have 2 checkpoints instead of 1 with n_saved=1.

I agree with your "wrap up", that

remove & write: is losing all work in case of corruption.

could be more critical.

There are several solutions I can see:

  1. No codebase change: for unstable systems, use n_saved=2 and remove the last 2 checkpoints at the end of training to clean up.
  2. We can think of introducing a lazy removal op in DiskSaver, or even in its base class, such that the actual deletion happens after the save method (see the sketch after this list)...
  3. Add a specific case for ClearMLSaver in the Checkpoint class (which is not that great)...
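
For option 2, something along these lines could work; this is only a hypothetical sketch (the class name and details are made up, it is not an existing API):

```python
from ignite.handlers import DiskSaver


class LazyRemovalDiskSaver(DiskSaver):
    """Hypothetical saver that defers deletions until after the next successful save."""

    def __init__(self, dirname, **kwargs):
        super().__init__(dirname, **kwargs)
        self._pending_removals = []

    def remove(self, filename: str) -> None:
        # Queue the deletion instead of removing the file immediately.
        self._pending_removals.append(filename)

    def __call__(self, checkpoint, filename, metadata=None) -> None:
        # Write the new checkpoint first; only then flush the queued deletions,
        # so an interrupted write never costs us the previous checkpoint.
        super().__call__(checkpoint, filename, metadata)
        while self._pending_removals:
            super().remove(self._pending_removals.pop())
```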

vfdev-5 avatar May 06 '21 19:05 vfdev-5

I'd say that if we had not done that, clearml users would have 2 checkpoints instead of 1 with n_saved=1.

This is only temporary though, am I right? (Only until the former checkpoint is deleted. And how about using atomic, would they still see 2 checkpoints if they use the atomic option? Honestly, I don't know whether atomic plays a role that would help this case, just asking.)

Option 1 is a trivial solution to the issue, and I believe it is safer, but option 2 is a nicer solution that handles the situation comprehensively, though it requires some effort. I agree with you on option 3 not being a nice solution.

How would option 2 impact other components, or how much should we worry about dependent components if we implement option 2?

devrimcavusoglu avatar May 06 '21 23:05 devrimcavusoglu

This is only temporary though, am I right?

ClearML saver works with remote storage, and if I remember correctly they could not remove data from there, only overwrite it (maybe things have changed since then). This means that for n_saved=1 they have a single slot for the checkpoint, and by calling remove we signal in our code that the checkpoint can now be overwritten when save is called.

Option 2 may be problematic to implement correctly without introducing misleading methods... as we would sort of be adapting it for the usage inside Checkpoint...

Maybe we should have done lazy saving inside ClearMLSaver instead of changing the order, and now we are inventing how to patch the "remove & write" pattern.

vfdev-5 avatar May 06 '21 23:05 vfdev-5

Thanks a lot for this discussion.

I agree with your comments. Option 1 solves the issue. IMO option 2 is nicer and could be added to our todo list. However, as @vfdev-5 mentioned, the policy for handling the storage footprint is specific to the case n_saved=1 and should be handled with care. Option 3 lacks generality.

sdesrozis avatar May 07 '21 05:05 sdesrozis

Hi guys apologies for being so late to the discussion :)

ClearML saver works with remote storage, and if I remember correctly they could not remove data from there, only overwrite it (maybe things have changed since then).

Actually, starting from version v1.0.0 there is now an API to delete remote files (even on the default HTTP server). This means that ClearML will now be able to support option 2 as well. I actually like the fact that we avoid overwriting the remote files, because local file corruption can be thought of as relatively rare, but network issues are more common and might create a corrupt copy on the remote storage. I still believe that n_saved=2 is probably safer if you have an unstable system, but it is always good to have more options :)

Regarding the local storage policy, there is no way around it: if we want to increase reliability, then at a certain point you will have two copies of the model. I do not think this is a real issue; if you do not have enough free space for an extra model file, your system will be unstable regardless (temp space these days is a must for many parts of the OS).

bmartinn avatar May 07 '21 13:05 bmartinn

Ok so let's keep n_saved=2 as our safest way to checkpoint models 😊

sdesrozis avatar May 07 '21 15:05 sdesrozis

Actually, starting from version v1.0.0 there is now an API to delete remote files (even on the default HTTP server). This means that ClearML will now be able to support option 2 as well.

Well, this is great. Currently, I see two ways forward:

  a. No change in the codebase; proceed with n_saved=2 as the safest way, as @sdesrozis mentioned.
  b. Add support for option 2 to the TODO list.

If you decide to proceed with option (a), then I believe this needs to be documented somewhere. If you agree, we can close this issue after option (a) is documented; otherwise, we can proceed with option (b).

devrimcavusoglu avatar May 07 '21 15:05 devrimcavusoglu

Thanks for the info @bmartinn !

I think working on option 2 as

Maybe we should have done lazy saving inside ClearMLSaver instead of changing the order, and now we are inventing how to patch the "remove & write" pattern.

can be the next step:

  • Change back the order of "write" / "remove" in Checkpoint to have: write -> remove
  • Update the ClearMLSaver code so that it works with the above change for all clearml versions (if possible...)

By the way, @devrimcavusoglu would you be interested in helping with that by sending a draft PR that we could iterate on?

vfdev-5 avatar May 08 '21 23:05 vfdev-5

@vfdev-5 yeah, I'd like to. I can try to look at this in a while (probably in a week or two), but if any volunteer is ready to take this on right away, I don't want to block them.

devrimcavusoglu avatar May 10 '21 15:05 devrimcavusoglu

@devrimcavusoglu currently no one is assigned to the issue, so it is open :) I was also checking what we could do about

Update the ClearMLSaver code so that it works with the above change for all clearml versions (if possible...)

and my conclusions are not that bright... Maybe if we could get some insight on how to keep everything working while changing back the order of "write" / "remove"... Let's see.

vfdev-5 avatar May 10 '21 20:05 vfdev-5

Well, with the following, we can simply reverse the order to write/remove without implementing "lazy saving" (atomic would cover this, I believe).

Actually, starting from version v1.0.0 there is now an API to delete remote files (even on the default HTTP server).

After this change, and with n_saved=1, there may be no need to implement lazy saving for ClearMLSaver; we only need to highlight the atomic option in the docs as a way to prevent any corruption. So the atomic option simply becomes more effective, and it is a trivial solution (also for ClearMLSaver). However, for the following

for all clearml versions (if possible...)

I could not come up with any immediate solution yet.

devrimcavusoglu avatar May 10 '21 23:05 devrimcavusoglu

I opened a PR which can be a first draft for

Well, with the following, we can simply reverse the order to write/remove without implementing "lazy saving" (atomic would cover this, I believe).

which we can iterate over, although note that with this change clearml>=1.0.0 would now be required.

devrimcavusoglu avatar May 12 '21 01:05 devrimcavusoglu