
Can I freeze pytorchjob training pods and migrate them to other nodes?

Shuai-Xie opened this issue 3 years ago • 9 comments

Shuai-Xie avatar Sep 22 '21 13:09 Shuai-Xie

You can do it with checkpoints.

gaocegege avatar Sep 23 '21 02:09 gaocegege

Yes, @gaocegege. Checkpoints can do this job.

In this way, we have to define what and when to save.

  • what: users have to tell us what they want to record, e.g. epoch, model_state_dict, optimizer_state_dict, and so on.
  • when: this affects when we resume training and the total training cost of the task inside the pytorchjob.
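Concretely, the "what" is usually a checkpoint dict. Below is a minimal stdlib-only sketch of the idea; the nested dicts are stand-ins for real model.state_dict() / optimizer.state_dict() contents (which one would save with torch.save / torch.load), and the file name is illustrative:

```python
import os
import pickle
import tempfile

# The "what": everything needed to resume -- the epoch counter plus
# (stand-ins for) the model and optimizer state dicts.
checkpoint = {
    'epoch': 5,
    'model_state_dict': {'weight': [0.1, 0.2]},   # stand-in for model.state_dict()
    'optimizer_state_dict': {'lr': 0.01},         # stand-in for optimizer.state_dict()
}

path = os.path.join(tempfile.mkdtemp(), 'ckpt.pkl')
with open(path, 'wb') as f:
    pickle.dump(checkpoint, f)

# On the destination node: load the checkpoint and resume from the next epoch.
with open(path, 'rb') as f:
    ckpt = pickle.load(f)
start_epoch = ckpt['epoch'] + 1  # -> 6
```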

Is there any way to make this migration smoother and more seamless, like migrating a stateless service?

I mean,

  • we don't need users to tell us what they want to record.
  • the training process is identical to the training without migration.

Currently, I launch a thread that saves a checkpoint when the container lifecycle preStop hook sends a signal. But even then, users have to change their code to tell us what they want to record.

Thanks a lot.

Shuai-Xie avatar Sep 24 '21 06:09 Shuai-Xie

we don't need users to tell us what they want to record.

I do not think it is easy. As you know, live container migration is not mature yet. Tools like CRIU do not work well. Do you have any ideas about it?

gaocegege avatar Sep 24 '21 06:09 gaocegege

I have no idea either, and I agree with you that this is not easy.

Saving a checkpoint seems to be the practical way around this problem for now.

Also, this paper has discussed the problem: Gandiva: Introspective Cluster Scheduling for Deep Learning.

[image: excerpt from the Gandiva paper]

Shuai-Xie avatar Sep 24 '21 06:09 Shuai-Xie

It is invasive to the user code; personally, I do not think it is practical.

gaocegege avatar Sep 24 '21 06:09 gaocegege

Yes. When we provide a service, we don't want users to change their habits.

This problem seems unsolvable now.

However, if this requirement is necessary, we may have to design a friendlier Python library, using decorators or wrapper functions, so that users can easily register values as migratable.

For example,

model = migratedVariable(model)
optimizer = migratedVariable(optimizer)
...

Thanks a lot.

Shuai-Xie avatar Sep 24 '21 08:09 Shuai-Xie

It is a complicated issue, I think. I am glad to review your design proposal if you are interested in building such a library.

gaocegege avatar Sep 24 '21 08:09 gaocegege

Thanks a lot.

  • For mutable objects like a model, an optimizer, or a dict, this may be easy, since the registry can hold a live reference.
  • But for immutable values like int or float, for now, I don't know how to trace their updates properly.
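For the second bullet, one workaround (again only a sketch; `Boxed` is a hypothetical helper, not part of any existing library) is to box the immutable value in a small mutable holder, so the registry keeps a live reference instead of a stale copy:

```python
class Boxed:
    """Mutable holder for an immutable value (int, float, str, ...)."""

    def __init__(self, value):
        self.value = value

    def set(self, value):
        self.value = value

# A plain float registered directly would be a frozen snapshot; the box
# lets later updates stay visible to whatever saves the checkpoint.
best_acc = Boxed(0.0)
best_acc.set(0.83)  # the user updates the value through the box
```

Collecting such scalars into a single dict achieves the same effect, since the dict itself is mutable.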

Shuai-Xie avatar Sep 24 '21 09:09 Shuai-Xie

Hi @gaocegege, I've designed two migration solutions here: https://github.com/Shuai-Xie/pytorchjob-migration.

Both solutions use the preStop container hook and record the signal in a shared file.

This repo has two branches.

  • master: implements MigratableVariable with a wrapper function and a singleton class, which is more user-friendly and can be used freely across multiple Python modules.
  • develop: implements a Migrator class, an older version with limitations noted in the README.

To use the migration feature:

  • master
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import MigratableVariable
model = MigratableVariable(model)
optimizer = MigratableVariable(optimizer)
metrics = MigratableVariable(metrics)
  • develop
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import migrator
migrator.register('model', model)
migrator.register('optimizer', optimizer)
migrator.register('metrics', metrics)
migrator.listening()
if migrator.resume:  # note: migrate_ckpt has higher priority than args.ckpt
    migrator.load_ckpt()  # load the ckpt on all ranks

Could you please help me review the design?

Many thanks.

Shuai-Xie avatar Sep 26 '21 15:09 Shuai-Xie