pytorch-operator
Can I freeze pytorchjob training pods and migrate them to other nodes?
You can do it with a checkpoint.
Yes, @gaocegege. Checkpoints can do this job.
With this approach, we have to define what and when to save.
- what: users have to tell us what they want to record, e.g. `epoch`, `model_state_dict`, `optimizer_state_dict`, and so on (see the checkpoint sketch below).
- when: this affects when we resume training and the total training cost of the task inside the PyTorchJob.
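For reference, a minimal sketch of the kind of user-defined checkpoint this implies; the path and helper names here are illustrative, not part of pytorch-operator:

```python
import torch

CKPT_PATH = "/mnt/shared/ckpt.pth"  # illustrative shared-volume path

def save_ckpt(epoch, model, optimizer):
    # The user must explicitly list everything that should survive the migration.
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, CKPT_PATH)

def load_ckpt(model, optimizer):
    # On the new node, restore the registered state and resume from the saved epoch.
    ckpt = torch.load(CKPT_PATH, map_location='cpu')
    model.load_state_dict(ckpt['model_state_dict'])
    optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    return ckpt['epoch']
```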
Are there any ways to make this migration smoother and more seamless, like a stateless service?
I mean,
- we don't need users to tell us what they want to record.
- the training process is identical to the training without migration.
Currently, I launch a thread to save the checkpoint when the container lifecycle `preStop` hook sends a signal. But in this way, users have to change their code to tell us what they want to record.
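For context, a minimal sketch of that pattern, assuming the `preStop` command sends `SIGUSR1` to the training process (the hook wiring and the registered objects are placeholders):

```python
import signal
import threading
import torch

_save_event = threading.Event()

def _on_prestop(signum, frame):
    # Assumed: the preStop hook runs something like `kill -USR1 <pid>` against
    # the training process.
    _save_event.set()

signal.signal(signal.SIGUSR1, _on_prestop)

def start_checkpoint_watcher(model, optimizer, get_epoch, path='/mnt/shared/ckpt.pth'):
    # Background thread: blocks until the preStop signal arrives, then saves
    # whatever the user has told us to record (model/optimizer/epoch here).
    def _run():
        _save_event.wait()
        torch.save({
            'epoch': get_epoch(),
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
        }, path)
    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread
```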
Thanks a lot.
> we don't need users to tell us what they want to record.
I do not think it is easy. As you know, container dynamic migration is not mature now. Tools like CRIU do not work well. Do you have any idea about it?
I have no ideas either and agree with you that this is not easy.
Saving a checkpoint seems to be a way to work around this problem for now.
Also, this paper has discussed the problem: Gandiva: Introspective Cluster Scheduling for Deep Learning.
It is invasive to the user code; personally, I do not think it is practical.
Yes. When we provide a service, we don't want users to change their habits.
This problem seems unsolvable now.
However, if this requirement is necessary, we may have to design a friendlier Python library that uses decorators or combination functions to easily register values as migratable.
For example,

```python
model = migratedVariable(model)
optimizer = migratedVariable(optimizer)
...
```
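Purely as an illustration of the combination-function idea (this is not an existing pytorch-operator API; all names are hypothetical), such a wrapper could record the object in a process-wide registry that a checkpoint routine later walks:

```python
import torch

_REGISTRY = {}  # singleton registry of migratable objects

def migratedVariable(obj, name=None):
    # Combination function: remember the object so a migration routine can
    # later call .state_dict() on it (or save it directly), then return the
    # object unchanged so the user's training loop does not have to change.
    key = name or f'{type(obj).__name__}_{id(obj)}'
    _REGISTRY[key] = obj
    return obj

def dump_registry(path):
    # Save everything that was registered; could be called e.g. from a
    # preStop-triggered thread like the one sketched earlier.
    state = {k: (v.state_dict() if hasattr(v, 'state_dict') else v)
             for k, v in _REGISTRY.items()}
    torch.save(state, path)
```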
Thanks a lot.
It is a complicated issue, I think. I am glad to review your design proposal if you are interested in building such a library.
Thanks a lot.
- For pass-by-reference types like `model`, `optimizer`, or `dict`, this may be easy.
- But for pass-by-value types like `int` or `float`, for now, I don't know how to trace their values properly (see the snippet below).
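A tiny demonstration of why the scalar case is hard: Python ints and floats are immutable, so something like `epoch += 1` rebinds the name to a new object and any previously registered reference goes stale (the `Tracked` class below is only for illustration):

```python
class Tracked:
    """Toy wrapper that keeps a reference to a registered object."""
    def __init__(self, obj):
        self.obj = obj

metrics = {'epoch': 0}
epoch = 0

t_dict = Tracked(metrics)
t_int = Tracked(epoch)

# Mutating the dict is visible through the wrapper: both names share one object.
metrics['epoch'] = 5
print(t_dict.obj)        # {'epoch': 5}

# Rebinding the int creates a new object; the wrapper still holds the old value.
epoch += 1
print(epoch, t_int.obj)  # 1 0
```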
Hi @gaocegege, I've designed two migration solutions here: https://github.com/Shuai-Xie/pytorchjob-migration.
Both solutions use the `preStop` container hook and record the signal in a shared file.
This repo has two branches.
- `master`: implements `MigratableVariable` with a combination function and a singleton class, which is more user-friendly and can be used freely in multiple Python modules.
- `develop`: implements the `Migrator` class, which is an older version and has limitations noted in the README.
To use the migration feature:
- `master`

```python
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import MigratableVariable
model = MigratableVariable(model)
optimizer = MigratableVariable(optimizer)
metrics = MigratableVariable(metrics)
```
- `develop`

```python
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import migrator
migrator.register('model', model)
migrator.register('optimizer', optimizer)
migrator.register('metrics', metrics)
migrator.listening()

if migrator.resume:  # note: migrate_ckpt has higher priority than args.ckpt
    migrator.load_ckpt()  # load ckpt at all ranks
```
Could you please help me review the design?
Many Thanks.