pytorch-operator
Can I freeze pytorchjob training pods and migrate them to other nodes?
You can do it with a checkpoint.
Yes, @gaocegege. Checkpoints can do this job.
With this approach, we have to define what and when to save.
- what: users have to tell us what they want to record, e.g. `epoch`, `model_state_dict`, `optimizer_state_dict`, and so on (see the checkpoint sketch below).
- when: this affects when we resume training and the total training cost of the task inside the PyTorchJob.
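For reference, a minimal sketch of the kind of user-defined checkpoint this implies; the path and helper names here are illustrative, not part of pytorch-operator:

```python
import torch

CKPT_PATH = "/mnt/shared/ckpt.pth"  # illustrative shared-volume path

def save_ckpt(epoch, model, optimizer):
    # The user must explicitly list everything that should survive the migration.
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, CKPT_PATH)

def load_ckpt(model, optimizer):
    # On the new node, restore the registered state and resume from the saved epoch.
    ckpt = torch.load(CKPT_PATH, map_location='cpu')
    model.load_state_dict(ckpt['model_state_dict'])
    optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    return ckpt['epoch']
```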
Are there any ways to make this migration smoother and more seamless, like a stateless service?
I mean,
- we don't need users to tell us what they want to record.
- the training process is identical to the training without migration.
Currently, I launch a thread to save the checkpoint when the container lifecycle `preStop` hook sends a signal. But in this way, users have to change their code to tell us what they want to record.
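For context, a minimal sketch of that pattern, assuming the `preStop` command sends `SIGUSR1` to the training process (the hook wiring and the registered objects are placeholders):

```python
import signal
import threading
import torch

_save_event = threading.Event()

def _on_prestop(signum, frame):
    # Assumed: the preStop hook runs something like `kill -USR1 <pid>` against
    # the training process.
    _save_event.set()

signal.signal(signal.SIGUSR1, _on_prestop)

def start_checkpoint_watcher(model, optimizer, get_epoch, path='/mnt/shared/ckpt.pth'):
    # Background thread: blocks until the preStop signal arrives, then saves
    # whatever the user has told us to record (model/optimizer/epoch here).
    def _run():
        _save_event.wait()
        torch.save({
            'epoch': get_epoch(),
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
        }, path)
    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread
```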
Thanks a lot.
> we don't need users to tell us what they want to record.
I do not think it is easy. As you know, container dynamic migration is not mature now. Tools like CRIU do not work well. Do you have any idea about it?
I have no ideas either and agree with you that this is not easy.
Saving a checkpoint seems to be a way to work around this problem for now.
Also, this paper has discussed the problem: Gandiva: Introspective Cluster Scheduling for Deep Learning.
It is invasive to the user code; personally, I do not think it is practical.
Yes. When we provide a service, we don't want users to change their habits.
This problem seems unsolvable now.
However, if this requirement is necessary, we may have to design a friendlier Python library that uses decorators or combination functions to easily register values as migratable.
For example,

```python
model = migratedVariable(model)
optimizer = migratedVariable(optimizer)
...
```
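Purely as an illustration of the combination-function idea (this is not an existing pytorch-operator API; all names are hypothetical), such a wrapper could record the object in a process-wide registry that a checkpoint routine later walks:

```python
import torch

_REGISTRY = {}  # singleton registry of migratable objects

def migratedVariable(obj, name=None):
    # Combination function: remember the object so a migration routine can
    # later call .state_dict() on it (or save it directly), then return the
    # object unchanged so the user's training loop does not have to change.
    key = name or f'{type(obj).__name__}_{id(obj)}'
    _REGISTRY[key] = obj
    return obj

def dump_registry(path):
    # Save everything that was registered; could be called e.g. from a
    # preStop-triggered thread like the one sketched earlier.
    state = {k: (v.state_dict() if hasattr(v, 'state_dict') else v)
             for k, v in _REGISTRY.items()}
    torch.save(state, path)
```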
Thanks a lot.
It is a complicated issue, I think. I am glad to review your design proposal if you are interested in building such a library.
Thanks a lot.
- For pass-by-reference types like `model`, `optimizer`, or `dict`, this may be easy.
- But for pass-by-value types like `int` or `float`, for now, I don't know how to trace their values properly (see the snippet below).
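A tiny demonstration of why the scalar case is hard: Python ints and floats are immutable, so something like `epoch += 1` rebinds the name to a new object and any previously registered reference goes stale (the `Tracked` class below is only for illustration):

```python
class Tracked:
    """Toy wrapper that keeps a reference to a registered object."""
    def __init__(self, obj):
        self.obj = obj

metrics = {'epoch': 0}
epoch = 0

t_dict = Tracked(metrics)
t_int = Tracked(epoch)

# Mutating the dict is visible through the wrapper: both names share one object.
metrics['epoch'] = 5
print(t_dict.obj)        # {'epoch': 5}

# Rebinding the int creates a new object; the wrapper still holds the old value.
epoch += 1
print(epoch, t_int.obj)  # 1 0
```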
Hi @gaocegege, I've designed two migration solutions here: https://github.com/Shuai-Xie/pytorchjob-migration.
Both solutions use the `preStop` container hook and record the signal in a shared file.
This repo has two branches.
- `master`: implements `MigratableVariable` with a combination function and a singleton class, which is more user-friendly and can be used freely in multiple Python modules.
- `develop`: implements the `Migrator` class, which is an older version and has limitations noted in the README.
To use the migration feature:
- `master`

```python
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import MigratableVariable
model = MigratableVariable(model)
optimizer = MigratableVariable(optimizer)
metrics = MigratableVariable(metrics)
```
- `develop`

```python
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import migrator
migrator.register('model', model)
migrator.register('optimizer', optimizer)
migrator.register('metrics', metrics)
migrator.listening()

if migrator.resume:  # note: migrate_ckpt has higher priority than args.ckpt
    migrator.load_ckpt()  # load ckpt at all ranks
```
Could you please help me review the design?
Many Thanks.