Add training options

Open luckybit4755 opened this issue 3 years ago • 6 comments

Continue training existing model

Modified train_unconditional so the model is reloaded if it exists:

export COMMAND="python examples/train_unconditional.py --resolution 32 --num_epochs 10 --train_data_dir training-images --output_dir model"

% ${COMMAND}
creating fresh model
....
% ${COMMAND}
reloading model from model/unet
reloaded model from model/unet
....
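
A minimal sketch of the reload-if-present behaviour shown in the transcript above. This is not the PR's actual code: `load_or_create` and its `create` / `load` callbacks are hypothetical names, used so the example runs without diffusers installed.

```python
import os
from typing import Callable, TypeVar

M = TypeVar("M")


def load_or_create(model_dir: str, create: Callable[[], M], load: Callable[[str], M]) -> M:
    """Reload the model from model_dir if that directory exists, else build a fresh one.

    `create` and `load` are hypothetical callbacks supplied by the caller.
    """
    if os.path.isdir(model_dir):
        print(f"reloading model from {model_dir}")
        model = load(model_dir)
        print(f"reloaded model from {model_dir}")
        return model
    print("creating fresh model")
    return create()
```

In the training script, `load` would presumably be something like `UNet2DModel.from_pretrained` and `create` a function that builds the model from the command-line arguments.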

Checkpoint periodically on model save

Checkpoint a copy of the model on save:

% ${COMMAND} --checkpoint_model_epochs 10
...
checkpointed model/unet to model/checkpoints/checkpoint-2022-08-18+17-17-48
...

The argument should be n * save_model_epochs, where n is an integer > 0, or 0 to disable checkpointing (the default).

% tree model/checkpoints
model/checkpoints
└── checkpoint-2022-08-18+17-17-48
    ├── config.json
    └── diffusion_pytorch_model.bin

1 directory, 2 files
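
One way the checkpoint step could look, as a sketch only. The helper names `check_interval` and `checkpoint_model` are made up for illustration, and copying with `shutil.copytree` is an assumption about how the copy is made. The timestamp format is taken from the directory name in the listing above.

```python
import os
import shutil
import time


def check_interval(checkpoint_model_epochs: int, save_model_epochs: int) -> None:
    """Accept 0 (disabled) or a positive multiple of save_model_epochs."""
    if checkpoint_model_epochs == 0:
        return
    if checkpoint_model_epochs < 0 or checkpoint_model_epochs % save_model_epochs != 0:
        raise ValueError(
            "checkpoint_model_epochs must be 0 or a positive multiple of save_model_epochs"
        )


def checkpoint_model(model_dir: str, checkpoints_dir: str, now=None) -> str:
    """Copy model_dir to checkpoints_dir/checkpoint-<timestamp> and return the new path."""
    moment = now if now is not None else time.localtime()
    stamp = time.strftime("%Y-%m-%d+%H-%M-%S", moment)
    target = os.path.join(checkpoints_dir, f"checkpoint-{stamp}")
    # copytree creates missing parent directories and fails if target already exists
    shutil.copytree(model_dir, target)
    print(f"checkpointed {model_dir} to {target}")
    return target
```

Because the copy is taken from the directory the model was just saved to, the checkpoint interval only makes sense as a multiple of the save interval, which is what `check_interval` enforces.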

Timestamp test_samples

The test_samples are also timestamped to allow visual inspection over time:

% ${COMMAND} --timestamp_test_samples

They show up with names like:

  • test_samples-2022-08-18+17-49-52+000000
  • test_samples-2022-08-18+17-55-34+000009
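
A sketch of how such names could be built. It assumes the trailing six digits are the zero-padded epoch index, which is only a guess from the two examples above (epochs 0 and 9 of a 10-epoch run). `samples_dir_name` is a hypothetical helper.

```python
import time


def samples_dir_name(epoch: int, now=None) -> str:
    """Build a name like test_samples-2022-08-18+17-55-34+000009.

    Assumption: the final field is the epoch index, zero-padded to six digits.
    """
    moment = now if now is not None else time.localtime()
    stamp = time.strftime("%Y-%m-%d+%H-%M-%S", moment)
    return f"test_samples-{stamp}+{epoch:06d}"
```

Names in this form sort chronologically as plain strings, which makes it easy to page through the samples in order.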

Script for generating images:

% ./scripts/generate_images.py model 3
modification time on model is 2022-08-18+17-17-48
loading the model from model
loaded the model from model
creating image and saving to generated/model/2022-08-18+17-17-48/image-0000.png
100%|####| 1000/1000 [00:16<00:00, 60.02it/s]
image saved to generated/model/2022-08-18+17-17-48/image-0000.png
creating image and saving to generated/model/2022-08-18+17-17-48/image-0001.png
100%|####| 1000/1000 [00:17<00:00, 65.98it/s]
image saved to generated/model/2022-08-18+17-17-48/image-0001.png
creating image and saving to generated/model/2022-08-18+17-17-48/image-0002.png
100%|####| 1000/1000 [00:17<00:00, 65.31it/s]
image saved to generated/model/2022-08-18+17-17-48/image-0002.png
writing html to generated/model/2022-08-18+17-17-48/images.html
wrote html to generated/model/2022-08-18+17-17-48/images.html

The directory name is based on the model name and the modification time of its directory.

Successive runs will not clobber existing files but skip over them.
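
The naming and skip-existing behaviour can be sketched roughly as below. This is not the contents of generate_images.py: `output_dir_for`, `generate_missing`, and the `make_image` callback are hypothetical names, and the actual sampling step is left to the caller.

```python
import os
import time


def output_dir_for(model_dir: str, root: str = "generated") -> str:
    """Return root/<model name>/<modification time of model_dir>."""
    stamp = time.strftime("%Y-%m-%d+%H-%M-%S", time.localtime(os.path.getmtime(model_dir)))
    name = os.path.basename(os.path.normpath(model_dir))
    return os.path.join(root, name, stamp)


def generate_missing(out_dir: str, count: int, make_image) -> list:
    """Call make_image(path) for each of `count` images not already on disk."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for index in range(count):
        path = os.path.join(out_dir, f"image-{index:04d}.png")
        if os.path.exists(path):
            continue  # do not clobber images from an earlier run
        make_image(path)
        written.append(path)
    return written
```

Keying the output directory on the model's modification time means a retrained model gets a new directory, while repeated runs against the same model only fill in the images that are missing.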

luckybit4755 avatar Aug 18 '22 20:08 luckybit4755

@anton-l could you check here?

patrickvonplaten avatar Aug 23 '22 12:08 patrickvonplaten

I have not worked with transformers before, so I implemented checkpoints a little differently from what I see in run_glue_no_trainer.py.

In the interests of consistency, it might make sense to redo the entire checkpoint / reload mechanism along the lines of run_glue_no_trainer.

I'll happily mirror that logic if it will save you time so you can work on other things, but if you want to implement it yourself so it completely matches the project gestalt, I completely understand.

luckybit4755 avatar Aug 24 '22 13:08 luckybit4755

Hey @luckybit4755 sorry for the late reply! Yeah, feel free to copy the logic over, or I can include the changes into your PR if you don't mind :)

anton-l avatar Aug 31 '22 16:08 anton-l

Yeah, that's totally fine if you wouldn't mind. Sorry been a bit distracted as well.

luckybit4755 avatar Aug 31 '22 16:08 luckybit4755

Do we want to copy the "resume" training logic here or leave for now? cc @anton-l

patrickvonplaten avatar Sep 13 '22 12:09 patrickvonplaten

@anton-l closing this PR for now since there is no response

patrickvonplaten avatar Oct 07 '22 17:10 patrickvonplaten

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Nov 01 '22 15:11 github-actions[bot]