OSError: cannot open resource???

Maki9009 opened this issue 3 years ago • 17 comments

It trains the first 332, then when it's done it trains again to 189, and then I get an "OSError: cannot open resource".

Traceback (most recent call last):
  File "main.py", line 830, in <module>
    trainer.fit(model, data)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 231, in advance
    self.trainer._call_callback_hooks("on_train_batch_end", batch_end_outputs, batch, batch_idx, **extra_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1630, in _call_callback_hooks
    self._on_train_batch_end(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1660, in _on_train_batch_end
    callback.on_train_batch_end(self, self.lightning_module, outputs, batch, batch_idx, 0)
  File "/workspace/Dreambooth-Stable-Diffusion-main/main.py", line 456, in on_train_batch_end
    self.log_img(pl_module, batch, batch_idx, split="train")
  File "/workspace/Dreambooth-Stable-Diffusion-main/main.py", line 424, in log_img
    images = pl_module.log_images(batch, split=split, **self.log_images_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion-main/ldm/models/diffusion/ddpm.py", line 1343, in log_images
    xc = log_txt_as_img((x.shape[2], x.shape[3]), batch["caption"])
  File "/workspace/Dreambooth-Stable-Diffusion-main/ldm/util.py", line 25, in log_txt_as_img
    font = ImageFont.truetype('data/DejaVuSans.ttf', size=size)
  File "/opt/conda/lib/python3.7/site-packages/PIL/ImageFont.py", line 844, in truetype
    return freetype(font)
  File "/opt/conda/lib/python3.7/site-packages/PIL/ImageFont.py", line 841, in freetype
    return FreeTypeFont(font, size, index, encoding, layout_engine)
  File "/opt/conda/lib/python3.7/site-packages/PIL/ImageFont.py", line 194, in __init__
    font, size, index, encoding, layout_engine=layout_engine
OSError: cannot open resource

Maki9009 avatar Sep 10 '22 21:09 Maki9009

Download the DejaVuSans.ttf file from the internet and put it at Dreambooth-Stable-Diffusion/data/DejaVuSans.ttf.
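
For example (a sketch; the source path below is an assumption for a Debian-style image, so point it at wherever you actually downloaded the font):

    import os
    import shutil

    os.makedirs("data", exist_ok=True)
    # Source path is an assumption (Debian/Ubuntu dejavu fonts package);
    # replace it with the location of your downloaded DejaVuSans.ttf.
    shutil.copy("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
                "data/DejaVuSans.ttf")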

sausax avatar Sep 10 '22 21:09 sausax

Can you explain why? Plus, my directory isn't laid out like that.

Maki9009 avatar Sep 10 '22 22:09 Maki9009

You’ll have to make the folder if it doesn’t exist.

The reason you're doing this is that the program needs the font when it renders text onto some of the sample pictures, and the missing file causes the error.
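
If you'd rather make the script tolerate a missing font, one option is to wrap the lookup in ldm/util.py (a sketch; Pillow's built-in bitmap font is uglier but won't crash):

    # ldm/util.py, inside log_txt_as_img (line numbers may differ)
    from PIL import ImageFont

    try:
        font = ImageFont.truetype('data/DejaVuSans.ttf', size=size)
    except OSError:
        # Fall back to Pillow's built-in bitmap font if the .ttf is missing.
        font = ImageFont.load_default()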

nikopueringer avatar Sep 11 '22 01:09 nikopueringer

Alright, thanks... I did the training. I tried to run it, but it wouldn't load in the hlky repo; it just ends up with ^C. Have you tried using the new checkpoint on any other repo, or locally, to generate images, or did you generate them in DreamBooth too?

Maki9009 avatar Sep 11 '22 04:09 Maki9009

I moved my model into hlky. Should work fine. "^C" is the keyboard command Ctrl+C, which is the interrupt command for Linux/Python. You're probably pasting something weird into the command line?

nikopueringer avatar Sep 11 '22 05:09 nikopueringer

I didn't really change anything, I just gave it the location of the new model. But it was the model that crashed because of the OS error, so I've decided to retrain now. Your samples are really good; could you tell me how many regularization images you used, and how many images of yourself?

Maki9009 avatar Sep 11 '22 05:09 Maki9009

I used about 12 photos for regularization (I used "man" as my prompt and generated 12 512x512 images), and I used about the same number of photos of me, in various lighting conditions, angles, and expressions. I also used "man" as the class that I trained to. Currently doing a test with 100 regularization images.

I trained for 4000 iterations (the finetune unfrozen .yaml file is what you need to edit to train for more than 800 iterations). Look at the end of the file for the global iteration cutoff threshold.
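
Concretely, the cutoff is the max_steps value near the bottom of that config, something like this (a sketch; the exact nesting in your copy of the repo may differ):

    # configs/stable-diffusion/v1-finetune_unfrozen.yaml
    lightning:
      trainer:
        max_steps: 4000   # global iteration cutoff; the stock value is 800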

nikopueringer avatar Sep 11 '22 05:09 nikopueringer

Well, it stopped training again:

Average Peak memory 29986.93MiB
Epoch 2: 56%|▌| 180/322 [03:02<02:24, 1.01s/it, loss=0.358, v_num=0, train/los

Traceback (most recent call last):
  File "main.py", line 835, in <module>
    trainer.test(model, data)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 938, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 985, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run
    verify_loop_configurations(self)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 46, in verify_loop_configurations
    __verify_eval_loop_configuration(trainer, model, "test")
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 197, in __verify_eval_loop_configuration
    raise MisconfigurationException(f"No {loader_name}() method defined to run Trainer.{trainer_method}.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No test_dataloader() method defined to run Trainer.test.

IDK if it finished... or if it's an error. Also, for your really good results, you trained for 4000 iterations? How long did that take? I'm currently using an A100.

Maki9009 avatar Sep 11 '22 05:09 Maki9009

@Maki9009

If you want to train for a longer period of time, you can just replace trainer.test(model, data). Either that, or you can gate it with an argparse boolean, or just pass the check.

The error happens at line https://github.com/XavierXiao/Dreambooth-Stable-Diffusion/blob/bb8f4f2dc1d8d1b9ce4f705d03621e6ac8e50028/main.py#L835

Just replace it with:

print("I don't want to test :-(")

Or

pass
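
Or, if you'd rather keep the call and just skip it when it can't run, a sketch using the exception class from the traceback above:

    # main.py, around line 835
    from pytorch_lightning.utilities.exceptions import MisconfigurationException

    try:
        trainer.test(model, data)
    except MisconfigurationException:
        # This repo defines no test_dataloader(), so skip the final test pass.
        pass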

ExponentialML avatar Sep 11 '22 06:09 ExponentialML

@ExponentialML

So technically it finished training? Because it was still set at 800 iterations... I'm doing 2000 now. Would I still get the same error, since I haven't changed this yet?

Maki9009 avatar Sep 11 '22 06:09 Maki9009

@Maki9009 Yes, it still technically finished training. Removing this test call allows it to train for longer (it's not the option itself; it's a bug in this repo, which is missing some params), but you can possibly overfit what you're trying to train.

ExponentialML avatar Sep 11 '22 08:09 ExponentialML

I finished training... but I have two .ckpt files, one called "last.ckpt" and the other called "epoch=000001.ckpt". IDK which one I'm supposed to use. And if I want to apply it to hlky, all I need to do is point the repo to the model, right, nothing else?

Maki9009 avatar Sep 11 '22 08:09 Maki9009

The epoch file is a save point at epoch number 000001, and last.ckpt is the latest model saved when training finished. One use case: if you feel that last.ckpt has too much training, you can fall back to one of the epoch checkpoints. Either one is fine, but last.ckpt will probably have less editability but more identity preservation. It's your call which one is best for you.

ExponentialML avatar Sep 11 '22 08:09 ExponentialML

Use last.ckpt for the fully trained model. And yes, you can just point hlky at the model, or name it "model.ckpt" and replace the current model.ckpt you're using.

nikopueringer avatar Sep 11 '22 08:09 nikopueringer

Oh, and show us your results!

nikopueringer avatar Sep 11 '22 08:09 nikopueringer

The training samples are meh, but I kinda expected that, since my reg images are from clip front, which I've been told is not the best method. I'm currently trying to get it to run on free Colab, but it needs more RAM; it can't load the full model. So I'll try locally in a bit and hope the results look nice enough, or that it even runs.

Maki9009 avatar Sep 11 '22 09:09 Maki9009

Also, Emad said in the Discord chat that they'll be releasing guides this week, so HOPEFULLY it has some troubleshooting for this, because I'm basically setting Google's servers and RunPod's servers on fire.

Maki9009 avatar Sep 11 '22 09:09 Maki9009