
[Bug]: Validation Error

Open djp3k05 opened this issue 1 year ago • 6 comments

What happened?

Activated the Validation option and got this error (training stops after the error):

enumerating sample paths: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 500.16it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████████| 32/32 [00:05<00:00, 5.91it/s]
validation_step: 0%| | 0/9 [00:00<?, ?it/s]
step: 0%| | 0/2 [00:15<?, ?it/s, loss=0.0637, smooth loss=0.0637]
epoch: 0%| | 0/2000 [00:32<?, ?it/s]
Traceback (most recent call last):
  File "D:\SD\OneTrainer\modules\ui\TrainUI.py", line 553, in __training_thread_function
    trainer.train()
  File "D:\SD\OneTrainer\modules\trainer\GenericTrainer.py", line 701, in train
    self.__validate(train_progress)
  File "D:\SD\OneTrainer\modules\trainer\GenericTrainer.py", line 338, in __validate
    model_output_data = self.model_setup.predict(
  File "D:\SD\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 234, in predict
    text_encoder_output, pooled_text_encoder_2_output = model.encode_text(
  File "D:\SD\OneTrainer\modules\model\StableDiffusionXLModel.py", line 226, in encode_text
    text_encoder_1_output, _ = encode_clip(
  File "D:\SD\OneTrainer\modules\model\util\clip_util.py", line 23, in encode_clip
    text_encoder_output = text_encoder(
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 806, in forward
    return self.text_model(
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 698, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 218, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SD\OneTrainer\modules\module\AdditionalEmbeddingWrapper.py", line 42, in forward
    return F.embedding(
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

What did you expect would happen?

Training to continue normally with validation enabled.

Relevant log output

No response

Output of pip freeze

No response

djp3k05, Sep 10 '24 06:09

Setup: SDXL fine-tune with xformers, training unet + te1 + te2.

djp3k05, Sep 10 '24 10:09

I had a tensors-not-on-the-same-device error in the same place when I tried running with validation as well...

This is probably not the best way to fix it (I get OOMs and have to switch to sysmem fallback, which is not ideal), but I at least got it to start running by adding self.model.to(self.train_device) just above torch_gc() here in modules/trainer/GenericTrainer.py:

https://github.com/Nerogar/OneTrainer/blob/41f6b4307b8e87b0ab3d3e97ddc212f38d4977e9/modules/trainer/GenericTrainer.py#L313-L321

Hope this is helpful.
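For illustration, here is a minimal sketch of the idea behind that workaround: make sure the weights and the input ids are on the same device before the forward pass. This is not OneTrainer code. The names embedding, train_device and input_ids are stand-ins chosen for the example, and the small nn.Embedding only plays the role of a text encoder's token embedding.

```python
import torch
from torch import nn

# Hypothetical stand-in for a text encoder's token embedding layer.
embedding = nn.Embedding(num_embeddings=16, embedding_dim=4)

# Use the GPU when one is available; fall back to CPU so the sketch still runs.
train_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the weights to the device the inputs will live on. Skipping this step
# while the inputs sit on cuda:0 is what produces the "Expected all tensors to
# be on the same device" RuntimeError from the traceback.
embedding.to(train_device)
input_ids = torch.tensor([[1, 2, 3]], device=train_device)

# Validation needs no gradients.
with torch.no_grad():
    output = embedding(input_ids)

print(output.device.type, tuple(output.shape))
```

Moving the whole model to the train device costs VRAM, which matches the OOMs mentioned above, so treat it as a stopgap rather than a fix.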

Muskworker, Sep 21 '24 04:09

Please send your config and training_concepts\concepts.json.

seniorsolt, Sep 22 '24 17:09

Here's mine:

concepts.json, failing_validate_config.json

Edited to add: training starts fine with validation turned off.

Muskworker, Sep 22 '24 21:09

Seems like that's an issue with latent caching. Try disabling it, or use cuda as the temp device.
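As a rough sketch, the two suggested workarounds could be applied to an exported config like this. The key names latent_caching and temp_device are assumptions about the config JSON layout, and apply_validation_workaround is a hypothetical helper, so check the keys against your own config file before relying on it.

```python
import json


def apply_validation_workaround(config: dict, disable_caching: bool = True) -> dict:
    """Return a copy of the config with one of the two workarounds applied.

    Assumed keys (not confirmed in this thread): "latent_caching", "temp_device".
    """
    updated = dict(config)  # shallow copy; the caller's dict is left untouched
    if disable_caching:
        # Workaround 1: turn latent caching off.
        updated["latent_caching"] = False
    else:
        # Workaround 2: keep caching, but use cuda as the temp device.
        updated["temp_device"] = "cuda"
    return updated


# Illustrative config fragment, not taken from the attached files.
example = {"latent_caching": True, "temp_device": "cpu", "train_device": "cuda"}
print(json.dumps(apply_validation_workaround(example), indent=2))
```

Either change can also be made in the UI. Restart OneTrainer afterwards so the new settings take effect.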

seniorsolt, Sep 23 '24 12:09

Yes, disabling latent caching (and restarting OneTrainer) allowed it to run with validation in my case. 👍

Muskworker, Sep 23 '24 14:09

Please try to update. It should be fixed now.

Nerogar, Oct 13 '24 12:10