[Bug]: Validation Error
What happened?
I activated the Validation option and got this error (training stops after the error):
```
enumerating sample paths: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 500.16it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████████| 32/32 [00:05<00:00, 5.91it/s]
validation_step:   0%|          | 0/9 [00:00<?, ?it/s]
step:   0%|          | 0/2 [00:15<?, ?it/s, loss=0.0637, smooth loss=0.0637]
epoch:   0%|          | 0/2000 [00:32<?, ?it/s]
Traceback (most recent call last):
  File "D:\SD\OneTrainer\modules\ui\TrainUI.py", line 553, in __training_thread_function
    trainer.train()
  File "D:\SD\OneTrainer\modules\trainer\GenericTrainer.py", line 701, in train
    self.__validate(train_progress)
  File "D:\SD\OneTrainer\modules\trainer\GenericTrainer.py", line 338, in __validate
    model_output_data = self.model_setup.predict(
  File "D:\SD\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 234, in predict
    text_encoder_output, pooled_text_encoder_2_output = model.encode_text(
  File "D:\SD\OneTrainer\modules\model\StableDiffusionXLModel.py", line 226, in encode_text
    text_encoder_1_output, _ = encode_clip(
  File "D:\SD\OneTrainer\modules\model\util\clip_util.py", line 23, in encode_clip
    text_encoder_output = text_encoder(
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 806, in forward
    return self.text_model(
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 698, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 218, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SD\OneTrainer\modules\module\AdditionalEmbeddingWrapper.py", line 42, in forward
    return F.embedding(
  File "D:\SD\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
```
What did you expect would happen?
I expected training to run with validation enabled.
Relevant log output
No response
Output of pip freeze
No response
Config: fine-tune, SDXL, xformers, unet+te1+te2.
I hit the same tensors-not-on-the-same-device error in the same place when I tried running with validation as well...
Probably not the best way to fix it (I get OOMs and have to switch to sysmem fallback, which is not ideal), but I at least got training to start by adding `self.model.to(self.train_device)` just above `torch_gc()` here in modules/trainer/GenericTrainer.py:
https://github.com/Nerogar/OneTrainer/blob/41f6b4307b8e87b0ab3d3e97ddc212f38d4977e9/modules/trainer/GenericTrainer.py#L313-L321
Hope this is helpful.
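For illustration, a minimal sketch of the pattern behind that workaround: before a validation forward pass, move the whole model to the training device so weights and inputs end up on the same device. The `SmallModel` here is a stand-in, not OneTrainer's model; `self.model`, `self.train_device`, and `torch_gc()` in the comment above are OneTrainer internals that this sketch only mimics.

```python
import torch
import torch.nn as nn

def devices_of(module: nn.Module) -> set:
    """Collect the set of devices used by a module's parameters."""
    return {p.device for p in module.parameters()}

# Stand-in model with an embedding layer, like the CLIP token embedding
# that raised the RuntimeError in the traceback above.
model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 2))

# Equivalent of OneTrainer's self.train_device.
train_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# The workaround: ensure every parameter lives on the train device before
# validation runs (the comment above places this just before torch_gc()).
model.to(train_device)
assert devices_of(model) == {train_device}

# The embedding lookup now succeeds because input ids and embedding
# weights share a device; a CPU/CUDA mix here would raise the error above.
input_ids = torch.tensor([[1, 2, 3]], device=train_device)
out = model(input_ids)
```

The trade-off mentioned above is real: keeping the full model on the GPU for validation raises peak VRAM usage, which is why this workaround can trigger OOMs.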
Please send your config and training_concepts\concepts.json.
Here's mine:
concepts.json failing_validate_config.json
Edit: training starts fine with validation turned off.
Seems like an issue with latent caching; try disabling it or using cuda as the temp device.
Yes, disabling latent caching (and restarting OneTrainer) allowed it to run with validation in my case. 👍
Please try to update. It should be fixed now.