OneTrainer
[Bug]: Repeated error during training after 1 epoch and 10% of the next.
What happened?
For the last few months I've been hitting the same error every time I try to train a model (an SDXL finetune): training aborts after 1 full epoch, at 10% completion of the next. I've tried reinstalling from scratch, updating, experimenting with different parameters, etc. Nothing seems to work. This was not a problem before, and I have no idea what caused it to suddenly appear, but I can't train models past 1 epoch anymore, which is very frustrating. Any help would be greatly appreciated.
EDIT: The issue appears to be specific to saving the model output in the diffusers format. A run saving .safetensors output is now on epoch 4 and counting. I've had a few false positives before, so I'll update this again if the issue shows up in the next finetune I run, which should be within the week.
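For context, the traceback below dies in F.embedding inside AdditionalEmbeddingWrapper, meaning the token-embedding weight and the input ids ended up on different devices. Below is a minimal sketch of that failure mode, assuming (unconfirmed) that the diffusers-format save leaves the text encoder's embedding table on the CPU while the training batch stays on cuda:0; the vocab and hidden sizes are CLIP's, chosen purely for illustration:

```python
import torch
import torch.nn.functional as F

# Sketch of the suspected failure mode (an assumption, not a confirmed
# diagnosis): a mid-training save leaves the text encoder's token-embedding
# weight on the CPU, while the next batch's input ids are still on cuda:0.
weight = torch.randn(49408, 768)                               # table left on cpu
input_ids = torch.randint(0, 49408, (1, 77), device="cuda:0")  # training batch

# Raises: RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cpu and cuda:0!
F.embedding(input_ids, weight)
```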
What did you expect would happen?
The model should have trained for the full 10 epochs specified in the parameters.
Relevant log output
activating venv D:\Stable Diffusion\OneTrainer\venv
Using Python "D:\Stable Diffusion\OneTrainer\venv\Scripts\python.exe"
Clearing cache directory workspace-cache/cache_1! You can disable this if you want to continue using the same cache.
TensorFlow installation not found - running with reduced feature set.
model.safetensors: 1%|▍ | 21.0M/2.78G [00:01<02:55, 15.7MB/s]
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.17.0 at http://localhost:6006/ (Press CTRL+C to quit)
model.safetensors: 100%|██████████████████████████████████████████████████████████| 2.78G/2.78G [02:41<00:00, 17.2MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████| 607/607 [00:00<?, ?B/s]
diffusion_pytorch_model.safetensors: 100%|██████████████████████████████████████████| 335M/335M [00:19<00:00, 17.4MB/s]
diffusion_pytorch_model.safetensors: 100%|████████████████████████████████████████| 10.3G/10.3G [10:18<00:00, 16.6MB/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.39it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.39it/s]
D:\Stable Diffusion\OneTrainer\venv\src\diffusers\src\diffusers\models\attention_processor.py:1476: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
hidden_states = F.scaled_dot_product_attention(
caching: 100%|█████████████████████████████████████████████████████████████████████| 5707/5707 [14:27<00:00, 6.58it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████| 5707/5707 [01:43<00:00, 55.37it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:33<00:00, 1.50it/s]
step: 100%|█████████████████████████████████████| 5707/5707 [4:46:15<00:00, 3.01s/it, loss=0.00497, smooth loss=0.129]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:32<00:00, 1.53it/s]
Saving workspace/run\save\instagirl2024-08-16_18-43-53-save-5707-1-0
step: 0%| | 0/5707 [00:43<?, ?it/s]
epoch: 10%|██████▊ | 1/10 [5:03:12<45:28:55, 18192.87s/it]
Traceback (most recent call last):
  File "D:\Stable Diffusion\OneTrainer\modules\ui\TrainUI.py", line 543, in __training_thread_function
    trainer.train()
  File "D:\Stable Diffusion\OneTrainer\modules\trainer\GenericTrainer.py", line 575, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "D:\Stable Diffusion\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 280, in predict
    text_encoder_output, pooled_text_encoder_2_output = self.__encode_text(
  File "D:\Stable Diffusion\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 246, in __encode_text
    text_encoder_1_output = model.text_encoder_1(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 806, in forward
    return self.text_model(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 698, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 218, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\modules\module\AdditionalEmbeddingWrapper.py", line 42, in forward
    return F.embedding(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Saving models/instagirl
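If it helps with triage, here is a hypothetical debugging helper (not OneTrainer code) that scans a module for parameters left off the training device after a save; model.text_encoder_1 is the attribute named in the traceback, and model.unet is an assumed sibling attribute used only as an example:

```python
import torch.nn as nn

def report_off_device(module: nn.Module, expected: str = "cuda:0") -> None:
    # Print every parameter that is not on the expected training device.
    for name, param in module.named_parameters():
        if str(param.device) != expected:
            print(f"{name}: {param.device} (expected {expected})")

# Hypothetical usage right after a mid-training checkpoint save:
# report_off_device(model.text_encoder_1)
# report_off_device(model.unet)
```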
Output of pip freeze
No response