
[Bug]: Repeated error during training after 1 epoch and 10% of the next.

Open · velourlawsuits opened this issue 6 months ago · 3 comments

instagirl_config.json

What happened?

For the last few months, every time I try to train a model (an SDXL finetune) I hit the same error: training aborts after 1 full epoch, at 10% completion of the next epoch. I've tried reinstalling from scratch, updating, experimenting with different parameters, etc., and nothing seems to work. This was not previously a problem and I have no idea what caused it to suddenly appear, but I can't train models past 1 epoch anymore, which is very frustrating. Any help would be very much appreciated.

EDIT: It appears that the issue is specific to saving the model output in the diffusers format. I just ran a training job with .safetensors output that is now on epoch 4 and counting. I've had false positives a few other times, so I'll update this again if the issue shows up in the next finetune I run, which should be within the week.
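
For context on the failure itself: the traceback below fails inside torch.nn.functional.embedding because the embedding weight and the token ids end up on different devices. A minimal standalone sketch (not OneTrainer code; the shapes are placeholders for CLIP's token-embedding table and a tokenized prompt, and a CUDA device is assumed) reproduces the same RuntimeError:

import torch
import torch.nn.functional as F

# Placeholder for CLIP text encoder 1's token-embedding table, on the GPU.
weight = torch.randn(49408, 768, device="cuda:0")
# Token ids that were left on (or moved back to) the CPU.
input_ids = torch.zeros(1, 77, dtype=torch.long)

# Raises: RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cpu and cuda:0!
F.embedding(input_ids, weight)

The opposite direction (weight on the CPU, ids on the GPU) fails the same way, which would be consistent with a save pass offloading weights and not moving them back before the next step.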

What did you expect would happen?

The model should have trained for 10 epochs, as specified in the parameters.

Relevant log output

activating venv D:\Stable Diffusion\OneTrainer\venv
Using Python "D:\Stable Diffusion\OneTrainer\venv\Scripts\python.exe"
Clearing cache directory workspace-cache/cache_1! You can disable this if you want to continue using the same cache.
TensorFlow installation not found - running with reduced feature set.
model.safetensors:   1%|▍                                                         | 21.0M/2.78G [00:01<02:55, 15.7MB/s]Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.17.0 at http://localhost:6006/ (Press CTRL+C to quit)
model.safetensors: 100%|██████████████████████████████████████████████████████████| 2.78G/2.78G [02:41<00:00, 17.2MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████| 607/607 [00:00<?, ?B/s]
diffusion_pytorch_model.safetensors: 100%|██████████████████████████████████████████| 335M/335M [00:19<00:00, 17.4MB/s]
diffusion_pytorch_model.safetensors: 100%|████████████████████████████████████████| 10.3G/10.3G [10:18<00:00, 16.6MB/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.39it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.39it/s]D:\Stable Diffusion\OneTrainer\venv\src\diffusers\src\diffusers\models\attention_processor.py:1476: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = F.scaled_dot_product_attention(
caching: 100%|█████████████████████████████████████████████████████████████████████| 5707/5707 [14:27<00:00,  6.58it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████| 5707/5707 [01:43<00:00, 55.37it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:33<00:00,  1.50it/s]
step: 100%|█████████████████████████████████████| 5707/5707 [4:46:15<00:00,  3.01s/it, loss=0.00497, smooth loss=0.129]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:32<00:00,  1.53it/s]
Saving workspace/run\save\instagirl2024-08-16_18-43-53-save-5707-1-0                          | 0/5707 [00:00<?, ?it/s]
step:   0%|                                                                                   | 0/5707 [00:43<?, ?it/s]
epoch:  10%|██████▊                                                             | 1/10 [5:03:12<45:28:55, 18192.87s/it]
Traceback (most recent call last):
  File "D:\Stable Diffusion\OneTrainer\modules\ui\TrainUI.py", line 543, in __training_thread_function
    trainer.train()
  File "D:\Stable Diffusion\OneTrainer\modules\trainer\GenericTrainer.py", line 575, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "D:\Stable Diffusion\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 280, in predict
    text_encoder_output, pooled_text_encoder_2_output = self.__encode_text(
  File "D:\Stable Diffusion\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 246, in __encode_text
    text_encoder_1_output = model.text_encoder_1(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 806, in forward
    return self.text_model(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 698, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 218, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\modules\module\AdditionalEmbeddingWrapper.py", line 42, in forward
    return F.embedding(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Saving models/instagirl
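
If the diffusers save path temporarily offloads tensors to the CPU and the next training step starts before everything is moved back, a guard at the failing call would at least localize the problem. This is a hypothetical sketch, not OneTrainer's actual AdditionalEmbeddingWrapper code; the function and parameter names are assumptions:

import torch
import torch.nn.functional as F

def guarded_embedding(input_ids: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # If a save/offload pass left the embedding table on a different device
    # than the incoming token ids, move it back before the lookup instead of
    # crashing mid-epoch.
    if weight.device != input_ids.device:
        weight = weight.to(input_ids.device)
    return F.embedding(input_ids, weight)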

Output of pip freeze

No response

velourlawsuits · Aug 17 '24 01:08