fail to train

Open ahshrimp opened this issue 2 years ago • 5 comments

I was able to train the model successfully two weeks ago, but when I tried to train again today, it failed with an OOM error. Any idea how to fix it? Thanks.

Traceback (most recent call last):
  File "E:\git\kohya_ss\train_db.py", line 364, in <module>
    train(args)
  File "E:\git\kohya_ss\train_db.py", line 277, in train
    accelerator.backward(loss)
  File "E:\git\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1314, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "E:\git\kohya_ss\venv\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\git\kohya_ss\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "E:\git\kohya_ss\venv\lib\site-packages\torch\autograd\function.py", line 253, in apply
    return user_fn(self, *args)
  File "E:\git\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py", line 130, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "E:\git\kohya_ss\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 643, in custom_forward
    return module(*inputs, output_attentions)
  File "E:\git\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\git\kohya_ss\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 393, in forward
    hidden_states = self.mlp(hidden_states)
  File "E:\git\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\git\kohya_ss\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 348, in forward
    hidden_states = self.fc1(hidden_states)
  File "E:\git\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\git\kohya_ss\venv\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 7.22 GiB already allocated; 0 bytes free; 7.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
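
The error message suggests max_split_size_mb only when reserved memory is much larger than allocated memory. As a rough check, the sketch below parses the two figures from the message and prints the gap, then shows how PYTORCH_CUDA_ALLOC_CONF could be set from Python. The function name and the value 128 are illustrative assumptions, not something taken from this thread or from kohya_ss. With a gap this small, fragmentation may not be the main cause here, so the setting might not help much.

```python
import os
import re

# Message text copied from the error above.
OOM_MESSAGE = (
    "Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; "
    "7.22 GiB already allocated; 0 bytes free; "
    "7.32 GiB reserved in total by PyTorch)"
)


def reserved_minus_allocated_gib(message: str) -> float:
    """Parse the allocated and reserved figures and return their difference in GiB."""
    allocated = float(re.search(r"([\d.]+) GiB already allocated", message).group(1))
    reserved = float(re.search(r"([\d.]+) GiB reserved", message).group(1))
    return reserved - allocated


gap = reserved_minus_allocated_gib(OOM_MESSAGE)
print(f"reserved - allocated = {gap:.2f} GiB")

# The allocator setting has to be in the environment before PyTorch
# initializes CUDA, so set it before importing torch.
# 128 is an illustrative value only (assumption).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

The same variable can also be set in the terminal before launching the trainer, which avoids touching any script.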

ahshrimp avatar Mar 17 '23 18:03 ahshrimp

You could try reverting to the release you used two weeks ago and confirm that everything works. Kohya constantly updates his trainer code, and those updates can change how many resources it uses.

To revert to a previous release use the following commands:

git checkout <release name>
upgrade.ps1

This will bring you back to that release's code base.

To go back to the current code, run:

git checkout master
upgrade.ps1

bmaltais avatar Mar 17 '23 19:03 bmaltais

I tried entering the above code in Git CMD, CMD, and PowerShell... none of which worked. Here is the code I entered:

git checkout 21.1.0
upgrade.ps1

Additionally, in both Git CMD and PowerShell, I get the following message when I enter it:

PS S:\kohya_ss> git checkout 21.1.0
>> upgrade.ps1
fatal: not a git repository (or any of the parent directories): .git
upgrade.ps1 : The term 'upgrade.ps1' is not recognized as the name of a cmdlet, function, script file, or operable
program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:2 char:1
+ upgrade.ps1
+ ~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (upgrade.ps1:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

PS S:\kohya_ss>
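
The "fatal: not a git repository" line in the output above means git found no .git entry in S:\kohya_ss or in any folder above it, which can happen when a folder was downloaded as a ZIP instead of being cloned. Here is a minimal sketch of that same upward search, assuming it is run from the kohya_ss folder. The function name find_git_root is hypothetical, and real git also honors settings such as GIT_DIR that this ignores.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Walk from `start` up through its parents; return the first folder holding `.git`."""
    start = start.resolve()
    for folder in (start, *start.parents):
        # `.git` is normally a directory, but it can be a file (e.g. in worktrees).
        if (folder / ".git").exists():
            return folder
    return None


if __name__ == "__main__":
    root = find_git_root(Path.cwd())
    print(root if root else "no .git found here or in any parent directory")
```

If this finds nothing, git checkout cannot work in that folder regardless of which terminal is used.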

Deejay85 avatar Mar 18 '23 18:03 Deejay85

You did not add the v in front of the release name. Try git checkout v21.1.0
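
To illustrate the naming point, here is a small sketch that matches a requested release against a list of tags, trying the name with and without the leading v. The function name and the tag list are hypothetical. Only v21.1.0 comes from this thread, and in a real clone the list would come from git tag --list.

```python
from typing import Iterable, Optional


def match_release_tag(requested: str, tags: Iterable[str]) -> Optional[str]:
    """Return the tag matching `requested`, trying it with and without a leading 'v'."""
    available = set(tags)
    bare = requested[1:] if requested.startswith("v") else requested
    for candidate in (requested, "v" + bare, bare):
        if candidate in available:
            return candidate
    return None


# Hypothetical tag list for illustration; only v21.1.0 is mentioned in the thread.
example_tags = ["v21.0.0", "v21.1.0", "v21.2.0"]
print(match_release_tag("21.1.0", example_tags))
```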

bmaltais avatar Mar 18 '23 22:03 bmaltais

Nope...already tried that, and here is the corresponding error message.

upgrade.ps1 : The term 'upgrade.ps1' is not recognized as the name of a cmdlet, function, script file, or operable
program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:2 char:1
+ upgrade.ps1
+ ~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (upgrade.ps1:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

Deejay85 avatar Mar 19 '23 02:03 Deejay85

Are you running PowerShell or the old CMD terminal? The GUI expects a PowerShell terminal to run properly. CMD sort of works but can be hit and miss, and the upgrade.ps1 error makes me think you are not using a PowerShell terminal environment.

bmaltais avatar Mar 20 '23 11:03 bmaltais