
AssertionError: No inf checks were recorded for this optimizer.

Open MrJumbleo opened this issue 1 year ago • 9 comments

Any ideas? I've tried to run this a bunch of different ways and this is all I get, even with the vanilla default settings.

```
loss is nan
Traceback (most recent call last):
  File "C:\Users\Downloads\ostris\ai-toolkit\venv\lib\site-packages\gradio\queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "C:\Users\Downloads\ostris\ai-toolkit\venv\lib\site-packages\gradio\route_utils.py", line 321, in call_process_api
    output = await app.get_blocks().process_api(
  File "C:\Users\Downloads\ostris\ai-toolkit\venv\lib\site-packages\gradio\blocks.py", line 1935, in process_api
    result = await self.call_function(
  File "C:\Users\Downloads\ostris\ai-toolkit\venv\lib\site-packages\gradio\blocks.py", line 1520, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "C:\Users\Downloads\ostris\ai-toolkit\venv\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "C:\Users\Downloads\ostris\ai-toolkit\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
  File "C:\Users\Downloads\ostris\ai-toolkit\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 859, in run
    result = context.run(func, *args)
  File "C:\Users\Downloads\ostris\ai-toolkit\venv\lib\site-packages\gradio\utils.py", line 826, in wrapper
    response = f(*args, **kwargs)
  File "C:\Users\Downloads\ostris\ai-toolkit\flux_train_ui.py", line 230, in start_training
    job.run()
  File "C:\Users\Downloads\ostris\ai-toolkit\jobs\ExtensionJob.py", line 22, in run
    process.run()
  File "C:\Users\Downloads\ostris\ai-toolkit\jobs\process\BaseSDTrainProcess.py", line 1709, in run
    loss_dict = self.hook_train_loop(batch_list)
  File "C:\Users\Downloads\ostris\ai-toolkit\extensions_built_in\sd_trainer\SDTrainer.py", line 1583, in hook_train_loop
    self.scaler.step(self.optimizer)
  File "C:\Users\Downloads\ostris\ai-toolkit\venv\lib\site-packages\torch\amp\grad_scaler.py", line 451, in step
    len(optimizer_state["found_inf_per_device"]) > 0
AssertionError: No inf checks were recorded for this optimizer.
```
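For reference, the assertion itself comes from torch.amp.GradScaler: step() asserts when none of the optimizer's parameters have a gradient at that point, because unscale_() then records no inf/NaN checks. That lines up with the "loss is nan" message right before it, if the backward pass never produces usable gradients. A minimal sketch outside ai-toolkit (toy model, CUDA required) that triggers the same error:

```python
import torch

# Toy stand-ins; nothing here is ai-toolkit code.
model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.amp.GradScaler("cuda")

with torch.autocast("cuda", dtype=torch.float16):
    loss = model(torch.randn(2, 4, device="cuda")).sum()

# Normally scaler.scale(loss).backward() runs here and fills param.grad.
# If that is skipped (for example after a NaN loss), every param.grad is still
# None, unscale_() records no inf checks, and step() raises:
#   AssertionError: No inf checks were recorded for this optimizer.
scaler.step(optimizer)
```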

MrJumbleo avatar Sep 05 '24 02:09 MrJumbleo

I have something similar.

```
1024x1024: 18 files
1 buckets made
Caching latents for J:\input\models\testlora
  • Saving latents to disk
Caching latents to disk: 100%|███████████████████████████████████████████████████████| 18/18 [00:02<00:00, 8.57it/s]
Generating baseline samples before training
Generating Images:   0%|          | 0/11 [00:00<?, ?it/s]
H:\ai-toolkit\venv\lib\site-packages\diffusers\image_processor.py:111: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")
testlora:   0%|          | 0/2000 [00:00<?, ?it/s]
loss is nan
Error running job: No inf checks were recorded for this optimizer.
```

```
========================================
Result:
 - 0 completed jobs
 - 1 failure
========================================
Traceback (most recent call last):
  File "H:\ai-toolkit\run.py", line 90, in <module>
    main()
  File "H:\ai-toolkit\run.py", line 86, in main
    raise e
  File "H:\ai-toolkit\run.py", line 78, in main
    job.run()
  File "H:\ai-toolkit\jobs\ExtensionJob.py", line 22, in run
    process.run()
  File "H:\ai-toolkit\jobs\process\BaseSDTrainProcess.py", line 1709, in run
    loss_dict = self.hook_train_loop(batch_list)
  File "H:\ai-toolkit\extensions_built_in\sd_trainer\SDTrainer.py", line 1583, in hook_train_loop
    self.scaler.step(self.optimizer)
  File "H:\ai-toolkit\venv\lib\site-packages\torch\amp\grad_scaler.py", line 450, in step
    assert (
AssertionError: No inf checks were recorded for this optimizer.
testlora:   0%|          | 0/2000 [00:01<?, ?it/s]
```

(venv) H:\ai-toolkit>

This is driving me nuts... The dataset I am using is 18 1024x1024 photos of a subject, named testlora1.png / testlora1.txt through testlora18.png / testlora18.txt. I have a Gen 5 M.2, 64 GB of memory OC'd to 6600 (stable), an i9-14900KF, and a ROG Strix RTX 4090 24 GB. I have CUDA 12.6 and cuDNN 9.4 installed. My venv is also set up with Python 3.10.6.

Renamed my testlora.yaml to testlora.txt to upload it... testlora.txt

I am sure it is something I am doing... I am just not sure what. It probably has to do with your warning about not running it on Windows, perhaps...

Lutraphobia avatar Sep 08 '24 21:09 Lutraphobia

I have something similar.

```
Caching latents to disk: 100%|█████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 10.36it/s]
Skipping first sample due to config setting
xiaxia:   0%|          | 0/2000 [00:00<?, ?it/s]
loss is nan
Error running job: No inf checks were recorded for this optimizer.
```

```
========================================
Result:
 - 0 completed jobs
 - 1 failure
========================================
Traceback (most recent call last):
  File "D:\soft\ai-toolkit\run.py", line 90, in <module>
    main()
  File "D:\soft\ai-toolkit\run.py", line 86, in main
    raise e
  File "D:\soft\ai-toolkit\run.py", line 78, in main
    job.run()
  File "D:\soft\ai-toolkit\jobs\ExtensionJob.py", line 22, in run
    process.run()
  File "D:\soft\ai-toolkit\jobs\process\BaseSDTrainProcess.py", line 1709, in run
    loss_dict = self.hook_train_loop(batch_list)
  File "D:\soft\ai-toolkit\extensions_built_in\sd_trainer\SDTrainer.py", line 1583, in hook_train_loop
    self.scaler.step(self.optimizer)
  File "D:\soft\ai-toolkit\venv\lib\site-packages\torch\amp\grad_scaler.py", line 450, in step
    assert (
AssertionError: No inf checks were recorded for this optimizer.
xiaxia:   0%|          | 0/2000 [00:01<?, ?it/s]
```

sh131007545 avatar Sep 10 '24 09:09 sh131007545

Still no clue

To add to my initial post: CPU 5900X, GPU 3090, RAM 48 GB, storage for days.

I tried running with Python 3.10.6 and a few different versions of Python 3.11. I also tried swapping in an established dataset to make sure it was not some weird file issue.

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:36:15_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.1, V12.1.105
```

MrJumbleo avatar Sep 11 '24 00:09 MrJumbleo

I have the same issue.

GPU 4090 RAM 48GB

If you save the first samples, all of the baseline images from the pre-trained model come out black as well. I noticed that if I set quantize to false, at least I am able to get baseline images, but the whole thing becomes much slower.

I will investigate more tomorrow.
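For anyone who wants to try the same check: quantization is toggled in the model section of the training YAML. The keys below follow the example Flux LoRA config shipped with ai-toolkit; treat the exact names as an assumption if your config version differs:

```yaml
model:
  name_or_path: "black-forest-labs/FLUX.1-dev"
  is_flux: true
  quantize: false  # true uses the 8-bit (qfloat8) path; false avoided the NaNs here but needs more VRAM and runs slower
```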

omidsakhi avatar Sep 11 '24 04:09 omidsakhi

> I have the same issue.
>
> GPU 4090 RAM 48GB
>
> If you save the first samples, all of the baseline images from the pre-trained model come out black as well. I noticed that if I set quantize to false, at least I am able to get baseline images, but the whole thing becomes much slower.
>
> I will investigate more tomorrow.

Thank you sir

MrJumbleo avatar Sep 11 '24 04:09 MrJumbleo

So, it seems that there is an issue with optimum.quanto's way of implementing qfloat8 (which is qfloat8_e4m3fn). The workaround that has worked for me so far is to switch the quantization from qfloat8 to qint8:

by importing qint8 at https://github.com/ostris/ai-toolkit/blob/main/toolkit/stable_diffusion_model.py#L62 and changing qfloat8 to qint8 at these two locations:

https://github.com/ostris/ai-toolkit/blob/main/toolkit/stable_diffusion_model.py#L593 https://github.com/ostris/ai-toolkit/blob/main/toolkit/stable_diffusion_model.py#L617

That seems to let the training proceed, and VRAM usage tops out at 21 GB.
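A rough sketch of what the edit amounts to (the `transformer` module below is just a placeholder; in stable_diffusion_model.py it is the actual loaded model):

```python
import torch
from optimum.quanto import freeze, qint8, quantize  # the file originally imports qfloat8

# Placeholder module standing in for the model that gets quantized.
transformer = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.GELU(),
    torch.nn.Linear(64, 64),
)

quantize(transformer, weights=qint8)  # was: weights=qfloat8
freeze(transformer)
```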

I would love to hear other people's opinions on this, to know whether it works for you and what the ramifications of the change are.

omidsakhi avatar Sep 11 '24 16:09 omidsakhi

@omidsakhi I made the changes in stable_diffusion_model.py as you described, deleted all __pycache__ files, and ran again.

Unfortunately no change.

I still get...

Skipping first sample due to config setting
test_flux_lora_v1:   0%|                                                                                      | 0/3000 [00:00<?, ?it/s]loss is nan
Error running job: No inf checks were recorded for this optimizer.

... on my GPU 4090 RAM 24GB on Ubuntu 22.04 Linux

byteconcepts avatar Sep 15 '24 10:09 byteconcepts

I had this issue. I can't confirm it for sure, but I think it may have been because I left the optimizer file in the output folder and only deleted the generated LoRAs; the issue started after changing the LoRA dim/rank, and after deleting the optimizer file it worked fine the next time.

Can't confirm it wasn't a fluke though.

yemmlie avatar Nov 09 '24 20:11 yemmlie

> So, it seems that there is an issue with optimum.quanto's way of implementing qfloat8 (which is qfloat8_e4m3fn). The workaround that has worked for me so far is to switch the quantization from qfloat8 to qint8:
>
> by importing qint8 at https://github.com/ostris/ai-toolkit/blob/main/toolkit/stable_diffusion_model.py#L62 and changing qfloat8 to qint8 at these two locations:
>
> https://github.com/ostris/ai-toolkit/blob/main/toolkit/stable_diffusion_model.py#L593 https://github.com/ostris/ai-toolkit/blob/main/toolkit/stable_diffusion_model.py#L617
>
> That seems to let the training proceed, and VRAM usage tops out at 21 GB.
>
> I would love to hear other people's opinions on this, to know whether it works for you and what the ramifications of the change are.

This solution worked for me. I changed all references to qfloat8 in stable_diffusion_model.py to qint8. I believe it was 5 locations: the import and 4 other spots.

the320x200 avatar Nov 18 '24 00:11 the320x200