
[Bug]: Artifacts on left and bottom sides of images with SD3.5M LoRAs

Open xzuyn opened this issue 11 months ago • 13 comments

What happened?

I'm seeing artifacts with LoRAs I trained on SD3.5M, but only in images generated outside of OneTrainer. The sample images generated during training look perfectly fine, but the ones generated in ComfyUI (even with the same settings) are mangled.

I'm unsure whether I should be opening an issue here or with ComfyUI, since I tested somebody else's SD3.5M LoRA and that one worked perfectly fine, even with the same long prompts.

I've tried playing around with different settings, adding EMA, and using different optimizers, and nothing has helped.

At first I thought it might be because some of my training samples had captions longer than 256 T5 tokens, so I removed those samples and tried again. The problem still occurred. I was then told that SD3.5 supposedly doesn't support a 256 max token length, and that its real max for T5 is only 154, so I tried again with all captions exceeding that removed. Again, the same problem. Even if the problem were the max token length, that wouldn't explain why the images generated by the trainer look fine anyway.
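For reference, a minimal sketch of how to check caption token counts against a limit, assuming plain-text caption files next to the images and the T5-XXL tokenizer that SD3.5 uses (the `dataset` path and the 154 limit are placeholders to adjust):

```python
# Count T5 tokens per caption and flag any over a limit.
# Assumes the google/t5-v1_1-xxl tokenizer, which matches SD3.5's T5 text
# encoder; adjust MAX_TOKENS (154 or 256) and the dataset path as needed.
from pathlib import Path
from transformers import AutoTokenizer

MAX_TOKENS = 154

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")

for caption_file in Path("dataset").glob("*.txt"):
    caption = caption_file.read_text().strip()
    n_tokens = len(tokenizer(caption).input_ids)
    if n_tokens > MAX_TOKENS:
        print(f"{caption_file.name}: {n_tokens} tokens (over limit)")
```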

So now I'm out of ideas on any mistakes I could have made. Any ideas?


Samples generated in ComfyUI using a LoRA I created with OneTrainer. Prompts are basically all more than 120 tokens.


Samples generated during training (left) vs samples generated in ComfyUI (right), both with the same settings.


What did you expect would happen?

I expected the LoRA to "just work".

Relevant log output


Output of pip freeze

absl-py==2.1.0
accelerate==1.0.1
aiodns==3.2.0
aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiohttp-retry==2.9.1
aiosignal==1.3.2
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.8.0
attrs==24.3.0
backoff==2.2.1
bcrypt==4.2.1
bitsandbytes @ file:///media/xzuyn/NVMe/LClones/OneTrainer/bnb
boto3==1.36.1
botocore==1.36.1
Brotli==1.1.0
certifi==2024.12.14
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.3.1
cryptography==43.0.3
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
decorator==5.1.1
Deprecated==1.2.15
-e git+https://github.com/huggingface/diffusers.git@55ac1dbdf2e77dcc93b0fa87d638d074219922e4#egg=diffusers
dnspython==2.7.0
einops==0.8.0
email_validator==2.2.0
fabric==3.2.2
fastapi==0.115.6
fastapi-cli==0.0.7
filelock==3.16.1
flatbuffers==24.12.23
fonttools==4.55.3
frozenlist==1.5.0
fsspec==2024.10.0
ftfy==6.3.1
grpcio==1.69.0
h11==0.14.0
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.26.2
humanfriendly==10.0
idna==3.10
importlib_metadata==8.5.0
iniconfig==2.0.0
inquirerpy==0.3.4
invisible-watermark==0.2.0
invoke==2.2.0
itsdangerous==2.2.0
Jinja2==3.1.4
jmespath==1.0.1
kiwisolver==1.4.8
lightning-utilities==0.11.9
lion-pytorch==0.2.2
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@e6bd96b0cf0d127a8a721bdbf218e4e5aa6c16f8#egg=mgds
mpmath==1.3.0
multidict==6.1.0
networkx==3.4.2
numpy==1.26.4
nvidia-ml-py==12.560.30
omegaconf==2.3.0
onnxruntime-gpu==1.20.1
open_clip_torch==2.28.0
opencv-python==4.10.0.84
orjson==3.10.14
packaging==24.2
pandas==2.2.3
paramiko==3.5.0
pfzy==0.3.4
pillow==11.0.0
platformdirs==4.3.6
pluggy==1.5.0
pooch==1.8.2
prettytable==3.12.0
prodigyopt==1.1.1
prompt_toolkit==3.0.48
propcache==0.2.1
protobuf==5.29.3
psutil==6.1.1
py-cpuinfo==9.0.0
pycares==4.5.0
pycparser==2.22
pydantic==2.10.5
pydantic-extra-types==2.10.2
pydantic-settings==2.7.1
pydantic_core==2.27.2
Pygments==2.19.1
PyNaCl==1.5.0
pyparsing==3.2.1
pytest==8.3.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytorch-lightning==2.4.0
pytorch-triton-rocm==3.2.0+git0d4682f0
pytorch_optimizer==3.3.0
pytz==2024.2
PyWavelets==1.8.0
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rich-toolkit==0.13.2
runpod==1.7.4
s3transfer==0.11.1
safetensors==0.4.5
scalene==1.5.45
schedulefree==1.3
scipy==1.14.1
sentencepiece==0.2.0
setuptools==70.2.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
starlette==0.41.3
sympy==1.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
timm==1.0.13
tokenizers==0.21.0
tomli==2.2.1
tomlkit==0.13.2
torch==2.7.0.dev20250116+rocm6.3
torchmetrics==1.6.1
torchvision==0.22.0.dev20250116+rocm6.3
tqdm==4.66.6
tqdm-loggable==0.2
transformers==4.47.0
triton==3.1.0
typer==0.15.1
typing_extensions==4.12.2
tzdata==2024.2
ujson==5.10.0
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
watchdog==6.0.0
watchfiles==1.0.4
wcwidth==0.2.13
websockets==14.1
Werkzeug==3.1.3
wheel==0.43.0
wrapt==1.17.2
yarl==1.18.3
zipp==3.21.0

xzuyn avatar Jan 22 '25 19:01 xzuyn

Maybe it could be related to not using timestep shifting? #653
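For reference, the SD3-family timestep shift remaps sampling sigmas toward higher noise levels. A minimal sketch of the formula from the SD3 paper; shift=3.0 is a common default for 1024px models, and whether it is the right value for SD3.5M training is an assumption, not something confirmed in this thread:

```python
# Sigma shift used by SD3-family models (the mechanism discussed in #653).
# shift=3.0 is a common default for 1024px models; the correct value for
# SD3.5M is an assumption here.
def shift_sigma(sigma: float, shift: float = 3.0) -> float:
    """Remap a uniform sigma in [0, 1] toward higher noise levels."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

print(shift_sigma(0.5))  # 0.75: mid-schedule steps are pushed to higher noise
```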

xzuyn avatar Jan 24 '25 05:01 xzuyn

do you use (random) rotation in your concept?

dxqb avatar Jan 24 '25 10:01 dxqb

I don't use any of the image augmentation settings. I've also tried both a single 1024 resolution and multi-resolution training at 768, 832, 896, 960, and 1024, and both had the same issue.

xzuyn avatar Jan 24 '25 11:01 xzuyn

Without seeing your config I can't really help, but to me these look like VAE artifacts. For some reason the VAE has created a different distribution at the left and bottom edges, and your LoRA has learned that different distribution. Can you check whether you accidentally enabled the force circular padding option on the train tab?
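(For anyone unfamiliar with the option: circular padding makes convolutions wrap around the image borders instead of padding with zeros, so edge pixels see a different input distribution. A minimal PyTorch illustration, not OneTrainer's actual code:)

```python
# Illustrates why circular padding gives convolutions (and hence the VAE)
# a different distribution along image edges. Not OneTrainer's code.
import torch
import torch.nn as nn

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

conv_zeros = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="zeros", bias=False)
conv_circular = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)
conv_circular.weight.data.copy_(conv_zeros.weight.data)  # identical kernels

# Interior outputs match, but edge outputs differ: circular padding pulls
# in values from the opposite side of the image instead of zeros.
print(conv_zeros(x)[0, 0, 0])     # top row with zero padding
print(conv_circular(x)[0, 0, 0])  # top row with wrap-around padding
```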

Nerogar avatar Jan 24 '25 17:01 Nerogar

I didn't have it enabled. This was the last run I tried.

example_problem_config.json

xzuyn avatar Jan 24 '25 22:01 xzuyn

Any ideas? I've tried making a few other LoRAs with different datasets and different settings, and they still end up with the artifacts.

xzuyn avatar Feb 06 '25 00:02 xzuyn

> Any ideas? I've tried making a few other LoRAs with different datasets and different settings, and they still end up with the artifacts.

Can you try a baseline run without changing ANY settings from the default preset? If it doesn't occur, then we know it's an issue with one of the settings you changed; if it does occur, we know it's a bigger problem.

O-J1 avatar Mar 04 '25 03:03 O-J1

I will try to do some more tests in the next few days. I've had other ML issues get fixed recently, so it's possible this problem could be solved too.

xzuyn avatar Mar 13 '25 20:03 xzuyn

I have the same problem when training an SD3.5L LoRA, and I've also tried various training settings (I use sd-scripts), but the problem persists. In ComfyUI, setting the LoRA strength under 0.5, keeping prompts under 250 tokens, setting CFG around 4, and using a non-default image size like 896x896 (instead of the default 1024x1024) can somehow make things better.

jinxishe avatar Apr 07 '25 11:04 jinxishe

> I will try to do some more tests in the next few days. I've had other ML issues get fixed recently, so it's possible this problem could be solved too.

Following up.

O-J1 avatar Apr 08 '25 04:04 O-J1

> I have the same problem when training an SD3.5L LoRA, and I've also tried various training settings (I use sd-scripts), but the problem persists. In ComfyUI, setting the LoRA strength under 0.5, keeping prompts under 250 tokens, setting CFG around 4, and using a non-default image size like 896x896 (instead of the default 1024x1024) can somehow make things better.

This is the OneTrainer repo. We cannot and should not provide help with another repo's training issues, which you said were with sd-scripts.

O-J1 avatar Apr 08 '25 04:04 O-J1

> I have the same problem when training an SD3.5L LoRA, and I've also tried various training settings (I use sd-scripts), but the problem persists. In ComfyUI, setting the LoRA strength under 0.5, keeping prompts under 250 tokens, setting CFG around 4, and using a non-default image size like 896x896 (instead of the default 1024x1024) can somehow make things better.

> This is the OneTrainer repo. We cannot and should not provide help with another repo's training issues, which you said were with sd-scripts.

Thank you for your response. What I meant to say is that this is not an issue with LoRA training, but rather an inherent problem with SD3.5. I just wanted to share the mitigation methods I know of for everyone’s reference. If you think my response is meaningless, please let me know, and I’ll delete it.

jinxishe avatar Apr 08 '25 04:04 jinxishe

> Following up.

I haven't tried a run with the default preset yet, but I have tried playing around with the settings more, and it still has the same problems.

I then tried with Flex.1-Alpha, and that seems to be completely fine. I gave up on SD3.5M.

xzuyn avatar Apr 09 '25 00:04 xzuyn

Given that the training samples are fine, I suspect this is a ComfyUI workflow or ComfyUI code issue.

O-J1 avatar Sep 04 '25 07:09 O-J1