[Bug]: Artifacts on left and bottom sides of images with SD3.5M LoRAs
What happened?
I'm seeing artifacts with LoRAs I trained on SD3.5M, but only when generated outside of OneTrainer. The sample images generated during training look perfectly fine, but the ones in ComfyUI (even with the same settings) are mangled.
I'm unsure whether I should open this issue here or with ComfyUI, since I tested somebody else's SD3.5M LoRA and it worked perfectly fine, even with the same long prompts.
I've tried playing around with different settings, adding EMA, and switching optimizers, and nothing has helped.
At first I thought it might be because some of my training samples contained captions longer than 256 T5 tokens, so I removed those samples and tried again. The problem still occurred. I was then told that SD3.5 supposedly doesn't support a 256-token max length and that its real T5 max is only 154, so I tried again with all captions exceeding that removed. Again, the same problem. If the problem were the max length, it wouldn't explain why the trainer-generated images look fine anyway.
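(For reference, a minimal sketch of how caption token counts can be checked with the T5 tokenizer; the tokenizer ID and the `captions/` folder of `.txt` files are assumptions for illustration, not part of the original report.)

```python
from pathlib import Path
from transformers import AutoTokenizer

# Assumption: SD3.5's T5-XXL text encoder uses the google/t5-v1_1-xxl tokenizer.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")

MAX_T5_TOKENS = 154  # suggested effective limit; 256 is the usual truncation length

for caption_file in sorted(Path("captions").glob("*.txt")):  # hypothetical caption folder
    text = caption_file.read_text(encoding="utf-8").strip()
    n_tokens = len(tokenizer(text).input_ids)  # includes the EOS token
    if n_tokens > MAX_T5_TOKENS:
        print(f"{caption_file.name}: {n_tokens} T5 tokens (over {MAX_T5_TOKENS})")
```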
So now I'm out of ideas on any mistakes I could have made. Any ideas?
Samples generated in ComfyUI using a LoRA I created with OneTrainer. Prompts are basically all more than 120 tokens.
Samples generated during training (left) vs samples generated in ComfyUI (right), both with the same settings.
What did you expect would happen?
I expected the LoRA to "just work".
Relevant log output
Output of pip freeze
absl-py==2.1.0
accelerate==1.0.1
aiodns==3.2.0
aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiohttp-retry==2.9.1
aiosignal==1.3.2
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.8.0
attrs==24.3.0
backoff==2.2.1
bcrypt==4.2.1
bitsandbytes @ file:///media/xzuyn/NVMe/LClones/OneTrainer/bnb
boto3==1.36.1
botocore==1.36.1
Brotli==1.1.0
certifi==2024.12.14
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.3.1
cryptography==43.0.3
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
decorator==5.1.1
Deprecated==1.2.15
-e git+https://github.com/huggingface/diffusers.git@55ac1dbdf2e77dcc93b0fa87d638d074219922e4#egg=diffusers
dnspython==2.7.0
einops==0.8.0
email_validator==2.2.0
fabric==3.2.2
fastapi==0.115.6
fastapi-cli==0.0.7
filelock==3.16.1
flatbuffers==24.12.23
fonttools==4.55.3
frozenlist==1.5.0
fsspec==2024.10.0
ftfy==6.3.1
grpcio==1.69.0
h11==0.14.0
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.26.2
humanfriendly==10.0
idna==3.10
importlib_metadata==8.5.0
iniconfig==2.0.0
inquirerpy==0.3.4
invisible-watermark==0.2.0
invoke==2.2.0
itsdangerous==2.2.0
Jinja2==3.1.4
jmespath==1.0.1
kiwisolver==1.4.8
lightning-utilities==0.11.9
lion-pytorch==0.2.2
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@e6bd96b0cf0d127a8a721bdbf218e4e5aa6c16f8#egg=mgds
mpmath==1.3.0
multidict==6.1.0
networkx==3.4.2
numpy==1.26.4
nvidia-ml-py==12.560.30
omegaconf==2.3.0
onnxruntime-gpu==1.20.1
open_clip_torch==2.28.0
opencv-python==4.10.0.84
orjson==3.10.14
packaging==24.2
pandas==2.2.3
paramiko==3.5.0
pfzy==0.3.4
pillow==11.0.0
platformdirs==4.3.6
pluggy==1.5.0
pooch==1.8.2
prettytable==3.12.0
prodigyopt==1.1.1
prompt_toolkit==3.0.48
propcache==0.2.1
protobuf==5.29.3
psutil==6.1.1
py-cpuinfo==9.0.0
pycares==4.5.0
pycparser==2.22
pydantic==2.10.5
pydantic-extra-types==2.10.2
pydantic-settings==2.7.1
pydantic_core==2.27.2
Pygments==2.19.1
PyNaCl==1.5.0
pyparsing==3.2.1
pytest==8.3.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytorch-lightning==2.4.0
pytorch-triton-rocm==3.2.0+git0d4682f0
pytorch_optimizer==3.3.0
pytz==2024.2
PyWavelets==1.8.0
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rich-toolkit==0.13.2
runpod==1.7.4
s3transfer==0.11.1
safetensors==0.4.5
scalene==1.5.45
schedulefree==1.3
scipy==1.14.1
sentencepiece==0.2.0
setuptools==70.2.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
starlette==0.41.3
sympy==1.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
timm==1.0.13
tokenizers==0.21.0
tomli==2.2.1
tomlkit==0.13.2
torch==2.7.0.dev20250116+rocm6.3
torchmetrics==1.6.1
torchvision==0.22.0.dev20250116+rocm6.3
tqdm==4.66.6
tqdm-loggable==0.2
transformers==4.47.0
triton==3.1.0
typer==0.15.1
typing_extensions==4.12.2
tzdata==2024.2
ujson==5.10.0
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
watchdog==6.0.0
watchfiles==1.0.4
wcwidth==0.2.13
websockets==14.1
Werkzeug==3.1.3
wheel==0.43.0
wrapt==1.17.2
yarl==1.18.3
zipp==3.21.0
Maybe it could be related to not using timestep shifting? #653
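(For context, "timestep shifting" here refers to the resolution-dependent shift from the SD3 paper, which remaps the flow-matching timestep roughly as in the sketch below; this is only an illustration of the formula, not OneTrainer's or ComfyUI's actual code.)

```python
def shift_timestep(t: float, shift: float = 3.0) -> float:
    """Remap a normalized flow-matching timestep t in [0, 1].

    shift=3.0 is the value commonly quoted for 1024px SD3 models
    (an assumption here, not something taken from this thread).
    """
    return shift * t / (1.0 + (shift - 1.0) * t)
```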
Do you use (random) rotation in your concept?
I don't use any of the image augmentation settings. I've also tried training at just 1024 resolution, as well as 768, 832, 896, 960, 1024 for multi-resolution, and both had the same issue.
Without seeing your config I can't really help, but to me these look like VAE artifacts. For some reason the VAE has created a different distribution at the left and bottom edges, and your LoRA has learned that distribution. Can you check whether you accidentally enabled the force circular padding option on the train tab?
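(As a quick sanity check outside the trainer, something like the sketch below can list any VAE convolutions that are not using the default zero padding; the model ID is an assumption, and this is a generic diffusers/PyTorch check rather than OneTrainer's own option handling.)

```python
import torch
from diffusers import AutoencoderKL

# Assumption: the SD3.5 Medium VAE lives in the "vae" subfolder of this repo.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", subfolder="vae", torch_dtype=torch.float16
)

# Zero padding is the default; "circular" padding wraps the image around at the
# borders, which gives the latents a different distribution along the edges.
for name, module in vae.named_modules():
    if isinstance(module, torch.nn.Conv2d) and module.padding_mode != "zeros":
        print(f"{name}: padding_mode={module.padding_mode}")
```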
Any ideas? I've tried making a few other LoRAs with different datasets and different settings, and they still end up with the artifacts.
Can you try a baseline run without changing ANY settings from the default preset? If it doesn't occur, then we know it's an issue with one of the settings you changed; if it does occur, we know it's a bigger problem.
I will try to do some more tests in the next few days. I've had other ML issues get fixed recently, so it's possible this problem could be solved too.
I have the same problem when training an SD3.5L LoRA. I also tried various training settings (I use sd-scripts), but the problem persists. In ComfyUI, setting the LoRA strength under 0.5, keeping prompts under 250 tokens, using CFG around 4, and using a non-default image size like 896x896 (instead of the default 1024x1024) can somehow make things better.
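(For reference, a rough diffusers equivalent of those mitigations might look like the sketch below; the model ID, LoRA filename, and prompt are placeholders, and it assumes a diffusers version with SD3 LoRA support rather than reproducing the exact ComfyUI node settings.)

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

# Placeholder LoRA file; lower the strength below 0.5 as described above.
pipe.load_lora_weights("my_sd35_lora.safetensors", adapter_name="my_lora")
pipe.set_adapters(["my_lora"], adapter_weights=[0.4])

image = pipe(
    prompt="a long descriptive prompt...",  # keep under ~250 T5 tokens
    guidance_scale=4.0,                     # CFG around 4
    height=896, width=896,                  # non-default size instead of 1024x1024
    max_sequence_length=256,                # T5 truncation length
    num_inference_steps=28,
).images[0]
image.save("out.png")
```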
Following up.
This is the OneTrainer repo. We cannot and should not provide help with another repo's training issues, which you said was sd-scripts.
Thank you for your response. What I meant to say is that this is not an issue with LoRA training, but rather an inherent problem with SD3.5. I just wanted to share the mitigation methods I know of for everyone’s reference. If you think my response is meaningless, please let me know, and I’ll delete it.
I have not tried the default preset yet, but I have played around with the settings more. It still has the same problems.
I then tried with Flex.1-Alpha, and that seems to be completely fine. I gave up on SD3.5M.
Given that the training samples are fine, I suspect this is a ComfyUI workflow or ComfyUI code issue.