[Feat]: Long CLIP support
Describe your use-case.
I'm asking for support for training models with an integrated LongCLIP-L (246 effective tokens vs. 75): https://arxiv.org/abs/2403.15378. I asked and got the answer that integrating LongCLIP-L is possible: https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/discussions/6
What would you like to see as a solution?
As I see it, this could be a checkbox in "Text Encoder 1" with a description like "This is LongCLIP-L". When it's checked, OneTrainer would cut captions after 246 tokens instead of 75.
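For illustration, a minimal sketch of what that checkbox would change (not OneTrainer code; it just shows the effect of the two limits using the tokenizer from the zer0int repo):

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")
caption = "a very long caption " * 100

# Standard CLIP-L: 77 positions = 75 effective tokens + BOS/EOS
short = tokenizer(caption, truncation=True, max_length=77, return_tensors="pt")
# LongCLIP-L: 248 positions = 246 effective tokens + BOS/EOS
long_ = tokenizer(caption, truncation=True, max_length=248, return_tensors="pt")

print(short["input_ids"].shape, long_["input_ids"].shape)  # (1, 77) vs (1, 248)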
Have you considered alternatives? List them here.
No response
I have it working locally, though not in an upstreamable way, so I'll write down what I figured out along the way.
The LongCLIP files from this repo come by default as a whole CLIPModel, while OneTrainer by default uses CLIPTextModel (equivalent to CLIPModel.text_model).
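A small sketch of that difference (illustration only; loading the text-only class from the full repo may emit warnings about unused vision weights):

from transformers import CLIPModel, CLIPTextModel

repo = "zer0int/LongCLIP-GmP-ViT-L-14"
full = CLIPModel.from_pretrained(repo)           # whole model: text tower + vision tower, as the repo ships it
text_only = CLIPTextModel.from_pretrained(repo)  # just the text tower, which is what OneTrainer expects

# full.text_model and text_only.text_model are the same text tower;
# only that part is relevant for the "Text Encoder 1" slot.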
If my commit https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/commit/59dd3e4d98acf93ef5093091981fe447e947ae1c gets accepted, it will be easier to differentiate between CLIP and LongCLIP just from config.json and to set the proper max_length in modules/model for models using CLIP-L. For now, the pipeline won't run without changing the config or setting somewhere
text_encoder.max_position_embeddings = 248
or
text_encoder.text_config.max_position_embeddings = 248
depending on implementation.
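For reference, reading the limit from config.json could look roughly like this (a sketch, not the actual OneTrainer loading code; the path handling is my assumption):

import json

def clip_max_length(config_path: str) -> int:
    with open(config_path) as f:
        config = json.load(f)
    # CLIPTextModel configs keep the value at the top level,
    # full CLIPModel configs nest it under "text_config"
    if "max_position_embeddings" in config:
        return config["max_position_embeddings"]
    return config.get("text_config", {}).get("max_position_embeddings", 77)

print(clip_max_length("./text_encoder/config.json"))  # 248 for LongCLIP-L, 77 for standard CLIP-L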
I have no idea how to differentiate between LongCLIP and CLIP when using a single-file safetensors checkpoint instead of the diffusers format.
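One heuristic that might work (just an idea, not something implemented anywhere) is to read the shape of the position embedding straight from the safetensors file, since LongCLIP ships 248 positions and standard CLIP-L ships 77. The key suffix below is an assumption meant to cover both text-encoder-only and bundled checkpoints:

from safetensors import safe_open

def detect_token_limit(path: str) -> int:
    suffix = "text_model.embeddings.position_embedding.weight"
    with safe_open(path, framework="pt") as f:
        for name in f.keys():
            if name.endswith(suffix):
                # shape is [max_position_embeddings, hidden_size]:
                # 248 -> LongCLIP, 77 -> standard CLIP
                return f.get_slice(name).get_shape()[0]
    return 77  # fall back to the standard limit if the key is missing

print(detect_token_limit("model.safetensors"))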
You can download LongCLIP in a form that should work out of the box using this Python code:
from transformers import CLIPTextModel, CLIPTokenizer

# Load only the text tower and its tokenizer from the LongCLIP repo
tokenizer = CLIPTokenizer.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")
model = CLIPTextModel.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")

# Save them in diffusers-style directories
tokenizer.save_pretrained("./tokenizer")
model.save_pretrained("./text_encoder")
Just download an SD 1.5 or Flux model in diffusers format and overwrite the model's text_encoder and tokenizer directories with the ones saved by the script.
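In case it helps, the overwrite step in Python would look roughly like this (paths are placeholders):

import shutil

base = "./stable-diffusion-v1-5"  # placeholder: model downloaded in diffusers format

for name in ("text_encoder", "tokenizer"):
    shutil.rmtree(f"{base}/{name}")                 # drop the stock CLIP-L
    shutil.copytree(f"./{name}", f"{base}/{name}")  # copy in the LongCLIP version saved above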
Then use the branch linked below to get full 248-token limit support.
https://github.com/Heasterian/OneTrainer/blob/LongClip/
I don't have Flux downloaded and tested, so let me know whether it works as it should; the implementation there is a little different because of the two different types of encoders.
I used Comfy to replace CLIP with LongCLIP for one of my models. The combined checkpoint was saved and then loaded successfully, but I got an error trying to render images with it.
[2025-01-04 00:11:27.017] CLIP model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[2025-01-04 00:11:27.122] !!! Exception during processing !!! Error(s) in loading state_dict for SD1ClipModel:
size mismatch for clip_l.transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([248, 768]) from checkpoint, the shape in current model is torch.Size([77, 768]).
[2025-01-04 00:11:27.125] Traceback (most recent call last):
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 327, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 202, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 174, in _map_node_over_list
process_inputs(input_dict, i)
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 163, in process_inputs
results.append(getattr(obj, func)(**inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\nodes.py", line 568, in load_checkpoint
out = comfy.sd.load_checkpoint_guess_config(ckpt_path, output_vae=True, output_clip=True, embedding_directory=folder_paths.get_folder_paths("embeddings"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 826, in load_checkpoint_guess_config
out = load_state_dict_guess_config(sd, output_vae, output_clip, output_clipvision, embedding_directory, output_model, model_options, te_model_options=te_model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 881, in load_state_dict_guess_config
m, u = clip.load_sd(clip_sd, full_model=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 228, in load_sd
return self.cond_stage_model.load_state_dict(sd, strict=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 2584, in load_state_dict
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for SD1ClipModel:
size mismatch for clip_l.transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([248, 768]) from checkpoint, the shape in current model is torch.Size([77, 768]).
Well, you are loading the model from a single file, not the diffusers format I mentioned. With safetensors, the code falls back to 77 tokens because the config does not include any info about max position embeddings.
Does Comfy save a .yaml file alongside the safetensors? If yes, send it here.
Yeah, I know the word "diffusers", but I'm not sure I'm able to work with it. No, Comfy doesn't save a yaml. I've just found that Comfy has an extension to work with diffusers, but I can't try it right now: https://github.com/Limitex/ComfyUI-Diffusers?tab=readme-ov-file :-(
You can convert the model to diffusers format using the tool from the Tools tab in OneTrainer.
Just overwrite text_encoder and tokenizer in the resulting directory as I described here: https://github.com/Nerogar/OneTrainer/issues/624#issuecomment-2571331115
I just saw that this is about Comfy not loading the model, not OneTrainer. You should open an issue on the Comfy repo about this.
Would really like to see Long CLIP support in OneTrainer!
There is a proper ComfyUI extension that supports it now, btw:
https://www.runcomfy.com/comfyui-nodes/ComfyUI-Long-CLIP