[Feat]: Long CLIP support
Describe your use-case.
I'm asking for support for training models with an integrated LongCLIP-L (246 effective tokens vs. 75): https://arxiv.org/abs/2403.15378. I asked and got the answer that integrating LongCLIP-L is possible: https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/discussions/6
What would you like to see as a solution?
As I see it, this could be a checkbox in "Text Encoder 1" with a description like "This is LongCLIP-L". When it's checked, OneTrainer would cut captions after 246 tokens instead of 75.
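For illustration, a minimal sketch of what that checkbox would change (not OneTrainer code; it just shows the effect of the two limits using the tokenizer from the zer0int repo):

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")
caption = "a very long caption " * 100

# Standard CLIP-L: 77 positions = 75 effective tokens + BOS/EOS
short = tokenizer(caption, truncation=True, max_length=77, return_tensors="pt")
# LongCLIP-L: 248 positions = 246 effective tokens + BOS/EOS
long_ = tokenizer(caption, truncation=True, max_length=248, return_tensors="pt")

print(short["input_ids"].shape, long_["input_ids"].shape)  # (1, 77) vs (1, 248)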
Have you considered alternatives? List them here.
No response
I have it working locally, though not in an upstreamable way, so I'll write down what I figured out along the way.
The LongCLIP files from this repo come by default as a whole CLIPModel, while OneTrainer by default uses CLIPTextModel (equivalent to CLIPModel.text_model).
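A small sketch of that difference (illustration only; loading the text-only class from the full repo may emit warnings about unused vision weights):

from transformers import CLIPModel, CLIPTextModel

repo = "zer0int/LongCLIP-GmP-ViT-L-14"
full = CLIPModel.from_pretrained(repo)           # whole model: text tower + vision tower, as the repo ships it
text_only = CLIPTextModel.from_pretrained(repo)  # just the text tower, which is what OneTrainer expects

# full.text_model and text_only.text_model are the same text tower;
# only that part is relevant for the "Text Encoder 1" slot.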
If my commit https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/commit/59dd3e4d98acf93ef5093091981fe447e947ae1c gets accepted, it will be easier to differentiate between CLIP and LongCLIP just from config.json and to set the proper max_length in modules/model for models using CLIP-L. For now, the pipeline won't run without changing the config or setting somewhere
text_encoder.max_position_embeddings = 248
or
text_encoder.text_config.max_position_embeddings = 248
depending on implementation.
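For reference, reading the limit from config.json could look roughly like this (a sketch, not the actual OneTrainer loading code; the path handling is my assumption):

import json

def clip_max_length(config_path: str) -> int:
    with open(config_path) as f:
        config = json.load(f)
    # CLIPTextModel configs keep the value at the top level,
    # full CLIPModel configs nest it under "text_config"
    if "max_position_embeddings" in config:
        return config["max_position_embeddings"]
    return config.get("text_config", {}).get("max_position_embeddings", 77)

print(clip_max_length("./text_encoder/config.json"))  # 248 for LongCLIP-L, 77 for standard CLIP-L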
I have no idea how to differentiate between LongCLIP and CLIP when using a single-file safetensors checkpoint instead of the diffusers format.
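One heuristic that might work (just an idea, not something implemented anywhere) is to read the shape of the position embedding straight from the safetensors file, since LongCLIP ships 248 positions and standard CLIP-L ships 77. The key suffix below is an assumption meant to cover both text-encoder-only and bundled checkpoints:

from safetensors import safe_open

def detect_token_limit(path: str) -> int:
    suffix = "text_model.embeddings.position_embedding.weight"
    with safe_open(path, framework="pt") as f:
        for name in f.keys():
            if name.endswith(suffix):
                # shape is [max_position_embeddings, hidden_size]:
                # 248 -> LongCLIP, 77 -> standard CLIP
                return f.get_slice(name).get_shape()[0]
    return 77  # fall back to the standard limit if the key is missing

print(detect_token_limit("model.safetensors"))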
You can download LongCLIP in a form that should work out of the box using this Python code:
from transformers import CLIPTextModel, CLIPTokenizer

# Load only the text tower and its tokenizer from the LongCLIP repo
tokenizer = CLIPTokenizer.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")
model = CLIPTextModel.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")

# Save them in diffusers-style directories
tokenizer.save_pretrained("./tokenizer")
model.save_pretrained("./text_encoder")
Just download an SD 1.5 or Flux model in diffusers format and overwrite the model's text_encoder and tokenizer directories with the ones saved by the script.
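In case it helps, the overwrite step in Python would look roughly like this (paths are placeholders):

import shutil

base = "./stable-diffusion-v1-5"  # placeholder: model downloaded in diffusers format

for name in ("text_encoder", "tokenizer"):
    shutil.rmtree(f"{base}/{name}")                 # drop the stock CLIP-L
    shutil.copytree(f"./{name}", f"{base}/{name}")  # copy in the LongCLIP version saved above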
Then use the branch linked below to get full 248-token limit support.
https://github.com/Heasterian/OneTrainer/blob/LongClip/
I don't have Flux downloaded and tested, so let me know whether it works as it should; the implementation there is a little different because of the two different types of encoders.
I used Comfy to replace CLIP with LongCLIP for one of my models. The combined checkpoint was saved and then loaded successfully, but I got an error trying to render images with it.
[2025-01-04 00:11:27.017] CLIP model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[2025-01-04 00:11:27.122] !!! Exception during processing !!! Error(s) in loading state_dict for SD1ClipModel:
size mismatch for clip_l.transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([248, 768]) from checkpoint, the shape in current model is torch.Size([77, 768]).
[2025-01-04 00:11:27.125] Traceback (most recent call last):
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 327, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 202, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 174, in _map_node_over_list
process_inputs(input_dict, i)
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 163, in process_inputs
results.append(getattr(obj, func)(**inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\nodes.py", line 568, in load_checkpoint
out = comfy.sd.load_checkpoint_guess_config(ckpt_path, output_vae=True, output_clip=True, embedding_directory=folder_paths.get_folder_paths("embeddings"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 826, in load_checkpoint_guess_config
out = load_state_dict_guess_config(sd, output_vae, output_clip, output_clipvision, embedding_directory, output_model, model_options, te_model_options=te_model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 881, in load_state_dict_guess_config
m, u = clip.load_sd(clip_sd, full_model=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 228, in load_sd
return self.cond_stage_model.load_state_dict(sd, strict=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\StableDiffusion\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 2584, in load_state_dict
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for SD1ClipModel:
size mismatch for clip_l.transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([248, 768]) from checkpoint, the shape in current model is torch.Size([77, 768]).
Well, you are loading the model from a single file, not the diffusers format I mentioned. With safetensors, the code falls back to 77 tokens because the config does not include any info about max position embeddings.
Does Comfy save a .yaml file alongside the safetensors? If yes, send it here.
Yeah, I know the word "diffusers", but I'm not sure I'm able to work with it. No, Comfy doesn't save a yaml. I've just found that Comfy has an extension to work with diffusers, but I can't try it right now: https://github.com/Limitex/ComfyUI-Diffusers?tab=readme-ov-file :-(
You can convert the model to diffusers format using the tool from the Tools tab in OneTrainer.
Just overwrite text_encoder and tokenizer in the resulting directory as I described here: https://github.com/Nerogar/OneTrainer/issues/624#issuecomment-2571331115
I just saw that this is about Comfy not loading the model, not OneTrainer. You should open an issue on the Comfy repo about this.
Would really like to see Long CLIP support in OneTrainer!
There is a proper ComfyUI extension that supports it now, btw:
https://www.runcomfy.com/comfyui-nodes/ComfyUI-Long-CLIP