
Support for new T5 tokenizer

Open Kaoru8 opened this issue 8 months ago • 9 comments

Just a heads up that there's a new T5 + tokenizer out here, which you may or may not want to officially implement support for.

Kaoru8 avatar Mar 15 '25 19:03 Kaoru8

What is the change required of this repo? Changing the tokenizer/tokenizer.json?

rockerBOO avatar Mar 17 '25 23:03 rockerBOO

What is the change required of this repo? Changing the tokenizer/tokenizer.json?

Not changing it - that would break support for the pre-existing T5-XXL, and I already provide patches that swap it. Ideally, this repo would support both tokenizers without breaking either.

So probably:

  • including my variant of tokenizer.json in the repo
  • dynamically changing the vocab_size in the currently hardcoded T5_CONFIG_JSON (library/flux_utils.py) depending on which tokenizer the user wants to use
  • handling the T5_XXL_TOKENIZER_ID variable in library/strategy_flux.py differently to point at the new tokenizer.json file when the user wants to use that, instead of being hardcoded to "google/t5-v1_1-xxl"
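The second bullet above could be sketched roughly like this - a minimal illustration only, with hypothetical names (`t5_config_for` is made up; the real hardcoded config lives in `library/flux_utils.py` as `T5_CONFIG_JSON`, and only a couple of its fields are shown here):

```python
import json

# Abbreviated stand-in for the hardcoded T5_CONFIG_JSON in library/flux_utils.py.
# 32128 is the stock google/t5-v1_1-xxl vocabulary size; d_model 4096 is T5-XXL.
T5_CONFIG_JSON = json.dumps({
    "architectures": ["T5EncoderModel"],
    "d_model": 4096,
    "vocab_size": 32128,
})

def t5_config_for(tokenizer_len: int) -> dict:
    """Return the T5 config with vocab_size matched to the active tokenizer.

    tokenizer_len would come from len(tokenizer) after loading whichever
    tokenizer.json the user selected, so both the stock and the new
    tokenizer work without touching the rest of the config.
    """
    config = json.loads(T5_CONFIG_JSON)
    config["vocab_size"] = tokenizer_len
    return config
```

So the stock tokenizer keeps `vocab_size` at 32128, while a larger tokenizer transparently gets the embedding size it needs.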

Kaoru8 avatar Mar 18 '25 02:03 Kaoru8

I think allowing users to set the tokenizer makes sense. The T5 config would typically come from the Hugging Face model, though it's hardcoded here because we have people enter a model without needing the tokenizer specifically. For your model, if you rename t5_config_xxl.json to config.json, then we could point to it and it would load the tokenizer.json along with the appropriate vocab size.

tokenizer = AutoTokenizer.from_pretrained('Kaoru8/T5XXL-Unchained')

then the user could enter a tokenizer

--tokenizer 'Kaoru8/T5XXL-Unchained'

rockerBOO avatar Mar 18 '25 04:03 rockerBOO
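The proposed `--tokenizer` flag could be wired up along these lines - a sketch only, since the flag doesn't exist in the repo yet; the default mirrors the currently hardcoded `T5_XXL_TOKENIZER_ID`, and the parsed value would be handed straight to `AutoTokenizer.from_pretrained`, which accepts either a Hub repo id or a local path:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI wiring for a user-selectable T5 tokenizer."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--tokenizer",
        default="google/t5-v1_1-xxl",  # matches the current hardcoded default
        help="HF repo id or local path passed to AutoTokenizer.from_pretrained",
    )
    return parser

# e.g. args = build_parser().parse_args(["--tokenizer", "Kaoru8/T5XXL-Unchained"])
# then: tokenizer = AutoTokenizer.from_pretrained(args.tokenizer)
```

Because `from_pretrained` resolves local directories as well as Hub ids, the same flag would cover cloud setups that keep a patched tokenizer.json on disk.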

That does seem like the most elegant solution. It would be future-compatible with any other tokenizers as well, and work with both HF repositories and local file paths.

I'll rename my config file on HF. Thank you :)

Kaoru8 avatar Mar 18 '25 07:03 Kaoru8

Just a heads up if anyone wants to try it: make sure you change the path to the tests folder if you're training in the cloud.

T5_XXL_TOKENIZER_ID = "/workspace/kohya_ss/sd-scripts/tests"

in library/strategy_flux.py

EClipXAi avatar Mar 19 '25 04:03 EClipXAi

Do you know if this works for sd forge? or do we have to patch that as well?

Thanks

DarkViewAI avatar Mar 19 '25 05:03 DarkViewAI

@Kaoru8, have you experimented with this new T5 tokenizer? If so, did you notice improvements?

rcanepa avatar Mar 21 '25 12:03 rcanepa

@Kaoru8, have you experimented with this new T5 tokenizer? If so, did you notice improvements?

I'm the one who released it, so I'm obviously biased - take my opinion with a grain of salt. It's also experimental.

But in my personal, limited experience with it after fine-tuning - yes, the new tokenizer exhibits the characteristics described in the project's README, and the known issues with it diminish and eventually disappear with training.

TL;DR - If you just want to download new models and get uncensoring and the other benefits of the new tokenizer without training, this release will do nothing for you yet; you'll just get slightly diminished prompt adherence and artifacts in some outputs. You'll probably have to wait a few weeks (or months) for other people to start training models with the new tokenizer and releasing their LoRAs and model merges.

If you want to, and have the means to, train such LoRAs yourself, you can start doing that with the current release right now.

Kaoru8 avatar Mar 21 '25 15:03 Kaoru8

Well I'm training it anyway.

AbstractEyes avatar Apr 08 '25 04:04 AbstractEyes