
Support for new T5 tokenizer

Open Kaoru8 opened this issue 8 months ago • 9 comments

Just a heads up that there's a new T5 + tokenizer out here, which you may or may not want to officially implement support for.

Kaoru8 avatar Mar 15 '25 19:03 Kaoru8

What is the change required of this repo? Changing the tokenizer/tokenizer.json?

rockerBOO avatar Mar 17 '25 23:03 rockerBOO

What is the change required of this repo? Changing the tokenizer/tokenizer.json?

Not changing it - that would break support for the pre-existing T5-XXL, and I already provide patches that swap it. Ideally, this repo would support both tokenizers without breaking either.

So probably:

  • including my variant of tokenizer.json in the repo
  • dynamically changing the vocab_size in the currently hardcoded T5_CONFIG_JSON (library/flux_utils.py) depending on which tokenizer the user wants to use
  • handling the T5_XXL_TOKENIZER_ID variable in library/strategy_flux.py differently to point at the new tokenizer.json file when the user wants to use that, instead of being hardcoded to "google/t5-v1_1-xxl"
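The second bullet above could be sketched roughly like this - a minimal illustration only, with hypothetical names (`t5_config_for` is made up; the real hardcoded config lives in `library/flux_utils.py` as `T5_CONFIG_JSON`, and only a couple of its fields are shown here):

```python
import json

# Abbreviated stand-in for the hardcoded T5_CONFIG_JSON in library/flux_utils.py.
# 32128 is the stock google/t5-v1_1-xxl vocabulary size; d_model 4096 is T5-XXL.
T5_CONFIG_JSON = json.dumps({
    "architectures": ["T5EncoderModel"],
    "d_model": 4096,
    "vocab_size": 32128,
})

def t5_config_for(tokenizer_len: int) -> dict:
    """Return the T5 config with vocab_size matched to the active tokenizer.

    tokenizer_len would come from len(tokenizer) after loading whichever
    tokenizer.json the user selected, so both the stock and the new
    tokenizer work without touching the rest of the config.
    """
    config = json.loads(T5_CONFIG_JSON)
    config["vocab_size"] = tokenizer_len
    return config
```

So the stock tokenizer keeps `vocab_size` at 32128, while a larger tokenizer transparently gets the embedding size it needs.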

Kaoru8 avatar Mar 18 '25 02:03 Kaoru8

I think allowing users to set the tokenizer makes sense. The T5 config would typically come from the Hugging Face model, though it's hardcoded here because we have people enter a model without needing the tokenizer specifically. For your model, if you rename t5_config_xxl.json to config.json, then we could point to it and it would load the tokenizer.json along with the appropriate vocab size.

tokenizer = AutoTokenizer.from_pretrained('Kaoru8/T5XXL-Unchained')

then the user could enter a tokenizer

--tokenizer 'Kaoru8/T5XXL-Unchained'

rockerBOO avatar Mar 18 '25 04:03 rockerBOO
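The proposed `--tokenizer` flag could be wired up along these lines - a sketch only, since the flag doesn't exist in the repo yet; the default mirrors the currently hardcoded `T5_XXL_TOKENIZER_ID`, and the parsed value would be handed straight to `AutoTokenizer.from_pretrained`, which accepts either a Hub repo id or a local path:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI wiring for a user-selectable T5 tokenizer."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--tokenizer",
        default="google/t5-v1_1-xxl",  # matches the current hardcoded default
        help="HF repo id or local path passed to AutoTokenizer.from_pretrained",
    )
    return parser

# e.g. args = build_parser().parse_args(["--tokenizer", "Kaoru8/T5XXL-Unchained"])
# then: tokenizer = AutoTokenizer.from_pretrained(args.tokenizer)
```

Because `from_pretrained` resolves local directories as well as Hub ids, the same flag would cover cloud setups that keep a patched tokenizer.json on disk.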

That does seem like the most elegant solution. It would be future-compatible with any other tokenizers as well, and work with both HF repositories and local file paths.

I'll rename my config file on HF. Thank you :)

Kaoru8 avatar Mar 18 '25 07:03 Kaoru8

Just a heads up if anyone wants to try it: make sure you change the path to the tests folder if you're training in the cloud.

T5_XXL_TOKENIZER_ID = "/workspace/kohya_ss/sd-scripts/tests"

in library/strategy_flux.py

EClipXAi avatar Mar 19 '25 04:03 EClipXAi

Do you know if this works for sd forge? or do we have to patch that as well?

Thanks

DarkViewAI avatar Mar 19 '25 05:03 DarkViewAI

@Kaoru8, have you experimented with this new T5 tokenizer? If so, did you notice improvements?

rcanepa avatar Mar 21 '25 12:03 rcanepa

@Kaoru8, have you experimented with this new T5 tokenizer? If so, did you notice improvements?

I'm the one who released it, so I'm obviously biased - take my opinion with a grain of salt. It's also experimental.

But in my personal, limited experience with it after fine-tuning - yes, the new tokenizer exhibits the characteristics described in the project's README, and the known issues with it diminish and eventually disappear with training.

TL;DR - If you just want to download new models and get uncensoring and the other benefits of the new tokenizer without training, this release will do nothing for you yet; you'll just get slightly diminished prompt adherence and artifacts in some outputs. You'll probably have to wait a few weeks (or months) for other people to start training models with the new tokenizer and releasing their LoRAs and model merges.

If you want to, and have the means to, train such LoRAs yourself, you can start doing that with the current release right now.

Kaoru8 avatar Mar 21 '25 15:03 Kaoru8

Well I'm training it anyway.

AbstractEyes avatar Apr 08 '25 04:04 AbstractEyes