mergekit
Null vocab_file issue with Mistral v0.3 based models when using union tokenizer source
Environment
Conda environment: python=3.10
mergekit: commit f086664c983ad8b5f126d40ce2e4385f9e65f32c (latest as of yesterday)
transformers: from git @ git+https://github.com/huggingface/transformers, commit 85817d98fb60977c97e3014196a462b732d2ed1a (latest as of yesterday)
The same issue occurs with the transformers version installed by mergekit (4.44, I think).
Issue
When merging two models based on the Mistral v0.3 base with the tokenizer source set to union, the step that saves the base tokenizer to a temp dir and reloads it ("# HACK: save base tokenizer to temp dir and reload to avoid mutating base_tok" in build.py) fails to load it back.
Configuration file (these were not the models I was originally trying to merge, but they reproduce the issue):
models:
  - model: mistralai/Mistral-7B-v0.3
  - model: mistralai/Mistral-7B-Instruct-v0.3
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.3
tokenizer:
  source: union
parameters:
  t:
    - value: 0.8
dtype: bfloat16
Originally I was trying to merge the base model with a model that has a custom tokenizer (same vocabulary size, different tokens); I can link that model if needed. However, the same issue occurs with any Mistral v0.3 based model, so the custom tokenizer doesn't appear to be the cause.
Exception when running mergekit-yaml report_issue_mistral.yaml EXAMPLE_MISTRAL_ISSUE/ --out-shard-size 1B --cuda --lazy-unpickle -v:
mergekit/mergekit/tokenizer/build.py", line 155, in build_union_tokenizer
    res = transformers.AutoTokenizer.from_pretrained(
[......]
transformers/models/llama/tokenization_llama.py", line 201, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
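If I'm reading the traceback right, the slow LlamaTokenizer is being constructed without its sentencepiece model file. A minimal sketch of the same failure outside mergekit (this is my guess at the root cause, not mergekit code; it assumes sentencepiece is installed):

import transformers

# Constructing the slow Llama tokenizer with no vocab_file hits the same
# open(self.vocab_file, "rb") call in get_spm_processor and raises:
# TypeError: expected str, bytes or os.PathLike object, not NoneType
transformers.LlamaTokenizer(vocab_file=None, legacy=False)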
I could get past that error by also saving with legacy_format=True, but then it fails with:
mergekit/mergekit/tokenizer/embed.py", line 62, in execute
    token_configs = dict(**self.tokens) or {}
TypeError: dict() argument after ** must be a mapping, not NoneType
I could get the merge to finish by moving the {} fallback inside the dict() call, but I'm not sure yet whether the result is correct.
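For reference, the two local changes look roughly like this (the surrounding mergekit code and names are paraphrased from memory, so treat this as a sketch rather than a patch):

# tokenizer/build.py: additionally write the legacy (sentencepiece) files when
# saving the base tokenizer to the temp dir, so tokenizer.model exists when
# AutoTokenizer reloads it.
base_tok.save_pretrained(tmp_dir, legacy_format=True)

# tokenizer/embed.py: apply the {} fallback before the ** expansion, since
# self.tokens is None here rather than an empty mapping.
token_configs = dict(**(self.tokens or {}))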