
Null vocab_file Issue with mistral v03 based models when using union tokenizer source

Open · guillermo-gabrielli-fer opened this issue 6 months ago · 2 comments

Environment

Conda environment:
- python=3.10
- mergekit commit f086664c983ad8b5f126d40ce2e4385f9e65f32c (latest as of yesterday)
- transformers from git @ git+https://github.com/huggingface/transformers 85817d98fb60977c97e3014196a462b732d2ed1a (latest as of yesterday)

The same issue occurs with the transformers version installed by mergekit (I think it's 4.44).

Issue

When merging two models based on the Mistral v0.3 base, mergekit saves the base tokenizer to a temp directory to avoid mutating it ("# HACK: save base tokenizer to temp dir and reload to avoid mutating base_tok"), but then fails to load it back.

Configuration file (these were not the models I was originally trying, but they reproduce the issue):

models:
  - model: mistralai/Mistral-7B-v0.3
  - model: mistralai/Mistral-7B-Instruct-v0.3
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.3
tokenizer:
  source: union
parameters:
  t:
    - value: 0.8
dtype: bfloat16

Originally I was trying to merge the base model with one that has a custom tokenizer with the same vocabulary size but different tokens (I can link the model if needed). However, I'm hitting the same issue with any Mistral v0.3-based model, so the custom tokenizer doesn't appear to be the cause.

Exception, from running: mergekit-yaml report_issue_mistral.yaml EXAMPLE_MISTRAL_ISSUE/ --out-shard-size 1B --cuda --lazy-unpickle -v

mergekit/mergekit/tokenizer/build.py", line 155, in build_union_tokenizer
    res = transformers.AutoTokenizer.from_pretrained(
[......]
transformers/models/llama/tokenization_llama.py", line 201, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
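For context, the crash boils down to the reloaded slow tokenizer having self.vocab_file set to None (presumably because the saved directory is missing the SentencePiece tokenizer.model file), which open() rejects. A minimal sketch of just that failure mode, with vocab_file standing in for self.vocab_file:

```python
# After the save/reload round-trip, the slow Llama tokenizer's vocab_file
# attribute ends up as None instead of a path to tokenizer.model.
vocab_file = None

try:
    # This mirrors the failing line in get_spm_processor.
    with open(vocab_file, "rb") as f:
        pass
except TypeError as exc:
    # TypeError: expected str, bytes or os.PathLike object, not NoneType
    print(exc)
```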

I could get past that error by also saving with legacy_format=True, but then it shows:

mergekit/mergekit/tokenizer/embed.py", line 62, in execute
    token_configs = dict(**self.tokens) or {}
TypeError: dict() argument after ** must be a mapping, not NoneType

I could get the merge to finish by moving the {} fallback inside the dict() call, but I'm not sure yet whether the result is correct.
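For reference, a minimal sketch of that change (tokens stands in for self.tokens; this is my local patch, not an upstream fix):

```python
# self.tokens can be None when the merge config has no token overrides.
tokens = None

# Original line from mergekit/tokenizer/embed.py: dict(**None) raises
# before the `or {}` fallback is ever evaluated.
try:
    token_configs = dict(**tokens) or {}
except TypeError as exc:
    print(exc)  # raises TypeError: argument after ** must be a mapping

# Moving the fallback inside dict() handles the None case:
token_configs = dict(**(tokens or {}))
print(token_configs)  # {}
```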

tracebacks.txt

pip_freeze.txt

guillermo-gabrielli-fer · Aug 09 '24 15:08