graphrag

[compatibility issue] Support open source LLM models for prompt tuning

Open KylinMountain opened this issue 1 year ago • 0 comments

Description

When using an open source LLM such as gemma2-9b-it for prompt tuning,

python -m graphrag.prompt_tune --root . --domain "novels" --language English --chunk-size 300

it fails with the error: Could not automatically map gemma2-9b-it to a tokeniser. Please use tiktoken.get_encoding to explicitly get the tokeniser you expect.

The full exception log:

Traceback (most recent call last):
  File "/Users/evilkylin/Projects/miniforge3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/evilkylin/Projects/miniforge3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/evilkylin/Projects/graphrag/graphrag/prompt_tune/__main__.py", line 108, in <module>
    loop.run_until_complete(
  File "/Users/evilkylin/Projects/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/Users/evilkylin/Projects/graphrag/graphrag/prompt_tune/cli.py", line 62, in fine_tune
    await fine_tune_with_config(
  File "/Users/evilkylin/Projects/graphrag/graphrag/prompt_tune/cli.py", line 132, in fine_tune_with_config
    await generate_indexing_prompts(
  File "/Users/evilkylin/Projects/graphrag/graphrag/prompt_tune/cli.py", line 215, in generate_indexing_prompts
    create_entity_extraction_prompt(
  File "/Users/evilkylin/Projects/graphrag/graphrag/prompt_tune/generator/entity_extraction_prompt.py", line 58, in create_entity_extraction_prompt
    - num_tokens_from_string(prompt, model=model_name)
  File "/Users/evilkylin/Projects/graphrag/graphrag/index/utils/tokens.py", line 16, in num_tokens_from_string
    encoding = tiktoken.encoding_for_model(model)
  File "/Users/evilkylin/Library/Caches/pypoetry/virtualenvs/graphrag-g0mKwYYC-py3.10/lib/python3.10/site-packages/tiktoken/model.py", line 103, in encoding_for_model
    return get_encoding(encoding_name_for_model(model_name))
  File "/Users/evilkylin/Library/Caches/pypoetry/virtualenvs/graphrag-g0mKwYYC-py3.10/lib/python3.10/site-packages/tiktoken/model.py", line 90, in encoding_name_for_model
    raise KeyError(
KeyError: 'Could not automatically map gemma2-9b-it to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

Related Issues

Proposed Changes

  • Catch the KeyError raised by tiktoken.encoding_for_model and fall back to the default encoding, cl100k_base.

Checklist

  • [ ] I have tested these changes locally.
  • [ ] I have reviewed the code changes.
  • [ ] I have updated the documentation (if necessary).
  • [ ] I have added appropriate unit tests (if applicable).

Additional Notes


KylinMountain, Jul 11 '24 09:07