[compatibility issue] Support open-source LLM models in prompt-tune
Description
When using an open-source LLM such as gemma2-9b-it for prompt tuning,

```
python -m graphrag.prompt_tune --root . --domain "novels" --language English --chunk-size 300
```

it fails with the error `Could not automatically map gemma2-9b-it to a tokeniser. Please use tiktoken.get_encoding to explicitly get the tokeniser you expect.`
The full exception log:
```
Traceback (most recent call last):
  File "/Users/evilkylin/Projects/miniforge3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/evilkylin/Projects/miniforge3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/evilkylin/Projects/graphrag/graphrag/prompt_tune/__main__.py", line 108, in <module>
    loop.run_until_complete(
  File "/Users/evilkylin/Projects/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/Users/evilkylin/Projects/graphrag/graphrag/prompt_tune/cli.py", line 62, in fine_tune
    await fine_tune_with_config(
  File "/Users/evilkylin/Projects/graphrag/graphrag/prompt_tune/cli.py", line 132, in fine_tune_with_config
    await generate_indexing_prompts(
  File "/Users/evilkylin/Projects/graphrag/graphrag/prompt_tune/cli.py", line 215, in generate_indexing_prompts
    create_entity_extraction_prompt(
  File "/Users/evilkylin/Projects/graphrag/graphrag/prompt_tune/generator/entity_extraction_prompt.py", line 58, in create_entity_extraction_prompt
    - num_tokens_from_string(prompt, model=model_name)
  File "/Users/evilkylin/Projects/graphrag/graphrag/index/utils/tokens.py", line 16, in num_tokens_from_string
    encoding = tiktoken.encoding_for_model(model)
  File "/Users/evilkylin/Library/Caches/pypoetry/virtualenvs/graphrag-g0mKwYYC-py3.10/lib/python3.10/site-packages/tiktoken/model.py", line 103, in encoding_for_model
    return get_encoding(encoding_name_for_model(model_name))
  File "/Users/evilkylin/Library/Caches/pypoetry/virtualenvs/graphrag-g0mKwYYC-py3.10/lib/python3.10/site-packages/tiktoken/model.py", line 90, in encoding_name_for_model
    raise KeyError(
KeyError: 'Could not automatically map gemma2-9b-it to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'
```
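For context, `tiktoken.encoding_for_model` only maps known OpenAI model names to encodings, so any non-OpenAI model name raises `KeyError`, while fetching an encoding by name still works:

```python
import tiktoken

tiktoken.encoding_for_model("gpt-4")         # works: known OpenAI model
tiktoken.get_encoding("cl100k_base")         # works: explicit encoding name
tiktoken.encoding_for_model("gemma2-9b-it")  # raises KeyError, as in the log above
```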
Related Issues
Proposed Changes
- Catch the `KeyError` raised by `tiktoken.encoding_for_model` when the model name is unknown, and fall back to the default encoding `cl100k_base` (see the sketch below).
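A minimal sketch of the fallback in `num_tokens_from_string`, assuming a signature close to the one shown in the traceback (the actual code in `graphrag/index/utils/tokens.py` may differ):

```python
import logging

import tiktoken

DEFAULT_ENCODING_NAME = "cl100k_base"
log = logging.getLogger(__name__)


def num_tokens_from_string(string: str, model: str | None = None) -> int:
    """Return the number of tokens in a text string."""
    if model is not None:
        try:
            encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            # Non-OpenAI models (e.g. gemma2-9b-it) are unknown to tiktoken;
            # fall back to the default encoding instead of crashing.
            log.warning(
                "No tiktoken encoding for model %s; falling back to %s",
                model,
                DEFAULT_ENCODING_NAME,
            )
            encoding = tiktoken.get_encoding(DEFAULT_ENCODING_NAME)
    else:
        encoding = tiktoken.get_encoding(DEFAULT_ENCODING_NAME)
    return len(encoding.encode(string))
```

The `cl100k_base` count is only an approximation for models with their own tokenisers, but prompt-tune uses it to budget prompt size, so an approximate count is preferable to a hard failure.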
Checklist
- [ ] I have tested these changes locally.
- [ ] I have reviewed the code changes.
- [ ] I have updated the documentation (if necessary).
- [ ] I have added appropriate unit tests (if applicable).