Auto-set default encoding_model
### Do you need to file an issue?
- [X] I have searched the existing issues and this feature is not already filed.
- [X] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [X] I believe this is a legitimate feature request, not just a question. If this is a question, please use the Discussions area.
### Is your feature request related to a problem? Please describe.
As OpenAI releases new models, they periodically update the encoding used by tiktoken. This has caused some confusion among users: our default model/encoding pair is gpt-4-turbo with cl100k_base, but as folks start to use gpt-4o-family models they need to change the encoding to o200k_base.
### Describe the solution you'd like
Tiktoken maintains a model-to-encoding mapping, and `tiktoken.encoding_name_for_model` can be used to automatically look up any supported pairing. So this implementation would:
- Set the default for `encoding_model` in the GraphRAG config to `None`
- At load time, if `encoding_model` is not specified, look it up from the LLM model name using tiktoken
This leaves the config available for folks who want to adjust it, but sets a reasonable fallback to reduce confusion for most users.
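A minimal sketch of the load-time lookup described above. The function and constant names are hypothetical (not GraphRAG's actual API); `tiktoken.encoding_name_for_model` is the real tiktoken helper, which raises `KeyError` for unknown models:

```python
from __future__ import annotations

# Assumption: this matches GraphRAG's current default encoding.
DEFAULT_ENCODING = "cl100k_base"


def resolve_encoding_model(encoding_model: str | None, llm_model: str) -> str:
    """Return the user-configured encoding, else look it up from the model name."""
    if encoding_model is not None:
        return encoding_model  # explicit config always wins

    import tiktoken  # tiktoken ships the model -> encoding mapping

    try:
        return tiktoken.encoding_name_for_model(llm_model)
    except KeyError:
        # Model not in tiktoken's table yet: fall back to a known encoding.
        return DEFAULT_ENCODING
```

With something like this in place, a config that omits `encoding_model` should resolve to o200k_base for gpt-4o and cl100k_base for gpt-4-turbo, while an explicit value passes through unchanged.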
### Additional context
Note that chunking, entity extraction, and claim extraction each fall back automatically to the root `encoding_model`, so we need to make sure all of these fallbacks roll up correctly.
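The roll-up behavior can be sketched as a simple resolution step (the section keys here are illustrative, not the actual GraphRAG config names): each section uses its own override when present, else the root value.

```python
from __future__ import annotations


def roll_up_encodings(
    root_encoding: str, section_overrides: dict[str, str | None]
) -> dict[str, str]:
    """Resolve each section's encoding, inheriting the root when unset."""
    return {
        section: override if override is not None else root_encoding
        for section, override in section_overrides.items()
    }
```

The key point is that once the root default becomes a tiktoken lookup instead of a hard-coded string, this resolution must happen after that lookup, so unset sections inherit the resolved value rather than `None`.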