
'Could not automatically map SimilarityCurie001 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'


This is related to an AzureOpenAI call.

import os
import tiktoken
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import AzureOpenAI

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://xxxxxxx.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "xxxx"

embeddings = OpenAIEmbeddings(model="SimilarityCurie001-AzureDeploymentName")

text = "This is a test document."
query_result = embeddings.embed_query(text)

The error is raised on execution of the `query_result = embeddings.embed_query(text)` line.

The MODEL_TO_ENCODING variable holds the encoding mappings keyed by the models' real names, but we pass the Azure deployment name of the model in `embeddings = OpenAIEmbeddings(model="SimilarityCurie001-AzureDeploymentName")`, so the lookup fails.
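
A minimal sketch of the failing lookup, assuming tiktoken's standard API (`text-embedding-ada-002` is used here only as an example of a real model name that tiktoken does know):

import tiktoken

# A real OpenAI model name resolves to an encoding via MODEL_TO_ENCODING
enc = tiktoken.encoding_for_model("text-embedding-ada-002")
print(enc.name)  # cl100k_base

# An Azure deployment name is not in the mapping, so the lookup raises KeyError
try:
    tiktoken.encoding_for_model("SimilarityCurie001-AzureDeploymentName")
except KeyError as err:
    print(err)  # 'Could not automatically map ... to a tokeniser. ...'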

viksing avatar Apr 13 '23 21:04 viksing

This seems to be a regression issue. I also encounter this with the latest version 0.0.139.

zijie0 avatar Apr 14 '23 06:04 zijie0

OpenAIEmbeddings is using your model name (SimilarityCurie001-AzureDeploymentName) instead of the actual model name (text-similarity-curie-001) to get the encoding from Tiktoken. And of course Tiktoken doesn't know that name.

So one quick fix would be to use the actual model name as the Azure deployment name:

Instead of naming your deployment SimilarityCurie001-AzureDeploymentName, you could name it text-similarity-curie-001. Then it should work, because the name will be found in Tiktoken's mapping table.
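
For example, a minimal sketch assuming the deployment has been renamed to match the model:

from langchain.embeddings import OpenAIEmbeddings

# The deployment is now named after the real model, so tiktoken's
# model-name lookup succeeds
embeddings = OpenAIEmbeddings(model="text-similarity-curie-001")
query_result = embeddings.embed_query("This is a test document.")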

Sascha

mastix avatar Apr 14 '23 09:04 mastix

Was anyone able to fix this? I am having the same issue with text-embedding-ada-002. It was working perfectly fine 3 days ago; now the same code is giving this error: KeyError: 'Could not automatically map AzureDevEnv-text-embedding-ada-002 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

skeretna avatar Apr 17 '23 00:04 skeretna

This is my code; I am not even using tiktoken directly:

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def index_data(file_path):
    # Step 1: load the PDF and split it into pages
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()

    # Step 2: chunk the pages for embedding
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    documents = text_splitter.split_documents(pages)

    # ada_deployment_name is the Azure deployment name, defined elsewhere
    embeddings = OpenAIEmbeddings(chunk_size=1, model=ada_deployment_name)
    vectorstore = Chroma.from_documents(documents, embeddings)
    # expose this index in a retriever interface
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})
    return retriever

skeretna avatar Apr 17 '23 00:04 skeretna

As mentioned by @mastix, the issue is resolved if you use the same name for the deployment as the model name in Azure OpenAI Studio when deploying.

Adityadt68 avatar Apr 17 '23 09:04 Adityadt68

@skeretna what is the working langchain version? I updated my langchain package and got the same failure, but it worked before the update.

What's sad is that I can't change the deployment name, as the account is not managed by me. :smiling_face_with_tear:

lslslslslslslslslslsls avatar Apr 18 '23 06:04 lslslslslslslslslslsls

@oreo-yum yeah, the same thing happened to me after I upgraded. The workaround I found is to pass the model name instead of the deployment name: it's used to look up the encoding in tiktoken, which literally searches an enumeration that maps model names to encodings. The deployment name is also not managed by me, so I kept it as is but passed the model name as input.

I am still having other issues because the VPN is not allowing tiktoken to connect to the internet, but at least it's a different error.

P.S. tiktoken is used under the hood by the OpenAIEmbeddings class.

skeretna avatar Apr 18 '23 07:04 skeretna

Could we solve this by separating the deployment and model names in OpenAIEmbeddings? Here is my naive attempt to fix this for good: https://github.com/hwchase17/langchain/pull/3076

tunayokumus avatar Apr 18 '23 08:04 tunayokumus

@oreo-yum so version 0.0.132 works for me. I haven't tested 0.0.133 or 0.0.134, but from 0.0.135 onward, this bug exists.

treelover28 avatar Apr 18 '23 18:04 treelover28

@treelover28 yeah, the error went away when I reverted to v0.0.132.

skeretna avatar Apr 19 '23 02:04 skeretna

@viksing To fix this, make sure to update to version 0.0.146 and initialize the model as follows:

from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings(deployment="deployment-name", model="openai-model-name")

That worked for me!
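
Applied to the original snippet, that looks like this (the names below are placeholders for your own deployment and model):

import os
from langchain.embeddings import OpenAIEmbeddings

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://xxxxxxx.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "xxxx"

# `deployment` carries the Azure deployment name; `model` carries the real
# OpenAI model name, which tiktoken can resolve to an encoding
embeddings = OpenAIEmbeddings(
    deployment="SimilarityCurie001-AzureDeploymentName",
    model="text-similarity-curie-001",
)
query_result = embeddings.embed_query("This is a test document.")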

danielrealesp avatar Apr 21 '23 14:04 danielrealesp

Thanks!

RubenAMtz avatar Apr 24 '23 03:04 RubenAMtz

I was able to work around this by manually updating the tiktoken dictionary:

import tiktoken
tiktoken.model.MODEL_TO_ENCODING["gpt-35-turbo"] = "cl100k_base"
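
Alternatively, if you are calling tiktoken directly, you can bypass the model-name lookup entirely and fetch the encoding by name, as the error message suggests:

import tiktoken

# Fetch the encoding explicitly instead of mapping from a model name
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode("This is a test document.")))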

coltonpeltier-db avatar Jun 23 '23 11:06 coltonpeltier-db

Ran into the same issue when trying to swap in a Vicuna model and test the difference; any help would be appreciated.

bobmayuze avatar Jun 23 '23 18:06 bobmayuze

Hi, @viksing. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue you reported is related to an error in the execution of the `embed_query` function when using Azure OpenAI. The error occurs because the specified model name, "SimilarityCurie001-AzureDeploymentName", does not have a corresponding tokeniser mapping; the error message suggests using `tiktoken.get_encoding` to explicitly get the tokeniser. Other users have confirmed the issue and provided workarounds, such as using the actual model name as the Azure deployment name or manually updating the tiktoken dictionary.

I wanted to check with you if this issue is still relevant to the latest version of the LangChain repository. If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Sep 22 '23 16:09 dosubot[bot]