
'Could not automatically map SimilarityCurie001 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'


This is related to an AzureOpenAI call.

import os
import tiktoken
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import AzureOpenAI

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://xxxxxxx.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "xxxx"

embeddings = OpenAIEmbeddings(model="SimilarityCurie001-AzureDeploymentName")

text = "This is a test document."
query_result = embeddings.embed_query(text)

The error is raised on execution of the `query_result = embeddings.embed_query(text)` line.

The MODEL_TO_ENCODING variable holds the encoding mappings keyed by the models' real names, but we pass the Azure deployment name of the model in `embeddings = OpenAIEmbeddings(model="SimilarityCurie001-AzureDeploymentName")`, so the lookup fails.
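
A minimal sketch of the failing lookup, assuming tiktoken's standard API (`text-embedding-ada-002` is used here only as an example of a real model name that tiktoken does know):

import tiktoken

# A real OpenAI model name resolves to an encoding via MODEL_TO_ENCODING
enc = tiktoken.encoding_for_model("text-embedding-ada-002")
print(enc.name)  # cl100k_base

# An Azure deployment name is not in the mapping, so the lookup raises KeyError
try:
    tiktoken.encoding_for_model("SimilarityCurie001-AzureDeploymentName")
except KeyError as err:
    print(err)  # 'Could not automatically map ... to a tokeniser. ...'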

viksing avatar Apr 13 '23 21:04 viksing

This seems to be a regression issue. I also encounter this with the latest version 0.0.139.

zijie0 avatar Apr 14 '23 06:04 zijie0

OpenAIEmbeddings is using your model name (SimilarityCurie001-AzureDeploymentName) instead of the actual model name (text-similarity-curie-001) to get the encoding from Tiktoken. And of course Tiktoken doesn't know that name.

So one quick fix would be to use the actual model name as the Azure deployment name:

Instead of naming your deployment SimilarityCurie001-AzureDeploymentName, you could name it text-similarity-curie-001. Then it should work, because the name will be found in Tiktoken's mapping table.
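
For example, a minimal sketch assuming the deployment has been renamed to match the model:

from langchain.embeddings import OpenAIEmbeddings

# The deployment is now named after the real model, so tiktoken's
# model-name lookup succeeds
embeddings = OpenAIEmbeddings(model="text-similarity-curie-001")
query_result = embeddings.embed_query("This is a test document.")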

Sascha

mastix avatar Apr 14 '23 09:04 mastix

Was anyone able to fix this? I am having the same issue with text-embedding-ada-002. It was working perfectly fine 3 days ago; now the same code is giving this error: KeyError: 'Could not automatically map AzureDevEnv-text-embedding-ada-002 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

skeretna avatar Apr 17 '23 00:04 skeretna

This is my code; I am not even using tiktoken directly:

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def index_data(file_path):
    # Step 1: load the PDF and split it into pages
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()

    # Step 2: chunk the pages for embedding
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    documents = text_splitter.split_documents(pages)

    # ada_deployment_name is the Azure deployment name, defined elsewhere
    embeddings = OpenAIEmbeddings(chunk_size=1, model=ada_deployment_name)
    vectorstore = Chroma.from_documents(documents, embeddings)
    # expose this index in a retriever interface
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})
    return retriever

skeretna avatar Apr 17 '23 00:04 skeretna

As mentioned by @mastix, the issue is resolved if you use the same name for the deployment as the model name in Azure OpenAI Studio when deploying.

Adityadt68 avatar Apr 17 '23 09:04 Adityadt68

@skeretna what is the working langchain version? I updated my langchain package and got the same failure, but it worked before the update.

What's sad is that I can't change the deployment name, as the account is not managed by me. :smiling_face_with_tear:

lslslslslslslslslslsls avatar Apr 18 '23 06:04 lslslslslslslslslslsls

@oreo-yum yeah, the same thing happened to me after I upgraded. The workaround I found is to pass the model name instead of the deployment name: it's used to look up the encoding in tiktoken, which literally searches an enumeration that maps model names to encodings. The deployment name is also not managed by me, so I kept it as is but passed the model name as input.

I am still having other issues because the VPN is not allowing tiktoken to connect to the internet, but at least it's a different error.

P.S. tiktoken is used under the hood by the OpenAIEmbeddings class.

skeretna avatar Apr 18 '23 07:04 skeretna

Could we solve this by separating the deployment and model names in OpenAIEmbeddings? Here is my naive attempt to fix this for good: https://github.com/hwchase17/langchain/pull/3076

tunayokumus avatar Apr 18 '23 08:04 tunayokumus

@oreo-yum so version 0.0.132 works for me. I haven't tested 0.0.133 or 0.0.134, but from 0.0.135 onward, this bug exists.

treelover28 avatar Apr 18 '23 18:04 treelover28

@treelover28 yeah, the error went away when I reverted to v0.0.132.

skeretna avatar Apr 19 '23 02:04 skeretna

@viksing To fix this, make sure to update to version 0.0.146 and initialize the model as follows:

from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings(deployment="deployment-name", model="openai-model-name")

That worked for me!
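
Applied to the original snippet, that looks like this (the names below are placeholders for your own deployment and model):

import os
from langchain.embeddings import OpenAIEmbeddings

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://xxxxxxx.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "xxxx"

# `deployment` carries the Azure deployment name; `model` carries the real
# OpenAI model name, which tiktoken can resolve to an encoding
embeddings = OpenAIEmbeddings(
    deployment="SimilarityCurie001-AzureDeploymentName",
    model="text-similarity-curie-001",
)
query_result = embeddings.embed_query("This is a test document.")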

danielrealesp avatar Apr 21 '23 14:04 danielrealesp

Thanks!

RubenAMtz avatar Apr 24 '23 03:04 RubenAMtz

I was able to work around this by manually updating the tiktoken dictionary:

import tiktoken
tiktoken.model.MODEL_TO_ENCODING["gpt-35-turbo"] = "cl100k_base"
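
Alternatively, if you are calling tiktoken directly, you can bypass the model-name lookup entirely and fetch the encoding by name, as the error message suggests:

import tiktoken

# Fetch the encoding explicitly instead of mapping from a model name
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode("This is a test document.")))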

coltonpeltier-db avatar Jun 23 '23 11:06 coltonpeltier-db

Ran into the same issue when trying to swap in a Vicuna model and test the difference; any help would be appreciated.

bobmayuze avatar Jun 23 '23 18:06 bobmayuze

Hi, @viksing. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue you reported is related to an error in the execution of the `embed_query` function when using Azure OpenAI. The error occurs because the specified model name, "SimilarityCurie001-AzureDeploymentName", does not have a corresponding tokeniser mapping; the error message suggests using `tiktoken.get_encoding` to explicitly get the tokeniser. Other users have confirmed the issue and provided workarounds, such as using the actual model name as the Azure deployment name or manually updating the tiktoken dictionary.

I wanted to check with you if this issue is still relevant to the latest version of the LangChain repository. If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Sep 22 '23 16:09 dosubot[bot]