langchain
'Could not automatically map SimilarityCurie001 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'
This is related to AzureOpenAI call.
import os
import tiktoken
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import AzureOpenAI

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://xxxxxxx.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "xxxx"

embeddings = OpenAIEmbeddings(model="SimilarityCurie001-AzureDeploymentName")

text = "This is a test document."
query_result = embeddings.embed_query(text)
Getting error on the execution of 'query_result = embeddings.embed_query(text)' line.
The MODEL_TO_ENCODING dictionary maps the real model names to their encodings, but we pass the Azure deployment name of the model in embeddings = OpenAIEmbeddings(model="SimilarityCurie001-AzureDeploymentName"), so the lookup fails.
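The failing lookup can be reproduced in isolation without langchain at all. Below is a pure-Python sketch of what tiktoken does internally (the dictionary here is an illustrative two-entry subset; the two entries shown do match tiktoken's published mapping):

```python
# Simplified sketch of tiktoken's model-to-encoding lookup (illustrative subset).
MODEL_TO_ENCODING = {
    "text-similarity-curie-001": "r50k_base",
    "text-embedding-ada-002": "cl100k_base",
}

def encoding_for_model(model_name: str) -> str:
    # tiktoken.encoding_for_model does, roughly, this dictionary lookup.
    try:
        return MODEL_TO_ENCODING[model_name]
    except KeyError:
        raise KeyError(
            f"Could not automatically map {model_name} to a tokeniser. "
            "Please use tiktoken.get_encoding to explicitly get the tokeniser you expect."
        )

print(encoding_for_model("text-similarity-curie-001"))  # r50k_base
# encoding_for_model("SimilarityCurie001-AzureDeploymentName")  # raises KeyError
```

A real model name is found; an Azure deployment name is not, which is exactly the error in this issue.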
This seems to be a regression. I also encounter it with the latest version, 0.0.139.
OpenAIEmbeddings is using your deployment name (SimilarityCurie001-AzureDeploymentName) instead of the actual model name (text-similarity-curie-001) to get the encoding from tiktoken, and of course tiktoken doesn't know that name.
So one quick fix would be to use the actual model name as the Azure deployment name: instead of naming your deployment SimilarityCurie001-AzureDeploymentName, name it text-similarity-curie-001. Then it should work, because it will be found in tiktoken's mapping table.
Sascha
Was anyone able to fix this? I am having the same issue with text-embedding-ada-002. It was working perfectly fine 3 days ago; now the same code gives this error:
KeyError: 'Could not automatically map AzureDevEnv-text-embedding-ada-002 to a tokeniser. Please use tiktoken.get_encoding to explicitly get the tokeniser you expect.'
This is my code, I am not even using tiktoken directly:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def index_data(file_path):
    # Step 1: load the PDF and split it into pages
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()
    # Step 2: chunk the pages
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    documents = text_splitter.split_documents(pages)
    # ada_deployment_name is our Azure deployment name, defined elsewhere
    embeddings = OpenAIEmbeddings(chunk_size=1, model=ada_deployment_name)
    vectorstore = Chroma.from_documents(documents, embeddings)
    # Expose this index through a retriever interface
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})
    return retriever
As mentioned by @mastix, the issue is resolved if we use the same name for the deployment as the model name in Azure OpenAI Studio while deploying.
@skeretna what is the working langchain version? I updated my langchain package and get the same failure, but it worked before the update.
What's sad is that I couldn't change the deployment name, as the account is not managed by me. :smiling_face_with_tear:
@oreo-yum yeah, the same thing happened to me after I upgraded. The workaround I found is to pass the model name instead of the deployment name: the name is used to look up the encoding in tiktoken, which literally searches an enumeration that maps model names to encodings. The deployment name is also not managed by me, so I kept it as is but passed the model name as input.
I am still having other issues because the VPN is not allowing tiktoken to connect to the internet, but at least it's a different error.
P.S. tiktoken is used under the hood by OpenAIEmbeddings.
Could we solve this by separating deployment and model names in OpenAIEmbeddings? Here is my naive attempt to fix this for good: https://github.com/hwchase17/langchain/pull/3076
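The idea behind separating the two names can be sketched in a few lines (a hypothetical illustration, not the PR's actual code): one attribute addresses the Azure endpoint, the other drives the tokeniser lookup.

```python
class EmbeddingsConfig:
    """Sketch: `deployment` names the Azure resource, `model` names the tokeniser."""

    def __init__(self, deployment: str, model: str):
        self.deployment = deployment  # sent to the Azure endpoint as the deployment id
        self.model = model            # used only for the encoding lookup

    def encoding_name(self, model_to_encoding: dict) -> str:
        # Look up the encoding by the real model name, never the deployment name.
        return model_to_encoding[self.model]


cfg = EmbeddingsConfig(
    deployment="AzureDevEnv-text-embedding-ada-002",  # deployment name from this thread
    model="text-embedding-ada-002",
)
print(cfg.encoding_name({"text-embedding-ada-002": "cl100k_base"}))  # cl100k_base
```

With the two roles kept apart, arbitrary deployment names stop breaking the tokeniser lookup.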
@oreo-yum so version 0.0.132 works for me. Haven't tested 0.0.133, and 0.0.134, but from 0.0.135 onward, this bug exists.
@treelover28 yeah, the error went away when I reverted to v0.0.132.
@viksing To fix, make sure to update to version 0.0.146 and initialize the model as follows:
from langchain.embeddings import OpenAIEmbeddings
embedding = OpenAIEmbeddings(deployment="deployment-name", model="openai-model-name")
That worked for me!
Thanks!
I was able to work around this by manually updating the tiktoken dictionary:
import tiktoken
tiktoken.model.MODEL_TO_ENCODING["gpt-35-turbo"] = "cl100k_base"
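The same trick covers the Azure deployment names quoted earlier in this thread. Below is a pure-Python sketch using a stand-in dict (with the real library you would mutate `tiktoken.model.MODEL_TO_ENCODING` in exactly the same way; the deployment name is the one from the error above):

```python
# Stand-in for tiktoken.model.MODEL_TO_ENCODING (illustrative subset).
MODEL_TO_ENCODING = {"text-embedding-ada-002": "cl100k_base"}

def register_deployment(deployment_name: str, underlying_model: str) -> None:
    """Map an Azure deployment name to the encoding of the model it serves."""
    MODEL_TO_ENCODING[deployment_name] = MODEL_TO_ENCODING[underlying_model]

register_deployment("AzureDevEnv-text-embedding-ada-002", "text-embedding-ada-002")
print(MODEL_TO_ENCODING["AzureDevEnv-text-embedding-ada-002"])  # cl100k_base
```

The patch just has to run once, before the first embedding call triggers the lookup.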
Ran into the same issue when trying to swap in a vicuna model to test the difference; any help would be appreciated.
Hi, @viksing. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
Based on my understanding, the issue you reported is related to an error in the execution of the 'embed_query' function when calling AzureOpenAI. The error occurs because the specified model name, "SimilarityCurie001-AzureDeploymentName", does not have a corresponding tokeniser mapping. The error message suggests using 'tiktoken.get_encoding' to explicitly get the tokeniser, and other users have confirmed the issue and provided workarounds, such as using the actual model name as the Azure deployment name or manually updating the tiktoken dictionary.
I wanted to check with you if this issue is still relevant to the latest version of the LangChain repository. If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!