langchain
Tiktoken import bug?
https://github.com/hwchase17/langchain/blob/adcad98bee03ac8486f328b4f316017a6ccfc808/langchain/embeddings/openai.py#L159
Getting "no attribute" error for tiktoken.model. Believe that this is because tiktoken has changed their import model, per code here. Change to tiktoken.encoding_for_model(self.model)?
Having this issue as well.
Changing to tiktoken.encoding_for_model(self.model) as you recommended gave me this error:
AttributeError: module 'tiktoken' has no attribute 'encoding_for_model'
tiktoken 0.1.2
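Since the two reports above differ in which attributes tiktoken exposes, it may help to check what the installed copy actually provides before patching anything. The following is a minimal diagnostic sketch, not part of langchain or tiktoken; the helper names installed_version and available_attrs are hypothetical, and the thread does not state which tiktoken release added or removed each attribute.

```python
import importlib
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def available_attrs(module, names):
    """Map each attribute name to whether the module object exposes it."""
    return {name: hasattr(module, name) for name in names}


if __name__ == "__main__":
    print("tiktoken version:", installed_version("tiktoken"))
    try:
        tiktoken = importlib.import_module("tiktoken")
    except ImportError:
        print("tiktoken is not importable in this environment")
    else:
        # Attribute names taken from the comments in this thread.
        wanted = ["model", "encoding_for_model", "get_encoding", "list_encoding_names"]
        print(available_attrs(tiktoken, wanted))
```

If encoding_for_model shows up as False, the installed release only offers the registry functions, which matches the AttributeError reported above.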
Here is the package __init__.py from the tiktoken source:
from .core import Encoding as Encoding
from .registry import get_encoding as get_encoding
from .registry import list_encoding_names as list_encoding_names
So we can call list_encoding_names to get the list of valid encoding names.
So in langchain/embeddings/openai.py
Old:
# encoding = tiktoken.model.encoding_for_model(self.model)
New:
print(tiktoken.list_encoding_names()) # check list of good encoding_names to use ['gpt2', 'r50k_base', 'p50k_base', 'cl100k_base']
encoding = tiktoken.get_encoding(self.model)
As you can see in the source code, the model defaults to model: str = "text-embedding-ada-002". But that doesn't seem to work:
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tiktoken/registry.py", line 60, in get_encoding
raise ValueError(f"Unknown encoding {encoding_name}")
ValueError: Unknown encoding text-embedding-ada-002
In your project directory, e.g., main.py, you must use one of the encoding names from the list_encoding_names output. So I set model="gpt2" and it worked. (You can use any of these: ['gpt2', 'r50k_base', 'p50k_base', 'cl100k_base'].)
# model defaults to 'text-embedding-ada-002' which results in the unknown encoding error
embeddings = OpenAIEmbeddings(
model="gpt2", openai_api_key=os.environ.get("OPENAI_API_KEY")
)
In summary, my proposed changes that made it work for me are:
https://github.com/hwchase17/langchain/blob/master/langchain/embeddings/openai.py#L188
Change encoding = tiktoken.model.encoding_for_model(self.model) to:
encoding = tiktoken.get_encoding(self.model)
https://github.com/hwchase17/langchain/blob/master/langchain/embeddings/openai.py#L107
Change:
model: str = "gpt2"
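The ValueError above comes from passing a model name to get_encoding, which expects an encoding name. A possible alternative to changing the default model is to resolve the encoding separately from the model name. This is a rough sketch under stated assumptions, not the langchain fix: resolve_encoding is a hypothetical helper, it takes the tiktoken module as an argument so it works with whichever API the installed version has, and the fallback cl100k_base is chosen from the list above because, as far as I know, it is the encoding text-embedding-ada-002 uses.

```python
def resolve_encoding(tk, model_name, fallback="cl100k_base"):
    """Return an encoding for model_name using whatever API `tk` exposes.

    `tk` is the imported tiktoken module (passed in so this helper does not
    depend on a particular tiktoken version). `fallback` is an encoding name,
    not a model name.
    """
    # Newer tiktoken releases expose encoding_for_model at the top level;
    # tiktoken 0.1.2, per the __init__.py quoted above, does not.
    by_model = getattr(tk, "encoding_for_model", None)
    if by_model is not None:
        try:
            return by_model(model_name)
        except KeyError:
            # Assumption: an unmapped model name raises KeyError.
            pass
    # get_encoding only accepts encoding names such as 'cl100k_base'.
    return tk.get_encoding(fallback)


# Usage (requires tiktoken to be installed):
#   import tiktoken
#   encoding = resolve_encoding(tiktoken, "text-embedding-ada-002")
```

This keeps model="text-embedding-ada-002" for the API call and only changes how the tokenizer is looked up.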
@shawnesquivel did you create a PR with this fix?
@dmytrokarpovych There is an open PR that is somewhat related: #3819
I don't think it incorporates the default model that I used, though.
I had this issue and was able to resolve it by installing faiss-cpu
pip install faiss-cpu
I have the same problem, and @rahdor, your suggestion had no effect for me 🤡
I also received this message, with more details: "most likely due to a circular import"
After tracking down the packages, I found that my local .py file had the same name as a module being imported: "token.py". After I renamed the local file, it worked without problems.
Hi, @sudowoodo200. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
Based on my understanding, the issue is about a bug in the import of the tiktoken library. The suggested change in the import code to tiktoken.encoding_for_model(self.model) did not work for one user, but they found a workaround by using tiktoken.get_encoding(self.model) instead. Another user mentioned that installing faiss-cpu resolved the issue for them. There is an open pull request related to this issue, but it does not include the default model fix.
If this issue is still relevant to the latest version of the LangChain repository, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself. If we don't hear back from you, the issue will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain project. If you have any further questions or concerns, please don't hesitate to reach out.
Best regards, Dosu
I named my Python script tiktoken.py, which is what gave the error.
I renamed it and it works now.
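For anyone hitting the same thing: when you run a script, its directory comes first on the import path, so a local tiktoken.py (or token.py, which collides with a standard library module) gets imported instead of the real one. Here is a small sketch for spotting such files; find_shadowing_files is a hypothetical helper, and the list of names to check is whatever you pass in.

```python
from pathlib import Path


def find_shadowing_files(directory, module_names):
    """Return local files or packages in `directory` named like `module_names`.

    A hit means `import <name>` run from that directory would pick up the
    local file rather than the installed or standard library module.
    """
    directory = Path(directory)
    hits = []
    for name in module_names:
        script = directory / f"{name}.py"          # e.g. ./tiktoken.py
        package = directory / name / "__init__.py"  # e.g. ./tiktoken/__init__.py
        if script.is_file():
            hits.append(script)
        if package.is_file():
            hits.append(package.parent)
    return hits


if __name__ == "__main__":
    # Check the current directory for the two names mentioned in this thread.
    for path in find_shadowing_files(".", ["tiktoken", "token"]):
        print("possible shadowing file:", path)
```

Renaming any file it reports, as described above, should clear the "no attribute" or "circular import" error if shadowing was the cause.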