Tiktoken import bug?

Open sudowoodo200 opened this issue 1 year ago • 7 comments

https://github.com/hwchase17/langchain/blob/adcad98bee03ac8486f328b4f316017a6ccfc808/langchain/embeddings/openai.py#L159

Getting a "no attribute" error for tiktoken.model. I believe this is because tiktoken has changed its import structure, per code here. Should this change to tiktoken.encoding_for_model(self.model)?

sudowoodo200 avatar Apr 29 '23 23:04 sudowoodo200

Having this issue as well.

shawnesquivel avatar May 04 '23 19:05 shawnesquivel

Changing to tiktoken.encoding_for_model(self.model) as you recommended gave me this error:

AttributeError: module 'tiktoken' has no attribute 'encoding_for_model'

tiktoken 0.1.2

shawnesquivel avatar May 04 '23 19:05 shawnesquivel
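Since the attribute seems to be present in some tiktoken installs and absent in others (0.1.2 above), one option is to probe for it rather than hard-code a path. The following is only a sketch: find_encoding_for_model is a hypothetical helper, and the SimpleNamespace objects are stand-ins for differently laid-out packages, not the real tiktoken module.

```python
from types import SimpleNamespace
from typing import Callable, Optional


def find_encoding_for_model(tk) -> Optional[Callable]:
    """Return an encoding_for_model callable from a module-like object, or None."""
    # Layout 1: the function is exported at the package top level.
    fn = getattr(tk, "encoding_for_model", None)
    if callable(fn):
        return fn
    # Layout 2: the function lives in a `model` submodule (the path langchain used).
    fn = getattr(getattr(tk, "model", None), "encoding_for_model", None)
    if callable(fn):
        return fn
    # Neither path exists, as reported for tiktoken 0.1.2 in this thread.
    return None


# Hypothetical stand-ins for the three situations, for illustration only.
top_level = SimpleNamespace(encoding_for_model=lambda name: f"enc:{name}")
submodule = SimpleNamespace(
    model=SimpleNamespace(encoding_for_model=lambda name: f"enc:{name}")
)
neither = SimpleNamespace(get_encoding=lambda name: name)

for label, candidate in [("top", top_level), ("sub", submodule), ("none", neither)]:
    print(label, find_encoding_for_model(candidate) is not None)
```

With a real install you would pass the imported tiktoken module itself; a None result suggests upgrading the package rather than patching the call site.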

Here is the package's __init__.py from the tiktoken source:

from .core import Encoding as Encoding
from .registry import get_encoding as get_encoding
from .registry import list_encoding_names as list_encoding_names

So list_encoding_names returns the list of valid encoding names. In langchain/embeddings/openai.py, old:

    # encoding = tiktoken.model.encoding_for_model(self.model)

New:

    # Check the list of valid encoding names: ['gpt2', 'r50k_base', 'p50k_base', 'cl100k_base']
    print(tiktoken.list_encoding_names())
    encoding = tiktoken.get_encoding(self.model)

As you can see in the source code, the model defaults to model: str = "text-embedding-ada-002". But get_encoding does not accept that value:

File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tiktoken/registry.py", line 60, in get_encoding
    raise ValueError(f"Unknown encoding {encoding_name}")
ValueError: Unknown encoding text-embedding-ada-002

In your project code, e.g., main.py, you must pass one of the encoding names from the list_encoding_names output. So I set model="gpt2" and it worked (you can use any of these: ['gpt2', 'r50k_base', 'p50k_base', 'cl100k_base']).

    # model defaults to 'text-embedding-ada-002', which results in the unknown encoding error
    embeddings = OpenAIEmbeddings(
        model="gpt2", openai_api_key=os.environ.get("OPENAI_API_KEY")
    )

shawnesquivel avatar May 04 '23 21:05 shawnesquivel
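The ValueError above comes from passing a model name where get_encoding expects an encoding name. A possible alternative to changing the model is to translate the name first. This is a minimal sketch: encoding_name_for and MODEL_TO_ENCODING are hypothetical names, and the table holds only the one pairing relevant here (text-embedding-ada-002 is tokenized with cl100k_base).

```python
# Hypothetical lookup table; only the pairing relevant to this thread is listed.
MODEL_TO_ENCODING = {"text-embedding-ada-002": "cl100k_base"}

# The names printed by tiktoken.list_encoding_names() earlier in the thread.
KNOWN_ENCODINGS = ["gpt2", "r50k_base", "p50k_base", "cl100k_base"]


def encoding_name_for(name, known_encodings=KNOWN_ENCODINGS):
    """Translate a model name into an encoding name that get_encoding accepts."""
    if name in known_encodings:
        # Already an encoding name, so pass it through unchanged.
        return name
    if name in MODEL_TO_ENCODING:
        return MODEL_TO_ENCODING[name]
    raise ValueError(f"No known encoding for {name!r}")


print(encoding_name_for("text-embedding-ada-002"))
print(encoding_name_for("gpt2"))
```

The result could then be handed to tiktoken.get_encoding, which leaves the embedding model setting untouched and changes only the name used for tokenization.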

In summary, my proposed changes that made it work for me are:

https://github.com/hwchase17/langchain/blob/master/langchain/embeddings/openai.py#L188 Change encoding = tiktoken.model.encoding_for_model(self.model) to encoding = tiktoken.get_encoding(self.model)

https://github.com/hwchase17/langchain/blob/master/langchain/embeddings/openai.py#L107 Change the default to: model: str = "gpt2"

shawnesquivel avatar May 04 '23 21:05 shawnesquivel

@shawnesquivel did you create a PR with this fix?

dmytrokarpovych avatar May 15 '23 17:05 dmytrokarpovych

@dmytrokarpovych

There is an open PR that is somewhat related: #3819

I don't think it incorporates the default model that I used though.

shawnesquivel avatar May 15 '23 18:05 shawnesquivel

I had this issue and was able to resolve it by installing faiss-cpu

pip install faiss-cpu

rahdor avatar Jun 13 '23 20:06 rahdor

I have the same problem, and @rahdor, your suggestion had no effect for me 🤡

heavenkiller2018 avatar Jun 23 '23 21:06 heavenkiller2018

I also received this message, with more details: "most likely due to a circular import"

After tracing the packages, I found that my local .py file had the same name as a module being imported: "token.py". After I renamed the local file, it worked without problems.

Avigin avatar Aug 30 '23 11:08 Avigin

Hi, @sudowoodo200. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue is about a bug in the import of the tiktoken library. The suggested change in the import code to tiktoken.encoding_for_model(self.model) did not work for one user, but they found a workaround by using tiktoken.get_encoding(self.model) instead. Another user mentioned that installing faiss-cpu resolved the issue for them. There is an open pull request related to this issue, but it does not include the default model fix.

If this issue is still relevant to the latest version of the LangChain repository, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself. If we don't hear back from you, the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. If you have any further questions or concerns, please don't hesitate to reach out.

Best regards, Dosu

dosubot[bot] avatar Nov 29 '23 16:11 dosubot[bot]

I named my Python script tiktoken.py, which is what caused the error. I renamed it and it works now.

pepijnolivier avatar Jul 30 '24 12:07 pepijnolivier
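Both shadowing reports in this thread (token.py and tiktoken.py) can be checked for up front. This is a rough sketch, not a general-purpose tool: find_shadowing_files is a hypothetical helper, and the default names are just the two files mentioned above.

```python
import tempfile
from pathlib import Path


def find_shadowing_files(directory, module_names=("token", "tiktoken")):
    """List local .py files whose names collide with the given module names."""
    directory = Path(directory)
    hits = []
    for name in module_names:
        candidate = directory / f"{name}.py"
        # A local file with this name is found before the real module on import.
        if candidate.is_file():
            hits.append(candidate.name)
    return sorted(hits)


# Demonstration in a throwaway directory rather than a real project.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "token.py").write_text("")
    Path(tmp, "main.py").write_text("")
    print(find_shadowing_files(tmp))
```

Running it against the directory that holds your entry script is usually enough, since that directory is placed first on the import path.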