JSON parsing issues with embedding create with large batch size.

Open ellisonbg opened this issue 2 years ago • 3 comments

Describe the bug

When I call Embedding.create with a large number of text chunks (a batch), I get JSON decoding errors in the response. If I keep the batch size small (say 50) it works fine, but with large batch sizes (say 12k) the problem appears. It looks very similar to the problems seen in #184. I initially saw this when using langchain, but reproduced it with openai alone. Oddly, I sometimes get an InvalidRequestError instead.

To Reproduce

Run the following code:

import openai

texts = ["AI" * 100 for _ in range(4000)]
e = openai.Embedding.create(input=texts, model="text-embedding-ada-002")
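
Since small batches reportedly work, one possible workaround is to split the input into smaller batches before calling the API. This is only a minimal sketch: the helper name `chunked` and the batch size of 50 are illustrative and not part of the openai library.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


texts = ["AI" * 100 for _ in range(4000)]
batches = list(chunked(texts, 50))
# Each batch could then be sent on its own, e.g. as input=batch.
```

Each batch would then go through a separate Embedding.create call, with the results concatenated in order.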

Code snippets

No response

OS

macOS

Python version

Python 3.11

Library version

0.26.5

ellisonbg, Mar 01 '23 02:03

Ditto, and bump.

v0.27.1

The problem is two-fold:

  • Sometimes the response comes back as a plain str instead of a parsed response object.
  • tenacity makes it worse by retrying on a clearly erroneous case.
    • It should only retry on 5xx server errors, not on client errors. A client error is guaranteed to fail again, so retrying wastes resources on both ends and only prolongs the wait until the timeout.
Traceback (most recent call last):
  File "/proj/.venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/proj/.venv/lib/python3.11/site-packages/llama_index/embeddings/openai.py", line 147, in get_embeddings
    data = openai.Embedding.create(input=list_of_text, engine=engine).data
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/proj/.venv/lib/python3.11/site-packages/openai/api_resources/embedding.py", line 38, in create
    for data in response.data:
                ^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'data'
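
The retry point above can be sketched without any library. This is an illustration under assumptions, not how tenacity or llama_index actually behave: `ApiError`, `is_retryable`, and `call_with_retry` are hypothetical names, and the error type is assumed to carry an HTTP status code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(exc: BaseException) -> bool:
    """Retry only on 5xx server errors; everything else should fail fast."""
    return isinstance(exc, ApiError) and 500 <= exc.status_code < 600


def call_with_retry(fn: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call `fn`, retrying retryable errors up to `attempts` calls in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Client errors and programming errors (such as the
            # AttributeError in the traceback) are re-raised immediately.
            if not is_retryable(exc) or attempt == attempts:
                raise
            time.sleep(delay)
    raise AssertionError("unreachable")
```

With tenacity, a predicate like `is_retryable` could presumably be plugged in through `retry=retry_if_exception(is_retryable)`, so that the AttributeError above is raised on the first attempt instead of being retried.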

Here's an example of a response that arrived as a str and caused the failure:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": "..."
    },
    {
      "object": "embedding",
      "index": 1,
      "embedding": "..."
    },

    ...(snip)...

    {
      "object": "embedding",
      "index": 8,
      "embedding": "..."
    },
    {
      "object": "embedding",
      "index": 9,
      "embedding": "..."
    }
  ],
  "model": "text-embedding-ada-002-v2",
  "usage": {
    "prompt_tokens": 7481,
    "total_tokens": 7481
  }
}
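
Given the AttributeError above, one defensive option on the caller's side is to accept either a parsed object or a raw JSON string. The sketch below assumes the parsed response is dict-like and has the shape shown in the example; `embeddings_from_response` is a hypothetical helper, not part of the openai library.

```python
import json
from typing import Any, Dict, List, Union


def embeddings_from_response(response: Union[str, Dict[str, Any]]) -> List[Any]:
    """Return embeddings ordered by index from a dict or a raw JSON string."""
    if isinstance(response, str):
        # Raises json.JSONDecodeError if the string is not valid JSON.
        response = json.loads(response)
    if not isinstance(response, dict) or "data" not in response:
        raise ValueError("response has no 'data' field")
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

This only works around the symptom; it does not explain why the library returned a str in the first place.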

kenn, Mar 11 '23 01:03

Probably related:

https://github.com/jerryjliu/gpt_index/issues/579

openai.error.InvalidRequestError: [''] is not valid under any of the given schemas - 'input'
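
The message suggests that an input list containing an empty string is rejected. Assuming that is the trigger, a caller could drop empty strings before the request and remember where the remaining texts came from. `drop_empty` is a hypothetical helper for illustration only.

```python
from typing import List, Tuple


def drop_empty(texts: List[str]) -> Tuple[List[str], List[int]]:
    """Return the non-empty texts and their positions in the original list."""
    kept: List[str] = []
    positions: List[int] = []
    for i, text in enumerate(texts):
        if text:  # skips only the empty string ""
            kept.append(text)
            positions.append(i)
    return kept, positions
```

The returned positions make it possible to map embeddings back to the original list after the call.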

kenn, Mar 11 '23 02:03

OK, this bug doesn't occur with v0.27.0, so it seems to be a v0.27.1-specific issue.

kenn, Mar 11 '23 02:03

This should be resolved in v1 of this library. If that's not the case, please open a new issue.

rattrayalex, Dec 31 '23 00:12