azure-search-openai-demo

During prepdocs.py, openai.BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference...

Open Mshz2 opened this issue 1 year ago • 9 comments

This issue is for a:

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run data ingestion with the command bash ./scripts/prepdocs.sh

Any log messages given by the failure

       Converting page 27 to image and uploading -> document.png
Batch Completed. Batch size  13 Token count 8077
Traceback (most recent call last):
  File "/home/azure-search-openai-demo/./scripts/prepdocs.py", line 310, in <module>
    loop.run_until_complete(main(file_strategy, azd_credential, args))
  File "/home/anaconda3/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/azure-search-openai-demo/./scripts/prepdocs.py", line 160, in main
    await strategy.run(search_info)
  File "/home/azure-search-openai-demo/scripts/prepdocslib/filestrategy.py", line 76, in run
    await search_manager.update_content(sections, blob_image_embeddings)
  File "/home/azure-search-openai-demo/scripts/prepdocslib/searchmanager.py", line 170, in update_content
    embeddings = await self.embeddings.create_embeddings(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azure-search-openai-demo/scripts/prepdocslib/embeddings.py", line 118, in create_embeddings
    return await self.create_embedding_batch(texts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azure-search-openai-demo/scripts/prepdocslib/embeddings.py", line 89, in create_embedding_batch
    async for attempt in AsyncRetrying(
  File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/tenacity/_asyncio.py", line 71, in __anext__
    do = self.iter(retry_state=self._retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/home/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/azureuser/maintenance/azure-search-openai-demo/scripts/prepdocslib/embeddings.py", line 96, in create_embedding_batch
    emb_response = await client.embeddings.create(model=self.open_ai_model_name, input=batch.texts)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/openai/resources/embeddings.py", line 198, in create
    return await self._post(
           ^^^^^^^^^^^^^^^^^
  File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/openai/_base_client.py", line 1542, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/openai/_base_client.py", line 1316, in request
    return await self._request(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/openai/_base_client.py", line 1368, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

Expected/desired behavior

OS and Version?

Ubuntu 20.04

azd version?

azd version 1.5.1

Versions

Python 3.10.13

Mshz2 avatar Jan 15 '24 11:01 Mshz2

Hi @Mshz2,

Thanks for reporting this issue. Are you using the sample data or your own custom data set? Are you using OpenAI embeddings or Azure OpenAI embeddings?

mattgotteiner avatar Jan 17 '24 00:01 mattgotteiner

@mattgotteiner Hi, thanks for the reply. I am using my own PDFs. Some of them have more than 100 pages. I'm using Azure OpenAI.

Mshz2 avatar Jan 17 '24 00:01 Mshz2

Thanks. It's possible that a single document is causing this error. We'll have to file a follow-up issue to skip documents that trigger this error, and then you can retry.
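The idea of skipping a failing document and continuing could look roughly like the sketch below. It is only an illustration, not code from the repository: `ingest_all`, `ingest_one`, and `fake_ingest` are hypothetical names, and the exception types to skip are passed in by the caller (in the real script this would presumably be `openai.BadRequestError`).

```python
import asyncio


async def ingest_all(paths, ingest_one, skip_on=(Exception,)):
    """Ingest each path in turn; collect failures instead of aborting the run."""
    failed = []
    for path in paths:
        try:
            await ingest_one(path)
        except skip_on as exc:
            # Remember the failing document so it can be inspected and retried later.
            failed.append((path, exc))
    return failed


async def fake_ingest(path):
    # Hypothetical stand-in for per-document ingestion; one file always fails.
    if path == "bad.pdf":
        raise ValueError("'$.input' is invalid")


failed = asyncio.run(
    ingest_all(["a.pdf", "bad.pdf", "c.pdf"], fake_ingest, skip_on=(ValueError,))
)
for path, exc in failed:
    print(f"skipped {path}: {exc}")
```

With this shape, the remaining documents are still processed and the list of skipped files can be retried once the cause is known.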

mattgotteiner avatar Jan 17 '24 00:01 mattgotteiner

@mattgotteiner I am also having the same issue. Some documents are fine. I'm wondering what kind of document causes this error:

raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

singloudly90 avatar Mar 06 '24 08:03 singloudly90

I have the same problem. It may be related to the switch to the new client library azure-ai-documentintelligence==1.0.0b1. With older releases or the local parser it works.

tchmahe avatar Mar 11 '24 21:03 tchmahe

I have the same problem. Did anyone find a solution? I can't use some of my PDF files. Changing the version of Azure did not work. Could it be caused by blank spaces at the beginning of the PDF? I'd appreciate any help. Thanks a lot!

czumbiehl avatar Mar 12 '24 21:03 czumbiehl

I have the same issue, and it only happens for certain files. I am not sure whether the images or tables in the file are causing it, but there is no obvious pattern among the affected files. Was anyone able to resolve the issue? Any tips would be helpful!

drajinvites82 avatar Mar 27 '24 05:03 drajinvites82

Here is an update on this issue for everyone. The solution below, from pamelafox, works:

Update: This error is happening when we pass a text of length 0 (an empty string) to the batch embeddings API. The single embedding API is fine with that input, but the batch embedding API is not. (See https://github.com/openai/openai-python/issues/576)

Now, I don't know yet why we have sections that have 0 text in them, as I don't expect that to happen in most cases (possibly for GPT4-vision, but this occurs with vision disabled as well). I'm going to ask @tonybaloney to see if it was related to the recent splitting change.
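To see which sections end up with empty text, a small diagnostic along these lines might help. It is a sketch under assumptions: `Section` and `find_empty_sections` are hypothetical names for illustration, not the repository's actual types.

```python
from dataclasses import dataclass


@dataclass
class Section:
    # Hypothetical stand-in for one chunk of document text.
    source: str
    page: int
    text: str


def find_empty_sections(sections):
    """Return the sections whose text is an empty string."""
    return [s for s in sections if len(s.text) == 0]


sections = [
    Section("report.pdf", 1, "Quarterly summary"),
    Section("report.pdf", 2, ""),
    Section("report.pdf", 3, "Appendix"),
]
for s in find_empty_sections(sections):
    print(f"empty section: {s.source} page {s.page}")
```

Logging the source file and page of each empty section before the embeddings call would show whether the empties come from particular pages, such as image-only or blank ones.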

As another workaround, you can put this code in create_embedding_batch:

    # replace any empty strings with whitespace for now
    batch.texts = [text if text else " " for text in batch.texts]
    emb_response = await client.embeddings.create(model=self.open_ai_model_name, input=batch.texts)

The batch embedding endpoint seems fine with a whitespace string.
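A different approach from the whitespace workaround is to leave empty strings out of the request entirely and keep the results aligned with the original list. The sketch below illustrates that idea only; it is not the fix that landed in the repository. `embed_non_empty` and `fake_embed_batch` are hypothetical names, and the fake function merely imitates the rejection of empty input described in this thread.

```python
import asyncio


async def embed_non_empty(texts, embed_batch):
    """Embed only non-empty texts; return a list aligned with `texts` (None for empties)."""
    kept = [(i, t) for i, t in enumerate(texts) if t]
    results = [None] * len(texts)
    if kept:
        vectors = await embed_batch([t for _, t in kept])
        for (i, _), vec in zip(kept, vectors):
            results[i] = vec
    return results


async def fake_embed_batch(batch):
    # Hypothetical stand-in for the real embeddings call: rejects empty strings.
    if any(t == "" for t in batch):
        raise ValueError("'$.input' is invalid")
    return [[float(len(t))] for t in batch]


aligned = asyncio.run(embed_non_empty(["abc", "", "de"], fake_embed_batch))
print(aligned)
```

The caller then has to decide what to do with the `None` entries, for example dropping those sections before they are uploaded to the index.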

https://github.com/Azure-Samples/azure-search-openai-demo/issues/1415

drajinvites82 avatar Mar 27 '24 07:03 drajinvites82

There was a proper fix in the code base about 2 weeks ago. Please pull or download the latest release.

tonybaloney avatar Mar 27 '24 08:03 tonybaloney