azure-search-openai-demo

Getting error while extracting text from PDF during deployment

Open TarunKC261 opened this issue 1 year ago • 27 comments

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Any log messages given by the failure

Expected/desired behavior

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

azd version?

run azd version and copy paste here.

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

TarunKC261 avatar Oct 31 '23 10:10 TarunKC261

Uploading blob for whole file -> Deep Learning.pdf
Extracting text from 'C:\CsuEnterpriseSearch/data\Introduction_to_algorithms-3rd Edition.pdf' using Azure Form Recognizer
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
Traceback (most recent call last):
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 484, in load_body
    self._content = await self.internal_response.read()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\client_reqrep.py", line 1037, in read
    self._body = await self.content.read()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\streams.py", line 375, in read
    block = await self.readany()
            ^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\streams.py", line 397, in readany
    await self._wait("readany")
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\CsuEnterpriseSearch\scripts\prepdocs.py", line 256, in <module>
    loop.run_until_complete(main(file_strategy, azd_credential, args))
  File "C:\Users\ChoubeTK\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 650, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\prepdocs.py", line 131, in main
    await strategy.run(search_info)
  File "C:\CsuEnterpriseSearch\scripts\prepdocslib\filestrategy.py", line 56, in run
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\prepdocslib\filestrategy.py", line 56, in <listcomp>
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\prepdocslib\pdfparser.py", line 82, in parse
    form_recognizer_results = await poller.result()
                              ^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\_async_poller.py", line 179, in result
    await self.wait()
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\_async_poller.py", line 191, in wait
    await self._polling_method.run()
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\async_base_polling.py", line 89, in run
    await self._poll()
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\async_base_polling.py", line 118, in _poll
    await self.update_status()
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\async_base_polling.py", line 140, in update_status
    self._pipeline_response = await self.request_status(self._operation.get_polling_url())
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\async_base_polling.py", line 174, in request_status
    await self._client._pipeline.run( # pylint: disable=protected-access
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 221, in run
    return await first_node.send(pipeline_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 2 more times]
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\policies\_redirect_async.py", line 73, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\policies\_retry_async.py", line 205, in send
    raise err
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\policies\_retry_async.py", line 179, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\policies\_authentication_async.py", line 94, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 3 more times]
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 106, in send
    await self._sender.send(request.http_request, **request.context.options),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 294, in send
    await response.load_body()
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 488, in load_body
    raise IncompleteReadError(err, error=err) from err
azure.core.exceptions.IncompleteReadError: Response payload is not completed

TarunKC261 avatar Oct 31 '23 10:10 TarunKC261

It completes extraction for one of the PDFs, but throws an error during extraction of the second PDF, as shown in the log above.

TarunKC261 avatar Oct 31 '23 10:10 TarunKC261

Hello, I'm also getting the same error.

Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
Traceback (most recent call last):

vicky002 avatar Nov 03 '23 12:11 vicky002

Same problem after running ./scripts/prepdocs.sh.

In my case, it happens when ingesting larger PDFs with many pages (such as an entire 300-page book). It gets stuck on the following message for about 5-15 minutes before the error happens:

Extracting text from './data/demobook.pdf' using Azure Document Intelligence

Here is the full log:

Extracting text from './data/demobook.pdf' using Azure Document Intelligence
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
Traceback (most recent call last):
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/transport/_aiohttp.py", line 501, in load_body
    self._content = await self.internal_response.read()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1100, in read
    self._body = await self.content.read()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/streams.py", line 373, in read
    block = await self.readany()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/streams.py", line 395, in readany
    await self._wait("readany")
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/streams.py", line 302, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Applications/azure-ai-research/./scripts/prepdocs.py", line 256, in <module>
    loop.run_until_complete(main(file_strategy, azd_credential, args))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/Applications/azure-ai-research/./scripts/prepdocs.py", line 131, in main
    await strategy.run(search_info)
  File "/Applications/azure-ai-research/scripts/prepdocslib/filestrategy.py", line 56, in run
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
  File "/Applications/azure-ai-research/scripts/prepdocslib/filestrategy.py", line 56, in <listcomp>
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
  File "/Applications/azure-ai-research/scripts/prepdocslib/pdfparser.py", line 82, in parse
    form_recognizer_results = await poller.result()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/_async_poller.py", line 179, in result
    await self.wait()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/_async_poller.py", line 191, in wait
    await self._polling_method.run()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/async_base_polling.py", line 89, in run
    await self._poll()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/async_base_polling.py", line 118, in _poll
    await self.update_status()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/async_base_polling.py", line 140, in update_status
    self._pipeline_response = await self.request_status(self._operation.get_polling_url())
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/async_base_polling.py", line 174, in request_status
    await self._client._pipeline.run( # pylint: disable=protected-access
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 221, in run
    return await first_node.send(pipeline_request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  [Previous line repeated 2 more times]
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/policies/_redirect_async.py", line 73, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/policies/_retry_async.py", line 205, in send
    raise err
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/policies/_retry_async.py", line 179, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/policies/_authentication_async.py", line 94, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  [Previous line repeated 3 more times]
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 106, in send
    await self._sender.send(request.http_request, **request.context.options),
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/transport/_aiohttp.py", line 311, in send
    await response.load_body()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/transport/_aiohttp.py", line 505, in load_body
    raise IncompleteReadError(err, error=err) from err
azure.core.exceptions.IncompleteReadError: Response payload is not completed

Any solution? Thanks.

YIN-Renlong avatar Dec 03 '23 15:12 YIN-Renlong

While using Azure AI Document Intelligence, I am facing a similar issue:

Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
An error occurred: (InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details:	(FailedToSerializeAnalyzeResult) Failed to serialize analyze results, please contact support.
	Code: FailedToSerializeAnalyzeResult
	Message: Failed to serialize analyze results, please contact support.

Is this issue still under consideration for resolution? Thanks

hammad26 avatar Feb 15 '24 05:02 hammad26

@hammad26 Are you able to email the PDF where you experienced the issue to pamelafox@ microsoft.com? If I can replicate the error, then I can more easily share it with the Document Intelligence team. Otherwise, please indicate the size of the PDF file that caused the error.

pamelafox avatar Feb 16 '24 00:02 pamelafox

@pamelafox I have just sent you the problematic document.

hammad26 avatar Feb 20 '24 11:02 hammad26

Update: The Document Intelligence team is now investigating.

pamelafox avatar Feb 22 '24 21:02 pamelafox

@pamelafox Any updates on the investigation? Thanks

hammad26 avatar Mar 12 '24 12:03 hammad26

@pamelafox I am facing the same issue. Any updates? Many thanks in advance.

El-Brabo avatar Mar 28 '24 13:03 El-Brabo

@pamelafox Same for us when processing Excel files above a certain size. Our workaround is to split each Excel file into multiple smaller ones.

ardab avatar Apr 26 '24 09:04 ardab

Hi. Is there any update on this issue, or workaround please? I'm hitting the same problem, with larger PDFs, which includes some of the files in the sample dataset. Interestingly I took the "role_library.pdf" document, which has 31 pages, and extracted shortened versions of the document. When the document had 20, 25 and 30 pages, the scripts would process them successfully. So it seems like, at least in the case of that document, 30 pages was the tipping point. Though I'm sure that could vary depending on the type of content on the pages. I need to work with documents much larger than this and can't just split them up into smaller documents unfortunately. Thanks.
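
One possible alternative to physically splitting a file is to request analysis in page ranges and combine the results. The following is only a sketch: `page_ranges` is a hypothetical helper, the 30-page limit comes from the observation above rather than from any documented limit, and it assumes each range string is passed as the `pages` argument of the analyze call. Whether ranged requests avoid this error has not been verified.

```python
from typing import List


def page_ranges(total_pages: int, max_pages: int = 30) -> List[str]:
    """Return 1-based page-range strings (e.g. "1-30") covering the document.

    max_pages=30 is an assumption taken from the observation in this thread,
    not a documented service limit.
    """
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be positive")
    ranges = []
    for start in range(1, total_pages + 1, max_pages):
        end = min(start + max_pages - 1, total_pages)
        # A single trailing page is written as "31" rather than "31-31".
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(page_ranges(31))  # ['1-30', '31']
```

Each returned string would then be sent as a separate analyze request for the same file, and the per-range page results concatenated in order.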

jstrugnell avatar May 17 '24 10:05 jstrugnell

Just tried a different PDF. Worked at 30 pages, failed at 31.

jstrugnell avatar May 17 '24 10:05 jstrugnell

I'm seeing this same exception as well when trying to parse longer documents. We've validated that we are able to parse shorter documents (both .pdf and .docx files). Is there a root cause for this issue?

Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
(InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details:      (InternalServerError) An unexpected error occurred.
        Code: InternalServerError
        Message: An unexpected error occurred.
        Target: 0

Occasionally, we'll also encounter a 403 error when attempting to parse longer documents. This looks like this:

Traceback (most recent call last):
  File "/home/gptadmin/Hike2/scripts/document_intelligence__scratch.py", line 17, in <module>
    parsed_content: str = parse_text_from_pdf__azure(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gptadmin/FlaskApps/SL_APP/helpers/azure_helpers.py", line 48, in parse_text_from_pdf__azure
    poller = document_intelligence_client.begin_analyze_document(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/azure/core/tracing/decorator.py", line 76, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/gptadmin/.local/lib/python3.11/site-packages/azure/ai/documentintelligence/_operations/_operations.py", line 3627, in begin_analyze_document
    raw_result = self._analyze_document_initial(  # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gptadmin/.local/lib/python3.11/site-packages/azure/ai/documentintelligence/_operations/_operations.py", line 518, in _analyze_document_initial
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: (403) Public access is disabled. Please configure private endpoint.
Code: 403
Message: Public access is disabled. Please configure private endpoint.

jacob-roach-hike2 avatar May 17 '24 16:05 jacob-roach-hike2

Hi all, if you are still having issues, please email me a document if you are able to share one (pamelafox@ microsoft .com) - the team hasn't been able to replicate it recently, so we need to figure out a way to replicate it.

pamelafox avatar May 22 '24 20:05 pamelafox

Hi all, if you are still having issues, please email me a document if you are able to share one (pamelafox@ microsoft .com) - the team hasn't been able to replicate it recently, so we need to figure out a way to replicate it.

Unfortunately, I cannot share a document (confidential). However, I can confirm that I am seeing the following error again (as of this morning).

Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
(InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details:      (InternalServerError) An unexpected error occurred.
        Code: InternalServerError
        Message: An unexpected error occurred.
        Target: 0

jacob-roach-hike2 avatar May 29 '24 12:05 jacob-roach-hike2

I am regularly encountering the same problem. However, it would seem that the larger documents will sometimes work just fine and at other times throw this error - this applies also to the example documents in this repo. Typically the only solution is retrying again later... which would suggest some internal issue with Azure Document Intelligence which would be difficult to reproduce.
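
Since the failures appear to be intermittent, "retrying again later" can be automated. The following is a minimal sketch of retrying with exponential backoff, not a fix for the underlying service issue. `retry_async` and its parameters are hypothetical names, and the operation passed in is assumed to restart the whole analysis on every attempt.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 4,
    base_delay: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Await operation(), retrying with exponential backoff plus jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

In practice `retry_on` would be narrowed to the errors seen in this thread (for example the SDK's HTTP error types) so that genuine bugs are not retried.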

pbkowalski avatar May 29 '24 13:05 pbkowalski

I am regularly encountering the same problem. However, it would seem that the larger documents will sometimes work just fine and at other times throw this error - this applies also to the example documents in this repo. Typically the only solution is retrying again later... which would suggest some internal issue with Azure Document Intelligence which would be difficult to reproduce.

Agreed, I have had the same experience. I didn't receive this error for over a week, and then this morning, I'm seeing it again. Unfortunately, my team is using Document Intelligence in a production workflow, meaning we can't tolerate this sort of unpredictable downtime.

@pamelafox, when can we expect a resolution to this issue?

jacob-roach-hike2 avatar May 29 '24 13:05 jacob-roach-hike2

Just weighing in with my own experience - yesterday I observed this issue all day (with many re-attempts of the same files - a large PDF).

Today, the same files were ingested with no issues.

laneparton avatar May 30 '24 14:05 laneparton

Just weighing in with my own experience - yesterday I observed this issue all day (with many re-attempts of the same files - a large PDF).

Today, the same files were ingested with no issues.

This is the exact behavior that I observed as well. @pamelafox, do you have a root cause on why this might be the case?

jacob-roach-hike2 avatar May 30 '24 16:05 jacob-roach-hike2

Not yet, sorry! I was sent an example document to replicate earlier this week, so I will try to replicate with that today/tomorrow.

pamelafox avatar May 30 '24 17:05 pamelafox

I was able to replicate the error from @jacob-roach-hike2 -

Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
(InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details:      (InternalServerError) An unexpected error occurred.
        Code: InternalServerError
        Message: An unexpected error occurred.
        Target: 0

I've sent the document, code, and error to the Document Intelligence team for them to hopefully replicate as well.

pamelafox avatar May 31 '24 13:05 pamelafox

Hi, @pamelafox. What we've found after plenty of internal tests is that large PDF files combined with the formula detection feature somehow make the Document Intelligence service crash. After we disabled that feature, it started working nicely again.

PS: I'm flagging this in this sample repo because the issue here looks like the same one we were facing.
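
As a sketch of that workaround, the formula add-on could be dropped only for large documents before the feature list is handed to the analyze call. `features_for` and the 30-page threshold are hypothetical, and the string "formulas" is assumed to be the value the SDK uses for its formula detection feature.

```python
from typing import Iterable, List


def features_for(
    page_count: int,
    requested: Iterable[str],
    large_doc_pages: int = 30,  # hypothetical threshold, not a documented limit
) -> List[str]:
    """Return the requested add-on features, minus "formulas" for large documents."""
    if page_count <= large_doc_pages:
        return list(requested)
    # Assumption: "formulas" is the feature name for formula detection.
    return [name for name in requested if name != "formulas"]


print(features_for(300, ["formulas", "languages"]))  # ['languages']
```

The resulting list would be passed as the `features` argument of the analyze request, so small documents keep formula detection while large ones skip it.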

danielbichuetti avatar Jun 25 '24 09:06 danielbichuetti

Any updates regarding this issue? @pamelafox

Hiba13197 avatar Jun 27 '24 22:06 Hiba13197

Hi, @pamelafox

I am attaching a sample document that is creating the error for your troubleshooting purpose. Hope this is helpful.

Artificial Intelligence - A Modern Approach.pdf

drajinvites82 avatar Jul 02 '24 18:07 drajinvites82

@pamelafox I faced a similar issue with Azure Document Intelligence. The method I am using to call Document Intelligence is:

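import asyncio

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential


async def _get_result_from_document_intelligence(path: str):
    # The endpoint and key constants are assumed to be defined elsewhere in the module.
    document_intelligence_client = DocumentIntelligenceClient(
        AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT, AzureKeyCredential(DOCUMENT_INTELLIGENCE_API_KEY)
    )

    # The initial analyze request is sent here, while the file is still open.
    with open(path, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout", analyze_request=f, content_type="application/octet-stream"
        )

    # poller.result() blocks while polling, so run it in a worker thread.
    response = await asyncio.to_thread(poller.result)
    return response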

benoit360l avatar Aug 05 '24 16:08 benoit360l