azure-search-openai-demo
Getting error while extracting text from PDF during deployment
Please provide us with the following information:
This issue is for a: (mark with an x)
- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
Any log messages given by the failure
Expected/desired behavior
OS and Version?
Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
azd version?
run `azd version` and copy paste here.
Versions
Mention any other details that might be useful
Thanks! We'll be in touch soon.
```
Uploading blob for whole file -> Deep Learning.pdf
Extracting text from 'C:\CsuEnterpriseSearch/data\Introduction_to_algorithms-3rd Edition.pdf' using Azure Form Recognizer
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
Traceback (most recent call last):
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 484, in load_body
    self._content = await self.internal_response.read()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\client_reqrep.py", line 1037, in read
    self._body = await self.content.read()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\streams.py", line 375, in read
    block = await self.readany()
            ^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\streams.py", line 397, in readany
    await self._wait("readany")
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\CsuEnterpriseSearch\scripts\prepdocs.py", line 256, in
```
It completes extraction for one of the PDFs, but throws an error while extracting the second PDF, as shown in the log above.
Hello, I'm also getting the same error.
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object Traceback (most recent call last):
Same problem after running ./scripts/prepdocs.sh.
In my case, it happens when ingesting PDFs with a larger number of pages (such as an entire 300-page book). It will get stuck on the following prompt for about 5-15 minutes before the error happens:
Extracting text from './data/demobook.pdf' using Azure Document Intelligence
here is the full log:
```
Extracting text from './data/demobook.pdf' using Azure Document Intelligence
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
Traceback (most recent call last):
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/transport/_aiohttp.py", line 501, in load_body
    self._content = await self.internal_response.read()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1100, in read
    self._body = await self.content.read()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/streams.py", line 373, in read
    block = await self.readany()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/streams.py", line 395, in readany
    await self._wait("readany")
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/streams.py", line 302, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Applications/azure-ai-research/./scripts/prepdocs.py", line 256, in
```
Any solution? thanks
While using Azure AI Document Intelligence, I am facing a similar issue:
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
An error occurred: (InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details: (FailedToSerializeAnalyzeResult) Failed to serialize analyze results, please contact support.
Code: FailedToSerializeAnalyzeResult
Message: Failed to serialize analyze results, please contact support.
Is this issue still under consideration for resolution? Thanks
@hammad26 Are you able to email the PDF where you experienced the issue to pamelafox@ microsoft.com? If I can replicate the error, then I can more easily share it with the Document Intelligence team. Otherwise, please indicate the size of the PDF file that caused the error.
@pamelafox I have just sent you the problematic document.
Update: The Document Intelligence team is now investigating.
@pamelafox Any updates on the investigation? Thanks
@pamelafox I am facing the same issue. Any updates? Many thanks in advance.
@pamelafox Same for us when processing Excel files of a certain size. The workaround we have is to split the Excel files into multiple smaller ones.
Hi. Is there any update on this issue, or workaround please? I'm hitting the same problem, with larger PDFs, which includes some of the files in the sample dataset. Interestingly I took the "role_library.pdf" document, which has 31 pages, and extracted shortened versions of the document. When the document had 20, 25 and 30 pages, the scripts would process them successfully. So it seems like, at least in the case of that document, 30 pages was the tipping point. Though I'm sure that could vary depending on the type of content on the pages. I need to work with documents much larger than this and can't just split them up into smaller documents unfortunately. Thanks.
Just tried a different PDF. Worked at 30 pages, failed at 31.
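If splitting the document is an option for others, here is a rough sketch of that workaround: chunk a large PDF into parts of at most 30 pages before running prepdocs. This uses pypdf (not part of this repo's scripts), and the 30-page cutoff is only the tipping point observed above, not a documented service limit.

```python
# Sketch only: split a PDF into <=30-page parts so each part stays under the
# observed tipping point. Requires `pip install pypdf`.
from pypdf import PdfReader, PdfWriter


def split_pdf(path: str, pages_per_chunk: int = 30) -> list[str]:
    reader = PdfReader(path)
    chunk_paths = []
    for start in range(0, len(reader.pages), pages_per_chunk):
        writer = PdfWriter()
        for page in reader.pages[start:start + pages_per_chunk]:
            writer.add_page(page)
        chunk_path = f"{path.removesuffix('.pdf')}_part{start // pages_per_chunk + 1}.pdf"
        with open(chunk_path, "wb") as out:
            writer.write(out)
        chunk_paths.append(chunk_path)
    return chunk_paths
```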
I'm seeing this same exception as well when trying to parse longer documents. We've validated that we are able to parse shorter documents (both .pdf and .docx files). Is there a root cause for this issue?
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
(InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details: (InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Target: 0
Occasionally, we'll also encounter a 403 error when attempting to parse longer documents. It looks like this:
```
Traceback (most recent call last):
  File "/home/gptadmin/Hike2/scripts/document_intelligence__scratch.py", line 17, in <module>
    parsed_content: str = parse_text_from_pdf__azure(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gptadmin/FlaskApps/SL_APP/helpers/azure_helpers.py", line 48, in parse_text_from_pdf__azure
    poller = document_intelligence_client.begin_analyze_document(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/azure/core/tracing/decorator.py", line 76, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/gptadmin/.local/lib/python3.11/site-packages/azure/ai/documentintelligence/_operations/_operations.py", line 3627, in begin_analyze_document
    raw_result = self._analyze_document_initial(  # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gptadmin/.local/lib/python3.11/site-packages/azure/ai/documentintelligence/_operations/_operations.py", line 518, in _analyze_document_initial
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: (403) Public access is disabled. Please configure private endpoint.
Code: 403
Message: Public access is disabled. Please configure private endpoint.
```
Hi all, if you are still having issues, please email me a document if you are able to share one (pamelafox@ microsoft .com) - the team hasn't been able to replicate it recently, so we need to figure out a way to replicate it.
Unfortunately, I cannot share a document (confidential). However, I can confirm that I am seeing the following error again (as of this morning).
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
(InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details: (InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Target: 0
I am regularly encountering the same problem. However, it would seem that the larger documents will sometimes work just fine and at other times throw this error - this applies also to the example documents in this repo. Typically the only solution is retrying again later... which would suggest some internal issue with Azure Document Intelligence which would be difficult to reproduce.
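For anyone relying on the retry-later workaround, it can be automated. A minimal sketch, assuming a hypothetical analyze_pdf() helper that wraps begin_analyze_document and raises azure.core.exceptions.HttpResponseError on failure:

```python
# Sketch only: retry the analysis with exponential backoff when the service
# returns a 5xx error; client errors (e.g. the 403 above) are re-raised immediately.
import time

from azure.core.exceptions import HttpResponseError


def analyze_with_retries(path: str, attempts: int = 4, base_delay: float = 30.0):
    for attempt in range(attempts):
        try:
            return analyze_pdf(path)  # hypothetical wrapper around begin_analyze_document
        except HttpResponseError as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or exc.status_code is None or exc.status_code < 500:
                raise
            time.sleep(base_delay * (2 ** attempt))
```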
Agreed, I have had the same experience. I didn't see this error for over a week, and then this morning it is back. Unfortunately, my team is using Document Intelligence in a production workflow, so we can't afford this sort of unpredictable downtime.
@pamelafox, when can we expect a resolution to this issue?
Just weighing in with my own experience - yesterday I observed this issue all day (with many re-attempts of the same files - a large PDF).
Today, the same files were ingested with no issues.
This is the exact behavior that I observed as well. @pamelafox, do you have a root cause on why this might be the case?
Not yet, sorry! I was sent an example document to replicate earlier this week, so I will try to replicate with that today/tomorrow.
I was able to replicate the error from @jacob-roach-hike2 -
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
(InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details: (InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Target: 0
I've sent the document, code, and error to the Document Intelligence team for them to hopefully replicate as well.
Hi, @pamelafox. After plenty of internal tests, what we've found is that large PDF files combined with the formula detection feature somehow make the Document Intelligence service crash. After we removed that feature, it started working nicely again.
PS: I'm flagging this in this sample repo because we came across it while investigating the same issue we were facing.
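For reference, a rough sketch of that workaround: request prebuilt-layout without any add-on features, so the formulas add-on stays disabled. Feature names and defaults can vary between azure-ai-documentintelligence versions, and the endpoint and key below are placeholders.

```python
# Sketch only: prebuilt-layout analysis with no add-on features requested
# (i.e. no features=[DocumentAnalysisFeature.FORMULAS]).
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))  # placeholder endpoint/key
with open("large_document.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        "prebuilt-layout", analyze_request=f, content_type="application/octet-stream"
    )
result = poller.result()
```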
Any updates regarding this issue? @pamelafox
Hi, @pamelafox
I am attaching a sample document that reproduces the error, for troubleshooting purposes. Hope this is helpful.
@pamelafox Faced a similar issue with Azure Document Intelligence. The method I am using to call Document Intelligence is:
```python
import asyncio

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential


async def _get_result_from_document_intelligence(path: str):
    document_intelligence_client = DocumentIntelligenceClient(
        AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT, AzureKeyCredential(DOCUMENT_INTELLIGENCE_API_KEY)
    )
    with open(path, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout", analyze_request=f, content_type="application/octet-stream"
        )
    # poller.result() blocks while polling the long-running operation,
    # so run it in a worker thread to keep the event loop free
    response = await asyncio.to_thread(poller.result)
    return response
```
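For completeness, here is a sketch of the same call using the SDK's async client from azure.ai.documentintelligence.aio instead of running the sync poller in a worker thread; it assumes the same endpoint and key constants as above.

```python
# Sketch only: async variant of the call above; avoids asyncio.to_thread by using
# the SDK's async client and awaiting the poller directly.
from azure.ai.documentintelligence.aio import DocumentIntelligenceClient as AsyncDocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential


async def _get_result_from_document_intelligence_async(path: str):
    async with AsyncDocumentIntelligenceClient(
        AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT, AzureKeyCredential(DOCUMENT_INTELLIGENCE_API_KEY)
    ) as client:
        with open(path, "rb") as f:
            poller = await client.begin_analyze_document(
                "prebuilt-layout", analyze_request=f, content_type="application/octet-stream"
            )
        return await poller.result()
```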