azure-search-openai-demo
ResourceNotFoundError in prepdocs phase
Please provide us with the following information:
This issue is for a: (mark with an x)
- [X] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
azd provision (and all resources are created fine, I think)
Any log messages given by the failure
Packaging services (azd package)
(✓) Done: Packaging service backend
- Package Output: /tmp/azure-search-openai-demo-backend-azddeploy-1712906840.zip
SUCCESS: Your application was packaged for Azure in 51 seconds.
Checking if authentication should be setup...
Loading azd .env file from current environment...
AZURE_USE_AUTHENTICATION is set, proceeding with authentication setup...
Creating Python virtual environment "app/backend/.venv"...
Installing dependencies from "requirements.txt" into virtual environment (in quiet mode)...
Not setting up authentication.
Provisioning Azure resources (azd provision)
Provisioning Azure resources can take some time.
Subscription: icdev (bb369bdb-6d2a-483a-ac31-fa61b10cacfa)
Location: East Asia
(-) Skipped: Didn't find new changes.
Loading azd .env file from current environment...
Creating Python virtual environment "app/backend/.venv"...
Installing dependencies from "requirements.txt" into virtual environment (in quiet mode)...
Not updating authentication.
Loading azd .env file from current environment...
Creating Python virtual environment "app/backend/.venv"...
Installing dependencies from "requirements.txt" into virtual environment (in quiet mode)...
Running "prepdocs.py"
Using local files: ./data/*
Ensuring search index gptkbindex exists
Search index gptkbindex already exists
Skipping ./data/employee_handbook.pdf, no changes detected.
Ingesting '2190.json'
Splitting '2190.json' into sections
Uploading blob for whole file -> 2190.json
Computed embeddings in batch. Batch size: 2, Token count: 303
Ingesting 'query.json'
Splitting 'query.json' into sections
Uploading blob for whole file -> query.json
Computed embeddings in batch. Batch size: 7, Token count: 2199
Ingesting '2192.json'
Splitting '2192.json' into sections
Uploading blob for whole file -> 2192.json
Computed embeddings in batch. Batch size: 2, Token count: 375
Ingesting '2191.json'
Splitting '2191.json' into sections
Uploading blob for whole file -> 2191.json
Computed embeddings in batch. Batch size: 2, Token count: 418
Ingesting '2189.json'
Splitting '2189.json' into sections
Uploading blob for whole file -> 2189.json
Computed embeddings in batch. Batch size: 1, Token count: 205
Ingesting 'Benefit_Options.pdf'
Extracting text from './data/Benefit_Options.pdf' using Azure Document Intelligence
Traceback (most recent call last):
File "/workspaces/azure-search-openai-demo/./app/backend/prepdocs.py", line 494, in
ERROR: failed running post hooks: 'postprovision' hook failed with exit code: '1', Path: '/tmp/azd-postprovision-1942051034.sh'. : exit code: 1
ERROR: error executing step command 'provision': failed running post hooks: 'postprovision' hook failed with exit code: '1', Path: '/tmp/azd-postprovision-1942051034.sh'. : exit code: 1
Expected/desired behavior
OS and Version?
GitHub Codespaces with Python 3.11
azd version?
azd version 1.8.0 (commit 8246323c2472148288be4b3cbc3c424bd046b985)
Versions
main branch as of April 12, 2024.
Mention any other details that might be useful
Make sure your Document Intelligence service is in one of these regions: westus2, eastus, or westeurope.
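One way to verify the region of an existing resource (a minimal sketch, assuming the azure-identity and azure-mgmt-cognitiveservices packages are installed; the subscription, resource group, and account names are placeholders):

```python
# Sketch: look up the region of an existing Document Intelligence resource.
# Assumes: pip install azure-identity azure-mgmt-cognitiveservices
# <subscription-id>, <resource-group>, <account-name> are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)
account = client.accounts.get("<resource-group>", "<account-name>")
print(account.location)  # e.g. "westus2", "eastus", "westeurope"
```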
In my case the issue only occurs with the latest azure-ai-documentintelligence package when private endpoints are present. Using the older package azure.ai.formrecognizer seems to solve the issue.
@jimmylevell You say you're using private endpoints? Did you add those manually in the Portal? We have a PR which adds support for private endpoints, but it's not yet in main. We're still testing that, so I don't know if we've seen issues with using Document Intelligence. FYI to @mattgotteiner who's working on that PR.
@pamelafox thank you for your fast reply. Based on internal policies, we needed to deploy the solution manually within Azure. In this process, all Azure resources were configured with private endpoints. The solution is working as expected within our tenant (Azure Switzerland North). The only change we needed to introduce was reverting to an older Form Recognizer dependency in pdfparser.py:
```diff
- from azure.ai.documentintelligence.aio import DocumentIntelligenceClient
+ from azure.ai.formrecognizer.aio import DocumentAnalysisClient
```
The issue can also be illustrated with the following demo code:
Working Sample
Based on Document Intelligence Studio Sample
```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "https://<private-endpoint-instance>.cognitiveservices.azure.com/"
key = ""
formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

poller = document_analysis_client.begin_analyze_document_from_url("prebuilt-document", formUrl)
result = poller.result()

for kv_pair in result.key_value_pairs:
    if kv_pair.key and kv_pair.value:
        print("Key '{}': Value: '{}'".format(kv_pair.key.content, kv_pair.value.content))
    else:
        print("Key '{}': Value:".format(kv_pair.key.content))
```
Non-working Sample
Based on MS Docs Sample
```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult

endpoint = "https://<same private endpoint instance>.cognitiveservices.azure.com/"
key = ""

document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

with open("./data/OHB5336.pdf", "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout", analyze_request=f, content_type="application/octet-stream"
    )

result: AnalyzeResult = poller.result()
```
Therefore, I believe the issue is related to the newer Document Intelligence package.
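To narrow it down, one can probe the same endpoint with both SDK generations side by side (a minimal sketch reusing only the client calls shown above; the endpoint, key, and file path are placeholders):

```python
# Sketch: probe the same endpoint with both SDK generations and report
# which one raises ResourceNotFoundError. Endpoint/key/file are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import ResourceNotFoundError
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.ai.documentintelligence import DocumentIntelligenceClient

endpoint = "https://<private-endpoint-instance>.cognitiveservices.azure.com/"
credential = AzureKeyCredential("<key>")

with open("./data/sample.pdf", "rb") as f:
    data = f.read()

# Older SDK: requests go to the /formrecognizer/* route.
try:
    poller = DocumentAnalysisClient(endpoint, credential).begin_analyze_document(
        "prebuilt-layout", document=data
    )
    poller.result()
    print("azure-ai-formrecognizer: OK")
except ResourceNotFoundError as e:
    print(f"azure-ai-formrecognizer: {e.message}")

# Newer SDK: requests go to the /documentintelligence/* route.
try:
    poller = DocumentIntelligenceClient(endpoint, credential).begin_analyze_document(
        "prebuilt-layout", analyze_request=data, content_type="application/octet-stream"
    )
    poller.result()
    print("azure-ai-documentintelligence: OK")
except ResourceNotFoundError as e:
    print(f"azure-ai-documentintelligence: {e.message}")
```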
I'm seeing this too, freshly downloaded repository.
Hm, so we've gotten private endpoints working in PR here: https://github.com/Azure-Samples/azure-search-openai-demo/pull/864/files So I'm looking to see what Document Intelligence specific changes are in there that might be relevant. There's this configuration of network bypass: https://github.com/Azure-Samples/azure-search-openai-demo/pull/864/files#diff-8a64001dc63e4053382af7bbd6519e074e3a637a9e0a50b5b6e8ca136b4224ceR37 That's the only change I see specific to Document Intelligence.
We don't change our URL for Document Intelligence. Did you change your URL? I don't believe that should be necessary.
(I am still learning Azure networking so I may be wrong)
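One quick sanity check for the DNS path (a sketch; the hostname is a placeholder): with a private endpoint and a correctly linked private DNS zone, the default hostname should resolve to a private IP from inside the VNet:

```python
# Sketch: verify what the default endpoint hostname resolves to. From inside
# the VNet it should be a private IP (e.g. 10.x.x.x) when the private endpoint
# and private DNS zone are wired up correctly. Hostname is a placeholder.
import socket

print(socket.gethostbyname("<resource>.cognitiveservices.azure.com"))
```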
cc @mattgotteiner in case he has insights
Just to clarify, we have the main branch of the app running in our Azure infrastructure. Each resource only uses private endpoints. As mentioned, we configured each resource manually, but no code changes were required (besides the mentioned Form Recognizer library). Let me know if any further information would help your PR.
What is weird is that I am using the same configuration once with the Form Recognizer library and once with the Document Intelligence library, with the latter throwing the ResourceNotFound error. I am accessing the default Form Recognizer URL provided in the Azure portal, which points to the private endpoint IP.
(same here)
Had the same issue. The root cause was deploying the Document Intelligence resource into the Australia East region (needed for a client demonstration). The resolution was to implement a solution similar to what was recommended here by @jimmylevell :)
Replacing the Document Intelligence components (i.e. DocumentIntelligenceClient and DocumentTable) with Form Recognizer components (i.e. DocumentAnalysisClient and FormTable), with the code below:
```python
import html

from azure.ai.formrecognizer.aio import DocumentAnalysisClient
from azure.ai.formrecognizer import FormTable

# ... (Page, the DocumentAnalysisParser class, and the self.* attributes are
# defined in the repo's pdfparser.py; only the changed method bodies follow)

        async with DocumentAnalysisClient(
            endpoint=self.endpoint, credential=self.credential
        ) as document_analysis_client:
            poller = await document_analysis_client.begin_analyze_document(
                model_id=self.model_id, document=content
            )
            form_recognizer_results = await poller.result()

            offset = 0
            for page_num, page in enumerate(form_recognizer_results.pages):
                tables_on_page = [
                    table
                    for table in (form_recognizer_results.tables or [])
                    if table.bounding_regions and table.bounding_regions[0].page_number == page_num + 1
                ]

                # mark all positions of the table spans in the page
                page_offset = page.spans[0].offset
                page_length = page.spans[0].length
                table_chars = [-1] * page_length
                for table_id, table in enumerate(tables_on_page):
                    for span in table.spans:
                        # replace all table spans with "table_id" in table_chars array
                        for i in range(span.length):
                            idx = span.offset - page_offset + i
                            if idx >= 0 and idx < page_length:
                                table_chars[idx] = table_id

                # build page text by replacing characters in table spans with table html
                page_text = ""
                added_tables = set()
                for idx, table_id in enumerate(table_chars):
                    if table_id == -1:
                        page_text += form_recognizer_results.content[page_offset + idx]
                    elif table_id not in added_tables:
                        page_text += DocumentAnalysisParser.table_to_html(tables_on_page[table_id])
                        added_tables.add(table_id)

                yield Page(page_num=page_num, offset=offset, text=page_text)
                offset += len(page_text)

    @classmethod
    def table_to_html(cls, table: FormTable):
        table_html = "<table>"
        rows = [
            sorted([cell for cell in table.cells if cell.row_index == i], key=lambda cell: cell.column_index)
            for i in range(table.row_count)
        ]
        for row_cells in rows:
            table_html += "<tr>"
            for cell in row_cells:
                tag = "th" if (cell.kind == "columnHeader" or cell.kind == "rowHeader") else "td"
                cell_spans = ""
                if cell.column_span is not None and cell.column_span > 1:
                    cell_spans += f" colSpan={cell.column_span}"
                if cell.row_span is not None and cell.row_span > 1:
                    cell_spans += f" rowSpan={cell.row_span}"
                table_html += f"<{tag}{cell_spans}>{html.escape(cell.content)}</{tag}>"
            table_html += "</tr>"
        table_html += "</table>"
        return table_html
```
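For completeness, a minimal driver for the patched parser might look like this (a sketch, assuming the body above lives in DocumentAnalysisParser.parse(self, content) as in the repo's pdfparser.py, and that the constructor takes endpoint and credential; the endpoint, key, and file path are placeholders):

```python
# Minimal driver sketch. Assumes the snippet above is the body of
# DocumentAnalysisParser.parse(self, content) in the repo's pdfparser.py,
# and that the constructor takes endpoint/credential (placeholders below).
import asyncio
from azure.core.credentials import AzureKeyCredential

async def main():
    parser = DocumentAnalysisParser(
        endpoint="https://<resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<key>"),
    )
    with open("./data/Benefit_Options.pdf", "rb") as f:
        async for page in parser.parse(f):
            print(page.page_num, page.offset, len(page.text))

asyncio.run(main())
```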