azure-search-openai-demo
Small documents are not being indexed
Please provide us with the following information:
This issue is for a: (mark with an x)
- [X] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
init, azd up, azd deploy
Any log messages given by the failure
No error, empty search index entry
Expected/desired behavior
The search index should be filled from the given PDF file
OS and Version?
Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
Created from a Codespace.
azd version?
run azd version and copy paste here.
1.5.1
Versions
Mention any other details that might be useful
Thanks! We'll be in touch soon.
Hi @RezaMatin, we need some more details to understand this issue
- When you say small documents, what do you mean? Are these short pdfs, text files, something else?
- Are you using the existing sample data as well or have you totally replaced it? Does your app work with the sample data? Thanks
Hi @mattgotteiner
Thanks for your reply. Everything works fine with the included data. I have removed the sample data and tested with PDFs of different sizes. A PDF document with just a few lines does not appear in the search index.
Thanks
Hi @mattgotteiner and @RezaMatin ,
I've pushed a fix for this bug: https://github.com/Azure-Samples/azure-search-openai-demo/pull/1155
Best regards, Marek
@mattgotteiner I have a related issue that cannot be solved by https://github.com/Azure-Samples/azure-search-openai-demo/pull/1155
I have several PDF documents that are long but have little text on each page (converted from PPTX). Because of the current chunking, the first 4-5 pages are indexed as the first page. I'm thinking about something like this as a solution, but it won't work well together with 1155 for documents with several pages that are, all together, shorter than MAX_SECTION_LENGTH:
while start + self.SECTION_OVERLAP < length:
    current_page_number = find_page(start)
    current_page_content = page_map[current_page_number][2]
    current_page_len = len(current_page_content)
    is_short_page = current_page_len < self.MAX_SECTION_LENGTH
    last_word = -1
    end = start + current_page_len if is_short_page else start + self.MAX_SECTION_LENGTH
    if is_short_page:
        # if a page is shorter than MAX_SECTION_LENGTH, the chunk size is set to the page size
        start += current_page_len
        yield (current_page_content, current_page_number)
        continue
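
For anyone who wants to try the per-page idea outside the repo, here is a minimal standalone sketch of it. It is not the repository's splitter: `split_by_page`, `max_section_length`, and the `(page_number, offset, text)` layout of `page_map` are assumptions for this example, and sentence-boundary handling and section overlap are left out on purpose.

```python
from typing import Iterator, List, Tuple

# Assumed layout of one page_map entry: (page_number, offset_in_document, page_text)
Page = Tuple[int, int, str]


def split_by_page(
    page_map: List[Page], max_section_length: int = 1000
) -> Iterator[Tuple[str, int]]:
    """Yield (chunk_text, page_number) pairs without ever merging two pages."""
    for page_number, _offset, text in page_map:
        if not text.strip():
            # Nothing extractable on this page, so there is nothing to index.
            continue
        if len(text) < max_section_length:
            # Short page: the chunk size is set to the page size.
            yield text, page_number
            continue
        # Long page: fall back to fixed-size windows within the page.
        for start in range(0, len(text), max_section_length):
            yield text[start:start + max_section_length], page_number


# Hypothetical slide deck: two short pages followed by one long page.
slides = [
    (0, 0, "Title slide"),
    (1, 11, "Agenda: intro, demo"),
    (2, 30, "x" * 2500),
]
chunks = list(split_by_page(slides))
```

With this approach every chunk keeps the page number it came from, so slide-style PDFs no longer collapse several pages into the first one. The trade-off is the one mentioned above: a document whose pages are all short ends up as many very small chunks, so some minimum-size merging would probably still be needed.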