Small documents are not being indexed

Open RezaMatin opened this issue 1 year ago • 4 comments

Please provide us with the following information:

This issue is for a: (mark with an x)

- [X] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

init, azd up, azd deploy

Any log messages given by the failure

No error; the search index entry is empty.

Expected/desired behavior

The search index should be filled from the given PDF file.

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

Created from a Codespace.

azd version?

run azd version and copy paste here.

1.5.1

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

RezaMatin • Jan 18 '24 12:01

Hi @RezaMatin, we need some more details to understand this issue:

  1. When you say small documents, what do you mean? Are these short PDFs, text files, something else?
  2. Are you using the existing sample data as well, or have you totally replaced it? Does your app work with the sample data? Thanks.

mattgotteiner • Jan 18 '24 16:01

Hi @mattgotteiner
Thanks for your reply. Everything works fine with the included data. I have removed that data and tested with PDFs of different sizes. A PDF document with just a few lines does not appear in the search index. Thanks

RezaMatin • Jan 18 '24 22:01

Hi @mattgotteiner and @RezaMatin ,

I've pushed a fix for this bug: https://github.com/Azure-Samples/azure-search-openai-demo/pull/1155

Best regards, Marek

marekjakimiuk1 • Jan 19 '24 10:01

@mattgotteiner I have a related issue that cannot be solved by https://github.com/Azure-Samples/azure-search-openai-demo/pull/1155

I have several PDF documents that are long but have little text on each page (converted from pptx). Because of the current chunking, the first 4-5 pages are indexed as the first page. I'm thinking about something like this as a solution, but it won't work well together with 1155 for documents with several pages that are altogether shorter than MAX_SECTION_LENGTH:

    while start + self.SECTION_OVERLAP < length:
        current_page_number = find_page(start)
        current_page_content = page_map[current_page_number][2]
        current_page_len = len(current_page_content)
        is_short_page = current_page_len < self.MAX_SECTION_LENGTH

        last_word = -1
        end = start + current_page_len if is_short_page else start + self.MAX_SECTION_LENGTH

        if is_short_page:
            # if a page is shorter than MAX_SECTION_LENGTH, the chunk size is set to the page size
            start += current_page_len
            yield (current_page_content, current_page_number)
            continue
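
To make the idea easier to try outside the repo, here is a rough standalone sketch of per-page chunking. It is not the splitter from the demo: `split_pages`, the default values for the two constants, and the `(page_number, offset, text)` layout of `page_map` are assumptions made for illustration, and unlike the real splitter it never merges text across page boundaries or looks for sentence breaks.

```python
from typing import Iterator, List, Tuple

# Assumed values for illustration only; not taken from the repo.
MAX_SECTION_LENGTH = 1000
SECTION_OVERLAP = 100

# Assumed layout: one (page_number, offset, text) tuple per page.
PageMap = List[Tuple[int, int, str]]


def split_pages(
    page_map: PageMap,
    max_len: int = MAX_SECTION_LENGTH,
    overlap: int = SECTION_OVERLAP,
) -> Iterator[Tuple[str, int]]:
    """Yield (chunk_text, page_number) pairs, never mixing pages in a chunk."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")

    for page_number, _offset, text in page_map:
        if not text.strip():
            continue  # nothing to index on a blank page

        if len(text) <= max_len:
            # Short page: the whole page becomes a single chunk.
            yield text, page_number
            continue

        # Long page: fixed-size windows that overlap by `overlap` characters.
        start = 0
        while start < len(text):
            end = min(start + max_len, len(text))
            yield text[start:end], page_number
            if end == len(text):
                break
            start = end - overlap
```

With this shape every short slide-like page keeps its own page number in the index, at the cost of producing many small chunks for documents whose pages are all short, which is the trade-off against 1155 mentioned above.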

elhele • Feb 20 '24 18:02