azure-search-openai-demo-csharp
Ensure chunked PDF documents are never bigger than 500 tokens, support CJK and fix bug with tiny documents
Purpose
- Better support for CJK documents that use ideographic and full-width Unicode punctuation marks.
- Implement a recursive character splitting algorithm to ensure that all sections are < 500 tokens (the limit for Azure AI Search for this model).
- Also fixes #304
Both changes are based on improvements made to the Python sample
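For readers unfamiliar with the approach, here is a minimal sketch of recursive splitting on sentence boundaries, written in Python for brevity. It is not the code in this PR or in the Python sample. `estimate_tokens` is a hypothetical stand-in for a real tokenizer, `split_recursive` is an illustrative name, and the punctuation set is an assumed subset of what a full implementation would handle.

```python
from typing import Callable, List

# Sentence endings, including ideographic and full-width forms common in CJK text.
# Assumption: an illustrative subset, not the exact set used by this PR.
SENTENCE_ENDINGS = set(".!?。！？．")

MAX_TOKENS = 500  # section limit stated in this PR


def estimate_tokens(text: str) -> int:
    """Hypothetical tokenizer stand-in: one token per non-ASCII character,
    plus one token per four ASCII characters (rounded up)."""
    wide = sum(1 for ch in text if ord(ch) > 127)
    narrow = len(text) - wide
    return wide + (narrow + 3) // 4


def split_recursive(
    text: str,
    max_tokens: int = MAX_TOKENS,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> List[str]:
    """Split text in half at the sentence ending nearest the midpoint,
    recursing until every piece is at most max_tokens."""
    # Small inputs (including tiny documents) are returned unchanged.
    if count_tokens(text) <= max_tokens or len(text) <= 1:
        return [text]

    mid = len(text) // 2
    split_at = mid  # fallback: hard split at the midpoint
    found = False
    # Search outward from the midpoint for the closest sentence ending.
    for offset in range(mid + 1):
        for i in (mid + offset, mid - offset):
            # Exclude the last character so both halves stay non-empty.
            if 0 <= i < len(text) - 1 and text[i] in SENTENCE_ENDINGS:
                split_at = i + 1  # keep the punctuation with the left half
                found = True
                break
        if found:
            break

    left, right = text[:split_at], text[split_at:]
    return (
        split_recursive(left, max_tokens, count_tokens)
        + split_recursive(right, max_tokens, count_tokens)
    )
```

Because each split leaves both halves strictly shorter than the input, the recursion always terminates, and concatenating the returned pieces reproduces the original text. A production version would plug in the real tokenizer for the embedding model and would likely add overlap between sections.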
Does this introduce a breaking change?
[ ] Yes
[x] No
Pull Request Type
What kind of change does this Pull Request introduce?
[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:
How to Test
- Get the code
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install
- Test the code
What to Check
Verify that the following are valid
- ...
Other Information
The commit https://github.com/Azure-Samples/azure-search-openai-demo-csharp/pull/303/commits/e734ef135bccf89c5bcb11ab155646869f124a88 should fail; I added a test there that demonstrates #304.
The changes in this PR are being introduced in SK as part of microsoft/semantic-kernel#5489
Once that's merged, this PR will be updated to reflect those changes.
cc: @tonybaloney
@tonybaloney Looks like the SK PR has been merged. Are you still planning to update this PR to reflect that change?
Yes, I'll wait for a new release of SK so I can test the changes.