azure-search-openai-demo-csharp

Ensure chunked PDF documents are never bigger than 500 tokens, support CJK and fix bug with tiny documents

Open tonybaloney opened this issue 1 year ago • 4 comments

Purpose

  1. Better support for CJK documents, which use ideographic and full-width Unicode punctuation marks.
  2. Implement a recursive character-splitting algorithm to make sure that all sections are < 500 tokens (the limit for Azure AI Search for this model).
  3. Also fixes #304

Both changes are based on improvements made to the Python sample
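For readers unfamiliar with the approach, the idea in items 1 and 2 can be sketched roughly as follows. This is a minimal illustration in Python (the language of the sample this PR says it is based on), not the code from this PR or from the Python sample. `count_tokens` is a hypothetical stand-in for a real tokenizer, `split_recursive` and `SENTENCE_ENDINGS` are illustrative names, and treating the limit as "at most 500" is an assumption.

```python
import re

MAX_TOKENS = 500  # limit quoted in the PR description (assumed inclusive here)

# Sentence-ending marks: ASCII plus ideographic and full-width forms.
SENTENCE_ENDINGS = ".!?\u3002\uff01\uff1f"  # . ! ? 。 ！ ？


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer.

    Counts each run of ASCII word characters as one token and every other
    non-space character (including each CJK character) as one token.
    """
    return len(re.findall(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]", text))


def split_recursive(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split text in two, recursively, until every piece fits the limit."""
    # Small inputs are returned unchanged as a single section.
    if count_tokens(text) <= max_tokens or len(text) <= 1:
        return [text]

    mid = len(text) // 2
    # Candidate cut points sit just after a sentence-ending mark.
    # The last character is excluded so both halves are non-empty.
    candidates = [
        i + 1 for i, ch in enumerate(text[:-1]) if ch in SENTENCE_ENDINGS
    ]
    # Prefer the sentence boundary closest to the midpoint;
    # fall back to a plain midpoint cut when there is none.
    cut = min(candidates, key=lambda i: abs(i - mid)) if candidates else mid

    return (split_recursive(text[:cut], max_tokens)
            + split_recursive(text[cut:], max_tokens))
```

Including `。`, `！` and `？` in the boundary set lets CJK text be cut at sentence ends rather than at arbitrary midpoints. A production version would use the tokenizer that matches the embedding model instead of the stand-in counter.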

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • Get the code
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install
  • Test the code

What to Check

Verify that the following are valid

  • ...

Other Information

tonybaloney avatar Mar 14 '24 02:03 tonybaloney

The commit https://github.com/Azure-Samples/azure-search-openai-demo-csharp/pull/303/commits/e734ef135bccf89c5bcb11ab155646869f124a88 should fail; I added a test to prove #304.

tonybaloney avatar Mar 14 '24 05:03 tonybaloney

The changes in this PR are being introduced in SK as part of microsoft/semantic-kernel#5489

Once that's merged, this PR will be updated to reflect those changes.

cc: @tonybaloney

luisquintanilla avatar Mar 15 '24 20:03 luisquintanilla

@tonybaloney Looks like the SK PR has been merged. Are you still planning to update this PR to reflect that change?

LittleLittleCloud avatar Mar 25 '24 21:03 LittleLittleCloud

@tonybaloney Looks like the SK PR has been merged. Are you still planning to update this PR to reflect that change?

Yes, I'll wait for a new release of SK so I can test the changes

tonybaloney avatar Mar 25 '24 21:03 tonybaloney