azure-search-openai-demo-csharp Ensure chunked PDF documents are never bigger than 500 tokens, support CJK and fix bug with tiny documents

Purpose

Better support for CJK documents by recognizing ideographic and full-width Unicode punctuation marks.
Implement a recursive character-splitting algorithm so that every section stays under 500 tokens (the limit for Azure AI Search with this model).
Also fixes #304

Both changes are based on improvements made to the Python sample
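
For readers unfamiliar with the approach, here is a minimal Python sketch of recursive splitting at sentence boundaries, including ideographic and full-width punctuation. It is an illustration only, not the C# code in this PR or the Python sample's implementation. The names `split_recursive`, `estimate_tokens` and `SENTENCE_ENDINGS` are hypothetical, the token counter is a crude stand-in for a real tokenizer, and the overlap between sections is left out.

```python
from typing import Callable, List

# Sentence-ending marks: ASCII plus ideographic and full-width forms
# (U+3002, U+FF01, U+FF1F, U+FF0E). Illustrative set, not exhaustive.
SENTENCE_ENDINGS = ".!?" + "。！？．"


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption): one token per
    non-ASCII character plus one per whitespace-delimited ASCII word."""
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    ascii_only = "".join(ch if ord(ch) <= 127 else " " for ch in text)
    return non_ascii + len(ascii_only.split())


def split_recursive(
    text: str,
    max_tokens: int = 500,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> List[str]:
    """Split text into pieces that each fit within max_tokens."""
    # Base case: the text already fits, or it cannot be split further.
    if count_tokens(text) <= max_tokens or len(text) < 2:
        return [text]

    mid = len(text) // 2
    split_at = None
    # Search outward from the middle for the nearest sentence ending,
    # keeping both halves non-empty.
    for offset in range(mid):
        for pos in (mid + offset, mid - offset):
            if pos < len(text) - 1 and text[pos] in SENTENCE_ENDINGS:
                split_at = pos + 1  # keep the punctuation with the left half
                break
        if split_at is not None:
            break
    if split_at is None:
        split_at = mid  # no sentence boundary found: fall back to the midpoint

    left, right = text[:split_at], text[split_at:]
    return (
        split_recursive(left, max_tokens, count_tokens)
        + split_recursive(right, max_tokens, count_tokens)
    )
```

Under these assumptions, a short input comes back as a single section, and concatenating the returned sections reproduces the original text. A production version would pass the real tokenizer for the embedding model as `count_tokens`.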

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

Get the code

git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install

Test the code

What to Check

Verify that the following are valid

...

Other Information

Mar 14 '24 02:03 tonybaloney

The commit https://github.com/Azure-Samples/azure-search-openai-demo-csharp/pull/303/commits/e734ef135bccf89c5bcb11ab155646869f124a88 should fail; I added a test to prove #304.

Mar 14 '24 05:03 tonybaloney

The changes in this PR are being introduced in SK as part of microsoft/semantic-kernel#5489

Once that's merged, this PR will be updated to reflect those changes.

cc: @tonybaloney

Mar 15 '24 20:03 luisquintanilla

@tonybaloney Looks like the SK PR has been merged. Are you still planning to update this PR to reflect that change?

Mar 25 '24 21:03 LittleLittleCloud

@tonybaloney Looks like the SK PR has been merged. Are you still planning to update this PR to reflect that change?

Yes, I'll wait for a new release of SK so I can test the changes

Mar 25 '24 21:03 tonybaloney