azure-search-openai-demo

Prep docs assumes certain punctuation

Open tonybaloney opened this issue 1 year ago • 6 comments

This demo app does work with languages other than English; however, the prepdocs script makes some assumptions about the input characters.

For example, Japanese doesn't always punctuate sentences with the ASCII period; the full-width symbol 。 is more common. There are also different quote marks, like 「 」, and angle brackets ⟨ ⟩ are used. The comma is also a different Unicode character (、).

We could use this CJK punctuation chart as a starting point and read the encoding of the input file.
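As a rough sketch of that idea, the splitter's break characters could be extended with the common CJK punctuation. The variable names below (SENTENCE_ENDINGS, WORD_BREAKS) are illustrative, not necessarily what prepdocs uses today:

```python
# Illustrative sketch: extend the splitter's break characters with CJK punctuation.
# The list names here are assumptions about how prepdocs organizes its settings.

# ASCII defaults
SENTENCE_ENDINGS = [".", "!", "?"]
WORD_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]

# CJK additions: full-width period/exclamation/question mark, ideographic and
# full-width commas, corner brackets, and angle brackets
CJK_SENTENCE_ENDINGS = ["。", "！", "？"]
CJK_WORD_BREAKS = ["、", "，", "；", "：", "（", "）", "「", "」", "『", "』", "〈", "〉", "《", "》"]

SENTENCE_ENDINGS += CJK_SENTENCE_ENDINGS
WORD_BREAKS += CJK_WORD_BREAKS
```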

I have no knowledge of other languages; Wikipedia suggests that Hebrew and Arabic have some special punctuation, but I don't know how exhaustive this list is: https://en.wikipedia.org/wiki/Category:Punctuation_of_specific_languages

tonybaloney avatar Oct 26 '23 07:10 tonybaloney

cc @ks6088ts for insights on whether they've seen this with their Japanese users

pamelafox avatar Oct 26 '23 13:10 pamelafox

@tonybaloney @pamelafox Great suggestion :) It is true that the script doesn't always split Japanese sentences correctly. Currently, TextSplitter defines its word_breaks settings internally. IMO, it would be better to inject those settings through the constructor, just as LangChain::CharacterTextSplitter does, and to add some descriptions about customization for the respective languages.
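A minimal sketch of what constructor injection could look like; the class name matches the existing TextSplitter, but the parameter names, defaults, and signature are assumptions for illustration:

```python
class TextSplitter:
    """Splits text into sections, with break characters injectable per language."""

    DEFAULT_SENTENCE_ENDINGS = [".", "!", "?"]
    DEFAULT_WORD_BREAKS = [",", ";", ":", " "]

    def __init__(self, sentence_endings=None, word_breaks=None, max_section_length=1000):
        # Fall back to the English/ASCII defaults when nothing is injected.
        self.sentence_endings = sentence_endings or self.DEFAULT_SENTENCE_ENDINGS
        self.word_breaks = word_breaks or self.DEFAULT_WORD_BREAKS
        self.max_section_length = max_section_length


# Hypothetical usage for Japanese documents:
ja_splitter = TextSplitter(
    sentence_endings=["。", "！", "？", ".", "!", "?"],
    word_breaks=["、", "，", "「", "」", ",", ";", " "],
)
```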

ks6088ts avatar Nov 03 '23 07:11 ks6088ts

@tonybaloney @pamelafox

Regarding "add some descriptions about customization for respective languages":

My suggestion is to provide information in the README about using NLP-related OSS such as spaCy and NLTK. In fact, LangChain provides text splitter interfaces for them.
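For reference, a hedged sketch of those LangChain splitter interfaces; import paths differ across LangChain versions (langchain.text_splitter vs. langchain_text_splitters), and the chosen pipeline/model names are examples:

```python
# Sketch: NLP-backed splitters instead of hand-rolled character rules.
from langchain.text_splitter import NLTKTextSplitter, SpacyTextSplitter

document_text = "Sentence one. Sentence two. Sentence three."
japanese_text = "一文目です。二文目です。三文目です。"

# NLTK's sentence tokenizer (requires nltk.download("punkt") beforehand)
nltk_splitter = NLTKTextSplitter(chunk_size=1000)
chunks = nltk_splitter.split_text(document_text)

# spaCy with a Japanese pipeline (requires the ja_core_news_sm model installed)
spacy_splitter = SpacyTextSplitter(pipeline="ja_core_news_sm", chunk_size=1000)
ja_chunks = spacy_splitter.split_text(japanese_text)
```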

ks6088ts avatar Nov 12 '23 14:11 ks6088ts

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.

github-actions[bot] avatar Jan 24 '24 01:01 github-actions[bot]

More work beyond prepdocs is needed to make this app multilingual:

https://learn.microsoft.com/en-us/azure/search/search-language-support

mattgotteiner avatar Feb 07 '24 23:02 mattgotteiner

We also need to adjust our splitter to be token-based. Currently, if you split a Chinese document at 1000 characters, you can't even fit three chunks in a single ChatCompletion call.
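For illustration, a rough sketch of token-based chunking with tiktoken; the encoding name and chunk size are examples, not what the app necessarily uses:

```python
# Sketch: measure chunk length in tokens rather than characters, so CJK text
# (often one or more tokens per character) doesn't blow past the context window.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # example encoding


def split_by_tokens(text: str, max_tokens: int = 500) -> list[str]:
    """Greedily split text into chunks of at most max_tokens tokens."""
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


chunks = split_by_tokens("これはテスト用の長い日本語の文章です。" * 200)
print(len(chunks), "chunks of at most 500 tokens each")
```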

pamelafox avatar Feb 07 '24 23:02 pamelafox