
TTS sentence splitter, migrate to native web API Intl.Segmenter

Open danielweck opened this issue 1 year ago • 3 comments

https://www.npmjs.com/package/sentence-splitter

https://github.com/textlint-rule/sentence-splitter/issues/28#issuecomment-2110632032

Edge cases to test: poetry, and quotation marks and punctuation that make sentence boundaries hard to determine. Example: Alice in Wonderland (there are several editions; I think this one is useful for testing: https://www.gutenberg.org/ebooks/28885 ).

danielweck avatar May 14 '24 16:05 danielweck

A good test for large sections of text (which would normally result in far-too-long speech utterances, and therefore benefit from sentence detection) is Georgia: https://idpf.github.io/epub3-samples/30/samples.html#georgia
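
For illustration, a minimal sketch of how a long run of text could be cut into sentence-sized utterances with `Intl.Segmenter`. The helper name `splitIntoSentences`, the default locale, and the sample string are assumptions made for this example; none of this is taken from the navigator code.

```typescript
// Hypothetical helper (not from r2-navigator-js): splits text into
// sentences using the native Intl.Segmenter API.
function splitIntoSentences(text: string, locale: string = "en"): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  const sentences: string[] = [];
  for (const { segment } of segmenter.segment(text)) {
    // Each segment keeps its trailing whitespace, so trim it and
    // drop segments that contain only whitespace.
    const trimmed = segment.trim();
    if (trimmed.length > 0) {
      sentences.push(trimmed);
    }
  }
  return sentences;
}

// Made-up sample text, only to show the call.
const sampleText =
  "Alice was getting very tired. She had nothing to do! What is the use of a book?";
console.log(splitIntoSentences(sampleText));
```

Each returned string could then be handed to its own speech utterance instead of passing the whole section at once. Trimming loses the original character offsets; if offsets are needed to map sentences back to DOM text nodes, the `index` property of each segment is available.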

danielweck avatar May 14 '24 16:05 danielweck

Navigator code reference: https://github.com/readium/r2-navigator-js/blob/91482324fa2313c4536c48693eb091464a483071/src/electron/renderer/common/dom-text-utils.ts#L8

https://github.com/readium/r2-navigator-js/blob/91482324fa2313c4536c48693eb091464a483071/src/electron/renderer/common/dom-text-utils.ts#L935-L978

danielweck avatar May 14 '24 16:05 danielweck

I'd like to resolve this. We must verify that the sentence-breaking algorithm still works with Japanese. Thorium's current third-party library works with a broader set of locales, but IIRC it is designed and maintained by a Japanese developer, so it works quite well with Japanese scripts.
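
A rough sketch of the kind of check this implies, assuming a runtime with full ICU data. The function name `segmentSentences` and the sample sentence are hypothetical and chosen only for this example; a real test would compare the output against the current library on actual book content.

```typescript
// Hypothetical check (names are assumptions): segment Japanese text with
// Intl.Segmenter and keep the raw segments so nothing is lost.
function segmentSentences(text: string, locale: string): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Two short sentences, each ending with the ideographic full stop.
const japaneseSample = "吾輩は猫である。名前はまだ無い。";
const japaneseSegments = segmentSentences(japaneseSample, "ja");

// Segmentation should be lossless: joining the segments restores the input.
console.log(japaneseSegments.join("") === japaneseSample);
console.log(japaneseSegments);
```

The lossless property matters for TTS highlighting, since segment offsets have to line up with the original text.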

danielweck avatar Feb 19 '25 09:02 danielweck