thorium-reader
TTS sentence splitter, migrate to native web API Intl.Segmenter
https://www.npmjs.com/package/sentence-splitter
https://github.com/textlint-rule/sentence-splitter/issues/28#issuecomment-2110632032
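For reference, a minimal sketch of what sentence splitting with `Intl.Segmenter` could look like. The helper name `splitSentences` and the default locale are hypothetical and not taken from the navigator code. Type-checking assumes a TypeScript `lib` setting that includes the `Intl.Segmenter` typings (ES2022 Intl).

```typescript
// Hypothetical helper for illustration only (not the r2-navigator-js implementation).
// Splits text into sentences using the native Intl.Segmenter API.
function splitSentences(text: string, locale: string = "en"): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  const sentences: string[] = [];
  for (const { segment } of segmenter.segment(text)) {
    // Segments keep their trailing whitespace; trim it and skip empty segments.
    const trimmed = segment.trim();
    if (trimmed.length > 0) {
      sentences.push(trimmed);
    }
  }
  return sentences;
}

const sample = splitSentences("Hello world. How are you? Fine!");
console.log(sample);
```

Trimming is a choice made for this sketch. If offsets into the original text matter (for example for highlighting), the untrimmed segments and their `index` values would need to be kept instead.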
Edge cases to test: poetry, quotation marks and punctuation that make it hard to determine boundaries. Example: Alice in Wonderland (there are several editions, I think this one is useful for testing: https://www.gutenberg.org/ebooks/28885 )
A good test for large sections of text (which would normally result in far-too-long speech utterances, and therefore benefit from sentence detection) is Georgia: https://idpf.github.io/epub3-samples/30/samples.html#georgia
Navigator code reference: https://github.com/readium/r2-navigator-js/blob/91482324fa2313c4536c48693eb091464a483071/src/electron/renderer/common/dom-text-utils.ts#L8
https://github.com/readium/r2-navigator-js/blob/91482324fa2313c4536c48693eb091464a483071/src/electron/renderer/common/dom-text-utils.ts#L935-L978
I'd like to resolve this. Must verify that the sentence-breaking algorithm still works with Japanese (Thorium's current third-party lib works with a broader set of locales but is designed and maintained by a Japanese dev IIRC, so it works quite well with the JA scripts).
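A rough sanity check for Japanese might look like the sketch below. The function name `sentenceRanges` and the `SentenceRange` shape are hypothetical. It keeps character offsets on the assumption that TTS highlighting needs to map sentences back to the source text, and it is a quick probe, not a replacement for testing against real publications.

```typescript
// Hypothetical shape for illustration: a sentence plus its UTF-16 offsets in the source text.
interface SentenceRange {
  start: number;
  end: number;
  text: string;
}

// Hypothetical helper: returns sentence boundaries without trimming,
// so that offsets stay aligned with the original string.
function sentenceRanges(text: string, locale: string): SentenceRange[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  const ranges: SentenceRange[] = [];
  for (const { segment, index } of segmenter.segment(text)) {
    ranges.push({ start: index, end: index + segment.length, text: segment });
  }
  return ranges;
}

// Two sentences ending with the ideographic full stop, with no whitespace between them.
const ja = sentenceRanges("吾輩は猫である。名前はまだ無い。", "ja");
console.log(ja);
```

Concatenating the `text` fields should reproduce the input exactly, which is an easy invariant to assert across the test books mentioned above.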