Refactor split_text function to handle Chinese text more effectively
- Added a regular expression to split Chinese text into sentences based on punctuation marks.
- Ensured that each chunk's length stays as close as possible to max_chars without splitting sentences mid-way (see the sketch below).
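
For reference, a minimal sketch of the approach, assuming Python (the project's language). The name `split_text` and the `max_chars` parameter come from the description above; the exact punctuation set and the chunk-accumulation loop are illustrative assumptions, not the code in this PR:

```python
import re

def split_text(text: str, max_chars: int) -> list[str]:
    """Split Chinese text into chunks of at most max_chars characters,
    breaking only at sentence boundaries where possible."""
    # Split after each sentence-ending punctuation mark, keeping the
    # mark attached to its sentence (punctuation set is illustrative).
    sentences = [s for s in re.split(r'(?<=[。！？；])', text) if s]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Flush the current chunk before it would exceed max_chars.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Note that a single sentence longer than max_chars would still yield an oversized chunk here; that is the kind of no-punctuation edge case mentioned in the review below and would need a fallback split.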
Thanks! Will look into this ASAP.
@p0n1 will leave it to you, have no idea about Chinese punctuation
Got it. I will try it this week.
Good improvements to the Chinese text splitting logic. The regex-based sentence splitting is more appropriate for Chinese language processing. There might be rare edge cases (e.g., very long sentences without punctuation) that could produce unexpected results. I'll test this implementation locally for a while to check for such cases.
I found that this PR doesn't segment mixed Chinese-English text well.
We now have better sentence segmentation for most languages via https://github.com/p0n1/epub_to_audiobook/pull/131. Appreciate your contribution; closing this PR.