
Refactor split_text function to handle Chinese text more effectively

Open Glowin opened this issue 1 year ago • 4 comments

  • Added regular expression to split Chinese text into sentences based on punctuation marks.
  • Ensured that each chunk's length is as close as possible to max_chars without splitting sentences abruptly.
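The two points above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `split_text_sketch`, the punctuation set in `SENTENCE_RE`, and the greedy packing strategy are all assumptions made for the example.

```python
import re

# Assumed set of sentence-ending marks (full-width and ASCII); the pattern
# actually used in the PR is not shown in this thread.
SENTENCE_RE = re.compile(r"[^。！？；!?;]*[。！？；!?;]+|[^。！？；!?;]+")


def split_text_sketch(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept intact as its own
    oversized chunk rather than being cut in the middle.
    """
    chunks: list[str] = []
    current = ""
    for sentence in SENTENCE_RE.findall(text):
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_text_sketch("你好。世界！再见？", 6):
        print(chunk)
```

Because every character of the input ends up in exactly one match, joining the chunks reproduces the original text. Closing quotation marks that follow a sentence-ending mark are not handled specially in this sketch.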

Glowin avatar May 31 '24 08:05 Glowin

Thanks! Will take a look into this ASAP.

p0n1 avatar Jun 28 '24 08:06 p0n1

@p0n1 will leave it to you, have no idea about Chinese punctuation

Bryksin avatar Aug 24 '24 22:08 Bryksin

> @p0n1 will leave it to you, have no idea about Chinese punctuation

Got it. I will try it this week.

p0n1 avatar Aug 26 '24 03:08 p0n1

Good improvements to the Chinese text splitting logic. The regex-based sentence splitting is more appropriate for Chinese language processing. There might be rare edge cases (e.g., very long sentences without punctuation) that could produce unexpected results. I'll test this implementation locally for a while to check for such cases.
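One possible way to guard against the long-sentence edge case mentioned above is a hard fallback that slices any oversized sentence by character count. The helper below is hypothetical (the name `hard_wrap` and the fixed-width slicing are assumptions, not part of this PR), and it deliberately ignores word or phrase boundaries.

```python
def hard_wrap(sentence: str, max_chars: int) -> list[str]:
    """Cut a sentence into consecutive pieces of at most max_chars characters.

    Intended only as a last resort for text with no usable punctuation.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]


if __name__ == "__main__":
    print(hard_wrap("没有任何标点的很长的一句话", 5))
```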

p0n1 avatar Sep 05 '24 08:09 p0n1

I found that this PR doesn't segment mixed Chinese-English text well.

We now have better sentence segmentation for most languages thanks to https://github.com/p0n1/epub_to_audiobook/pull/131, so I'm closing this PR. I appreciate your contribution.

p0n1 avatar Apr 03 '25 14:04 p0n1