Problem of alignment of English

Open chenyanming opened this issue 3 months ago • 10 comments

For EPUB + M4B, --lang en does not work, but Japanese is fine. May I know why? I ran subplz sync -d path --lang en or subplz sync -d path --language en

This is for English Audible audiobooks + EPUB books. Is the command correct? I know that this project is designed for Japanese, but I also saw the GitHub page mention that --lang may work for languages other than Japanese. I have had no success, though.

chenyanming avatar Sep 14 '25 14:09 chenyanming

I'm unsure right now. You could try using a txt file instead and see if that changes anything, as some of the EPUB parsing is Japanese-specific, if I remember right.

kanjieater avatar Sep 14 '25 14:09 kanjieater

Thank you for the quick response. I tested multiple English audiobooks with the --language en option, and it works great! It seems --lang en behaves differently from --language en. With --lang en, it still prints Detected Language: Japanese, but with --language en, I can see it transcribing English.

chenyanming avatar Sep 14 '25 23:09 chenyanming

Interesting. I'll try to look at why and update the docs accordingly. Thanks for posting the solution. This still worked for EPUBs, I take it?

kanjieater avatar Sep 15 '25 01:09 kanjieater

EPUB is working great. I haven't encountered sync issues since using the --language en option. I sometimes switched to the turbo model, but the default tiny should be fine.

chenyanming avatar Sep 15 '25 01:09 chenyanming

That said, it sometimes crashed during the last sync step (or when using watch + scanner) and closed the terminal automatically, even though there seemed to be plenty of memory left (16 GB memory, 4060). Rerunning it may sometimes work (I'm not very sure), but it does crash...

chenyanming avatar Sep 15 '25 01:09 chenyanming

I see, it's probably a memory leak. I just saw memory usage jump beyond 16 GB, and then the terminal closed.

chenyanming avatar Sep 15 '25 01:09 chenyanming

There shouldn't be a memory leak. It does use that much memory for the alignment of multi-hour audiobooks. At some point we'd like to have a more memory-efficient algorithm for alignment, but the current one gives very accurate results, so we've kept it and warn users both when they run the program and in the README.

UPDATE: The README was updated, but I haven't removed the "lang" flag yet, so I'll keep this open until that happens.

kanjieater avatar Sep 15 '25 12:09 kanjieater

I also ran into the memory issue. My PC has 96 GB of physical memory, but it still crashed. In Task Manager we can see that even when physical memory is not full, the "committed" memory is already full, meaning the program probably reserved a lot of memory and crashed before it could use it. I will provide more information on where the OOM error occurred.

instr3 avatar Nov 12 '25 09:11 instr3

Here is the memory error I captured:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "E:\miniconda3\envs\SubPlz\Scripts\subplz.exe\__main__.py", line 7, in <module>
  File "E:\miniconda3\envs\subplz\Lib\site-packages\subplz\__main__.py", line 5, in main
    run.execute_on_inputs()
  File "E:\miniconda3\envs\subplz\Lib\site-packages\subplz\run.py", line 44, in execute_on_inputs
    sync(
  File "E:\miniconda3\envs\subplz\Lib\site-packages\subplz\sync.py", line 146, in sync
    do_batch(
  File "E:\miniconda3\envs\subplz\Lib\site-packages\subplz\sync.py", line 95, in do_batch
    alignment, references = align.align(
                            ^^^^^^^^^^^^
  File "E:\miniconda3\envs\subplz\Lib\site-packages\ats\align.py", line 175, in align
    return inner(text), [] #references
           ^^^^^^^^^^^
  File "E:\miniconda3\envs\subplz\Lib\site-packages\ats\align.py", line 167, in inner
    alignment = aligner.align(text_joined, transcript_joined)[0]
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\miniconda3\envs\subplz\Lib\site-packages\Bio\Align\__init__.py", line 3969, in align
    score, paths = super().align(sA, sB, strand)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError
Press any key to continue . . .

instr3 avatar Nov 12 '25 09:11 instr3

I tried to debug the code a little. I think the reason is simply that in the call

alignment = aligner.align(text_joined, transcript_joined)[0]

the lengths len(text_joined) and len(transcript_joined) are too large (e.g., 1,000,000+). For English, character-level matching may not be ideal, and word-level matching might work better.
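
To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It assumes the aligner keeps a full dynamic-programming table with at least one byte per cell, which I have not verified against the Bio.Align internals. The helper names (dp_table_cells, words, gib) and the figure of about 6 characters per English word are my own assumptions for illustration and are not part of subplz or ats.

```python
import re


def dp_table_cells(len_a: int, len_b: int) -> int:
    """Number of cells in a full (len_a + 1) x (len_b + 1) alignment table."""
    return (len_a + 1) * (len_b + 1)


def words(text: str) -> list[str]:
    """Hypothetical word tokenizer: lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


# Character-level: two sequences of 1,000,000 characters, the scale reported above.
char_cells = dp_table_cells(1_000_000, 1_000_000)

# Word-level: ASSUMPTION of roughly 6 characters per word (including the space).
word_cells = dp_table_cells(1_000_000 // 6, 1_000_000 // 6)

print(f"char-level: {char_cells:.2e} cells, about {gib(char_cells):,.0f} GiB at 1 byte/cell")
print(f"word-level: {word_cells:.2e} cells, about {gib(word_cells):,.0f} GiB at 1 byte/cell")

# Tokenizing before aligning is what shrinks the sequence lengths.
print(words("Call me Ishmael. Some years ago, never mind how long precisely"))
```

Because the table grows with the product of the two lengths, shortening each sequence by a factor of k cuts the cell count by about k squared. Even so, a word-level table at this scale would still be large under the one-byte-per-cell assumption, so splitting the book into chunks would probably still be needed.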

instr3 avatar Nov 12 '25 10:11 instr3