
Big PDF: output file lost (part file missing) after translation

Open hadesli opened this issue 8 months ago • 10 comments

Before you submit

  • [x] I have searched existing issues
  • [x] I spent at least 5 minutes investigating and preparing this report
  • [x] I confirmed this is not caused by a network issue
  • [x] I have fully read and understood the README
  • [x] I am certain that this issue is with BabelDOC itself and can be reproduced through the BabelDOC cli

Environment

- OS: win11 24H2
- Python: 3.10.10
- BabelDOC: 0.3.35

Describe the bug

After about 2 hours of work, BabelDOC reports that a part translation file is missing, and the whole process fails.

Steps to Reproduce

  1. Run BabelDOC on a large PDF file
  2. Wait about 2 hours while it works
  3. See the error

Expected Behavior

A translated PDF output file.

Relevant Log Output or Screenshots

[05/06/25 17:23:26] WARNING  WARNING:babeldoc.document_il.midend.il_translator_llm_only:Tr il_translator_llm_only.py:355
                             anslation result is too long or too short. Input: 7, Output:
                             22
                    WARNING  WARNING:babeldoc.document_il.midend.il_translator_llm_only:Fa il_translator_llm_only.py:382
                             llback to simple translation. paragraph id: i4xMZ
[05/06/25 17:23:29] WARNING  WARNING:babeldoc.document_il.midend.il_translator_llm_only:Tr il_translator_llm_only.py:355
                             anslation result is too long or too short. Input: 1, Output:
                             3
                    WARNING  WARNING:babeldoc.document_il.midend.il_translator_llm_only:Fa il_translator_llm_only.py:382
                             llback to simple translation. paragraph id: aXAYe
[05/06/25 17:23:40] INFO     INFO:babeldoc.document_il.backend.pdf_creater:PDF save with clean=False  pdf_creater.py:744
                             completed successfully
[05/06/25 17:23:41] INFO     INFO:babeldoc.high_level:start merge results                              high_level.py:550
                    INFO     INFO:babeldoc.high_level:Peak memory usage: 1791.92 MB                    high_level.py:369
                    ERROR    ERROR:babeldoc.high_level:translate error: no such file:                  high_level.py:576
                             'C:\Users\lishe\AppData\Local\Temp\tmpyz3rcmed\part_0_output\input.part0.
                             no_watermark.zh-CN.mono.pdf'
                    INFO     INFO:babeldoc.progress_monitor:progress_monitor handle              progress_monitor.py:247
                             translate_error: no such file:
                             'C:\Users\lishe\AppData\Local\Temp\tmpyz3rcmed\part_0_output\input.
                             part0.no_watermark.zh-CN.mono.pdf'
                    ERROR    ERROR:babeldoc.main:Error: no such file:                                        main.py:399
                             'C:\Users\lishe\AppData\Local\Temp\tmpyz3rcmed\part_0_output\input.part0.no_wat
                             ermark.zh-CN.mono.pdf'
translate                                                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸  99/100 2:47:19 0:00:02
Parse PDF and Create Intermediate Representation (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14   2:45:13 0:00:00
DetectScannedFile (1/15)                                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50   0:00:00 0:00:00
Parse Page Layout (15/15)                                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14   2:45:22 0:00:00
Parse Table (15/15)                                      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2     2:44:47 0:00:00
Parse Paragraphs (15/15)                                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14   2:44:48 0:00:00
Parse Formulas and Styles (15/15)                        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14   2:44:45 0:00:00
Translate Paragraphs (15/15)                             ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199/199 2:46:18 0:00:00
Typesetting (15/15)                                      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14   2:39:21 0:00:00
Add Fonts (15/15)                                        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115/115 2:39:19 0:00:00
Generate drawing instructions (15/15)                    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14   2:39:20 0:00:00
Subset font (15/15)                                      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1     0:00:00 0:00:00
Save PDF (15/15)                                         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2     0:00:03 0:00:00
                    INFO     INFO:babeldoc.main:Total tokens: 3618994                                        main.py:405
                    INFO     INFO:babeldoc.translation_config:cleanup temp files:              translation_config.py:246
                             C:\Users\lishe\AppData\Local\Temp\tmpyz3rcmed
                    INFO     INFO:babeldoc.main:Prompt tokens: 2742913                                       main.py:406
                    INFO     INFO:babeldoc.main:Completion tokens: 876081                                    main.py:407
                    INFO     INFO:babeldoc.high_level:Waiting for translation to finish...             high_level.py:323
                    INFO     INFO:babeldoc.document_il.translator.translator:openai translate call      translator.py:92
                             count: 5689
                    INFO     INFO:babeldoc.document_il.translator.translator:openai translate cache     translator.py:95
                             call count: 95

Original PDF File

US_Prospectus.pdf

Additional Context

config-dsv3.txt

hadesli avatar May 06 '25 09:05 hadesli

Have you used any tools to clean up cache folders during operation?

awwaawwa avatar May 06 '25 10:05 awwaawwa

Please check if Windows Storage Sense is turned off and if there is enough space on the C drive.
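A quick way to check the free space on the volume that holds the system temp directory is sketched below. It uses only the Python standard library; `free_space_gib` is a hypothetical helper name, not part of BabelDOC.

```python
import shutil
import tempfile


def free_space_gib(path: str) -> float:
    """Return the free space, in GiB, on the volume that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


# The system temp directory is where the part files in the log above were written.
temp_dir = tempfile.gettempdir()
print(f"temp dir:   {temp_dir}")
print(f"free space: {free_space_gib(temp_dir):.1f} GiB")
```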

Also, please increase max-pages-per-part (currently 50). A value of 200 or more is recommended.
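To illustrate why a larger max-pages-per-part reduces the number of intermediate part files, here is a minimal sketch of splitting a document into page ranges. How BabelDOC actually splits is not shown in this thread; `split_into_parts` and the 1-based inclusive ranges are assumptions for illustration only.

```python
def split_into_parts(total_pages: int, max_pages_per_part: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into consecutive (first, last) ranges.

    Hypothetical helper: each range holds at most max_pages_per_part pages.
    """
    if total_pages < 0 or max_pages_per_part <= 0:
        raise ValueError("total_pages must be >= 0 and max_pages_per_part > 0")
    return [
        (first, min(first + max_pages_per_part - 1, total_pages))
        for first in range(1, total_pages + 1, max_pages_per_part)
    ]


# A 120-page document: three parts at 50 pages per part, one part at 200.
print(split_into_parts(120, 50))
print(split_into_parts(120, 200))
```

Fewer parts means fewer temporary files that have to survive until the merge step.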

awwaawwa avatar May 06 '25 10:05 awwaawwa

I cannot reproduce this locally. My initial guess is that some tool or system feature cleaned up the temporary files. I suggest increasing the number of pages per part, turning off any such cleanup tools or features, and making sure the C drive has enough free space before retrying. The issue has been closed for now. If there are any follow-ups, just comment here and I will respond when I see it.

awwaawwa avatar May 06 '25 10:05 awwaawwa

OK. There is 23 GB free out of 400 GB on the C drive, and I'm not sure whether Windows does any cleanup in the background. I will try to free up some space for the temp files.

BTW, why not put the temp files in the home directory?

hadesli avatar May 06 '25 11:05 hadesli

After the following changes:

  1. set max-pages-per-part = 150
  2. 123 GB free space out of 400 GB on the C drive

I tried to translate 2 big files and got the same error again, so I believe it's a bug :)

I also noticed that the error log reports XXXX.mono.pdf as missing, while my config file sets no-mono = true. I'm not sure whether some logic is missing for the case where a big file needs sharding and mono output is disabled.

[05/06/25 22:30:25] WARNING  WARNING:babeldoc.document_il.midend.il_translator_llm_only:Translation result is too long or too short. Input: 1, Output: 3  il_translator_llm_only.py:355
                    WARNING  WARNING:babeldoc.document_il.midend.il_translator_llm_only:Fallback to simple translation. paragraph id: HEVF3  il_translator_llm_only.py:382
                    WARNING  WARNING:babeldoc.document_il.midend.il_translator_llm_only:Translation result is too long or too short. Input: 1, Output: 0  il_translator_llm_only.py:355
                    WARNING  WARNING:babeldoc.document_il.midend.il_translator_llm_only:Fallback to simple translation. paragraph id: GYAkA  il_translator_llm_only.py:382
[05/06/25 22:30:49] INFO     INFO:babeldoc.document_il.backend.pdf_creater:PDF save with clean=False completed successfully  pdf_creater.py:744
                    INFO     INFO:babeldoc.high_level:start merge results  high_level.py:550
                    INFO     INFO:babeldoc.high_level:Peak memory usage: 2125.41 MB  high_level.py:369
                    ERROR    ERROR:babeldoc.high_level:translate error: no such file: 'C:\Users\lishe\AppData\Local\Temp\tmpxj47u6h1\part_0_output\input.part0.no_watermark.zh-CN.mono.pdf'  high_level.py:576
                    INFO     INFO:babeldoc.progress_monitor:progress_monitor handle translate_error: no such file: 'C:\Users\lishe\AppData\Local\Temp\tmpxj47u6h1\part_0_output\input.part0.no_watermark.zh-CN.mono.pdf'  progress_monitor.py:247
                    ERROR    ERROR:babeldoc.main:Error: no such file: 'C:\Users\lishe\AppData\Local\Temp\tmpxj47u6h1\part_0_output\input.part0.no_watermark.zh-CN.mono.pdf'  main.py:399
translate                                              ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸  99/100    0:31:32 0:00:05
Parse PDF and Create Intermediate Representation (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114   0:26:38 0:00:00
DetectScannedFile (1/5)                                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150/150   0:00:01 0:00:00
Parse Page Layout (5/5)                                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114   0:28:28 0:00:00
Parse Table (5/5)                                      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5/5       0:25:52 0:00:00
Parse Paragraphs (5/5)                                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114   0:25:56 0:00:00
Parse Formulas and Styles (5/5)                        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114   0:25:38 0:00:00
Translate Paragraphs (5/5)                             ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1298/1298 0:27:29 0:00:00
Typesetting (5/5)                                      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114   0:26:30 0:00:00
Add Fonts (5/5)                                        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 736/736   0:26:17 0:00:00
Generate drawing instructions (5/5)                    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114   0:26:19 0:00:00
Subset font (5/5)                                      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1       0:00:00 0:00:00
Save PDF (5/5)                                         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2       0:00:05 0:00:00
                    INFO     INFO:babeldoc.translation_config:cleanup temp files: C:\Users\lishe\AppData\Local\Temp\tmpxj47u6h1  translation_config.py:246
                    INFO     INFO:babeldoc.main:Total tokens: 198457  main.py:405
                    INFO     INFO:babeldoc.main:Prompt tokens: 142956  main.py:406
                    INFO     INFO:babeldoc.main:Completion tokens: 55501  main.py:407
                    INFO     INFO:babeldoc.high_level:Waiting for translation to finish...  high_level.py:323
                    INFO     INFO:babeldoc.document_il.translator.translator:openai translate call count: 5689  translator.py:92
                    INFO     INFO:babeldoc.document_il.translator.translator:openai translate cache call count: 5449  translator.py:95

hadesli avatar May 06 '25 14:05 hadesli

Another issue: it shows an error message when the file extension is uppercase (.PDF), hahaha

hadesli avatar May 06 '25 15:05 hadesli

In some special cases, BabelDOC cannot clean up its own temporary files. If they are placed in the system's temporary directory, the system can clean them up instead.

I will add an option later to allow you to put temporary files in the user's home directory.

Also, the issue with uppercase PDF suffix will be fixed later.
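For reference, a case-insensitive extension check can be sketched with `pathlib`. This is only an illustration of the idea; `is_pdf_path` is a hypothetical helper and not the actual BabelDOC fix.

```python
from pathlib import Path


def is_pdf_path(path: str) -> bool:
    """Return True if the file extension is .pdf, ignoring letter case."""
    return Path(path).suffix.lower() == ".pdf"


print(is_pdf_path("US_Prospectus.PDF"))
print(is_pdf_path("US_Prospectus.pdf"))
print(is_pdf_path("config-dsv3.txt"))
```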

awwaawwa avatar May 06 '25 15:05 awwaawwa

parser.add_argument(
    "--working-dir",
    default=None,
    help="Working directory for translation. If not set, use temp directory.",
)

This parameter has been added, but uppercase .PDF has not been handled yet; will do it later.
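A minimal sketch of how such an option could be resolved at runtime, assuming the fallback described in the help text. Only the `--working-dir` argument comes from the snippet above; `build_parser` and `resolve_working_dir` are hypothetical names, and BabelDOC's real handling may differ.

```python
import argparse
import contextlib
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--working-dir",
        default=None,
        help="Working directory for translation. If not set, use temp directory.",
    )
    return parser


@contextlib.contextmanager
def resolve_working_dir(working_dir):
    """Yield the directory to work in (hypothetical helper).

    With no explicit directory, a temp directory is created and removed on exit.
    An explicit directory is created if needed and left in place afterwards.
    """
    if working_dir is None:
        with tempfile.TemporaryDirectory() as tmp:
            yield Path(tmp)
    else:
        path = Path(working_dir).expanduser()
        path.mkdir(parents=True, exist_ok=True)
        yield path


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args([])
with resolve_working_dir(args.working_dir) as work_dir:
    print(f"working in: {work_dir}")
```

Keeping an explicit working directory after the run makes it possible to inspect the per-part outputs when a merge fails.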

awwaawwa avatar May 15 '25 11:05 awwaawwa

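Bro, I figured out what the problem is.

When merging the split parts, the program first tries to merge xxxx.mono.pdf. The problem is that my config generates no mono output at all, only dual. As soon as the program fails to find the mono file, it aborts...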

[05/15/25 21:50:17] INFO     INFO:babeldoc.document_il.backend.pdf_creater:PDF save with clean=False completed successfully  pdf_creater.py:748
                    INFO     INFO:babeldoc.high_level:start merge results  high_level.py:569
                    INFO     INFO:babeldoc.high_level:Peak memory usage: 2084.80 MB  high_level.py:369
                    ERROR    ERROR:babeldoc.high_level:translate error: no such file: 'New-distribution-paradigms-for-railway-interlocking\part_0_output\input.part0.no_watermark.zh-CN.mono.pdf'  high_level.py:595
                    INFO     INFO:babeldoc.progress_monitor:progress_monitor handle translate_error: no such file: 'New-distribution-paradigms-for-railway-interlocking\part_0_output\input.part0.no_watermark.zh-CN.mono.pdf'  progress_monitor.py:247
                    ERROR    ERROR:babeldoc.main:Error: no such file: 'New-distribution-paradigms-for-railway-interlocking\part_0_output\input.part0.no_watermark.zh-CN.mono.pdf'  main.py:426
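The failure mode can be sketched as follows: the merge step should only look for the output kinds that were actually requested. This is a hypothetical illustration, not BabelDOC's code. The `mono` file name pattern is taken from the log; the `dual` file name and the `no_dual` parameter are assumptions made by analogy.

```python
def part_files_to_merge(part_count: int, no_mono: bool, no_dual: bool) -> dict[str, list[str]]:
    """List the per-part files the merge step should expect, by output kind.

    Hypothetical helper: kinds that were disabled are skipped entirely, so a
    missing mono file is never looked up when mono output is turned off.
    """
    kinds = []
    if not no_mono:
        kinds.append("mono")
    if not no_dual:
        kinds.append("dual")
    return {
        kind: [
            # Assumed naming, modeled on the mono path in the error message.
            f"part_{i}_output/input.part{i}.no_watermark.zh-CN.{kind}.pdf"
            for i in range(part_count)
        ]
        for kind in kinds
    }


# With no-mono = true, only the dual files are expected.
for kind, files in part_files_to_merge(2, no_mono=True, no_dual=False).items():
    print(kind, files)
```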

hadesli avatar May 15 '25 13:05 hadesli


I see. I'll fix it in a follow-up.

awwaawwa avatar May 15 '25 15:05 awwaawwa

If I remember correctly, this has been fixed in a recent version.

awwaawwa avatar Jun 13 '25 03:06 awwaawwa