Large PDF translation fails at merge: part output file missing
Before you submit
- [x] I have searched existing issues
- [x] I spent at least 5 minutes investigating and preparing this report
- [x] I confirmed this is not caused by a network issue
- [x] I have fully read and understood the README
- [x] I am certain that this issue is with BabelDOC itself and can be reproduced through the BabelDOC cli
Environment
- OS: win11 24H2
- Python: 3.10.10
- BabelDOC: 0.3.35
Describe the bug
After about two hours of processing, BabelDOC reports that a part's translated output file is missing, and the whole run fails.
Steps to Reproduce
- Run BabelDOC on a large PDF file
- Wait about 2 hours while it processes
- See the error
Expected Behavior
A translated PDF output file.
Relevant Log Output or Screenshots
[05/06/25 17:23:26] WARNING WARNING:babeldoc.document_il.midend.il_translator_llm_only:Tr il_translator_llm_only.py:355
anslation result is too long or too short. Input: 7, Output:
22
WARNING WARNING:babeldoc.document_il.midend.il_translator_llm_only:Fa il_translator_llm_only.py:382
llback to simple translation. paragraph id: i4xMZ
[05/06/25 17:23:29] WARNING WARNING:babeldoc.document_il.midend.il_translator_llm_only:Tr il_translator_llm_only.py:355
anslation result is too long or too short. Input: 1, Output:
3
WARNING WARNING:babeldoc.document_il.midend.il_translator_llm_only:Fa il_translator_llm_only.py:382
llback to simple translation. paragraph id: aXAYe
[05/06/25 17:23:40] INFO INFO:babeldoc.document_il.backend.pdf_creater:PDF save with clean=False pdf_creater.py:744
completed successfully
[05/06/25 17:23:41] INFO INFO:babeldoc.high_level:start merge results high_level.py:550
INFO INFO:babeldoc.high_level:Peak memory usage: 1791.92 MB high_level.py:369
ERROR ERROR:babeldoc.high_level:translate error: no such file: high_level.py:576
'C:\Users\lishe\AppData\Local\Temp\tmpyz3rcmed\part_0_output\input.part0.
no_watermark.zh-CN.mono.pdf'
INFO INFO:babeldoc.progress_monitor:progress_monitor handle progress_monitor.py:247
translate_error: no such file:
'C:\Users\lishe\AppData\Local\Temp\tmpyz3rcmed\part_0_output\input.
part0.no_watermark.zh-CN.mono.pdf'
ERROR ERROR:babeldoc.main:Error: no such file: main.py:399
'C:\Users\lishe\AppData\Local\Temp\tmpyz3rcmed\part_0_output\input.part0.no_wat
ermark.zh-CN.mono.pdf'
translate ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 99/100 2:47:19 0:00:02
Parse PDF and Create Intermediate Representation (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2:45:13 0:00:00
DetectScannedFile (1/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 0:00:00 0:00:00
Parse Page Layout (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2:45:22 0:00:00
Parse Table (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2 2:44:47 0:00:00
Parse Paragraphs (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2:44:48 0:00:00
Parse Formulas and Styles (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2:44:45 0:00:00
Translate Paragraphs (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199/199 2:46:18 0:00:00
Typesetting (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2:39:21 0:00:00
Add Fonts (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115/115 2:39:19 0:00:00
Generate drawing instructions (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2:39:20 0:00:00
Subset font (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:00 0:00:00
Save PDF (15/15) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2 0:00:03 0:00:00
INFO INFO:babeldoc.main:Total tokens: 3618994 main.py:405
INFO INFO:babeldoc.translation_config:cleanup temp files: translation_config.py:246
C:\Users\lishe\AppData\Local\Temp\tmpyz3rcmed
INFO INFO:babeldoc.main:Prompt tokens: 2742913 main.py:406
INFO INFO:babeldoc.main:Completion tokens: 876081 main.py:407
INFO INFO:babeldoc.high_level:Waiting for translation to finish... high_level.py:323
INFO INFO:babeldoc.document_il.translator.translator:openai translate call translator.py:92
count: 5689
INFO INFO:babeldoc.document_il.translator.translator:openai translate cache translator.py:95
call count: 95
Original PDF File
Additional Context
Have you used any tools to clean up cache folders during operation?
Please check if Windows Storage Sense is turned off and if there is enough space on the C drive.
Also, max-pages-per-part is currently 50; please increase it. A value of 200 or more is recommended.
Cannot reproduce locally. My initial guess is that some cleanup tool or feature removed the temporary files. I suggest increasing the number of pages per part, turning off any such tools or features, and making sure the C drive has enough free space before retrying. Closing the issue for now. If there is any follow-up, just comment here and I will respond when I see it.
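To make the effect of max-pages-per-part concrete, here is a minimal sketch of the arithmetic. It assumes the document is split by simple ceiling division, which is not confirmed anywhere in this thread; `count_parts` is a hypothetical helper and the 750-page total is an assumed figure chosen only because it is consistent with the 15 parts shown in the log above.

```python
import math


def count_parts(total_pages: int, max_pages_per_part: int) -> int:
    """Hypothetical helper: parts needed if pages are split by ceiling division."""
    if total_pages <= 0 or max_pages_per_part <= 0:
        raise ValueError("page counts must be positive")
    return math.ceil(total_pages / max_pages_per_part)


total_pages = 750  # assumed page count, not taken from the report
for limit in (50, 150, 200):
    print(f"max-pages-per-part={limit}: {count_parts(total_pages, limit)} part(s)")
```

Fewer parts means fewer intermediate files that have to survive until the merge step.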
OK. There is 23 GB of free space on the 400 GB C drive, and I'm not sure whether Windows runs a cleanup job in the background. I will try to free up some space for temp files.
BTW, why not put the temp files in the home directory?
After the following changes: 1. set max-pages-per-part = 150; 2. 123 GB of free space on the 400 GB C drive,
I tried translating 2 large files and got the same error again, so I believe it's a bug :)
I also noticed that the error log reports XXXX.mono.pdf as missing, while my config file sets no-mono = true. I'm not sure whether some logic is missing when a large file (one that has to be split into parts) is processed with mono output disabled.
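For anyone checking the disk-space theory, a minimal sketch using only the Python standard library reports free space on the drive that holds the system temp directory. `free_space_gib` is a hypothetical helper name, and this only measures free space; it does not detect whether Windows is cleaning up files in the background.

```python
import shutil
import tempfile


def free_space_gib(path: str) -> float:
    """Free space, in GiB, on the drive that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


temp_dir = tempfile.gettempdir()
print(f"{temp_dir}: {free_space_gib(temp_dir):.1f} GiB free")
```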
[05/06/25 22:30:25] WARNING WARNING:babeldoc.document_il.midend.il_translator_llm_only:Tr il_translator_llm_only.py:355
anslation result is too long or too short. Input: 1, Output: 3
WARNING WARNING:babeldoc.document_il.midend.il_translator_llm_only:Fa il_translator_llm_only.py:382
llback to simple translation. paragraph id: HEVF3
WARNING WARNING:babeldoc.document_il.midend.il_translator_llm_only:Tr il_translator_llm_only.py:355
anslation result is too long or too short. Input: 1, Output: 0
WARNING WARNING:babeldoc.document_il.midend.il_translator_llm_only:Fa il_translator_llm_only.py:382
llback to simple translation. paragraph id: GYAkA
[05/06/25 22:30:49] INFO INFO:babeldoc.document_il.backend.pdf_creater:PDF save with clean=False pdf_creater.py:744
completed successfully
INFO INFO:babeldoc.high_level:start merge results high_level.py:550
INFO INFO:babeldoc.high_level:Peak memory usage: 2125.41 MB high_level.py:369
ERROR ERROR:babeldoc.high_level:translate error: no such file: high_level.py:576
'C:\Users\lishe\AppData\Local\Temp\tmpxj47u6h1\part_0_output\input.part0.
no_watermark.zh-CN.mono.pdf'
INFO INFO:babeldoc.progress_monitor:progress_monitor handle progress_monitor.py:247
translate_error: no such file:
'C:\Users\lishe\AppData\Local\Temp\tmpxj47u6h1\part_0_output\input.
part0.no_watermark.zh-CN.mono.pdf'
ERROR ERROR:babeldoc.main:Error: no such file: main.py:399
'C:\Users\lishe\AppData\Local\Temp\tmpxj47u6h1\part_0_output\input.part0.no_wat
ermark.zh-CN.mono.pdf'
translate ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 99/100 0:31:32 0:00:05
Parse PDF and Create Intermediate Representation (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114 0:26:38 0:00:00
DetectScannedFile (1/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150/150 0:00:01 0:00:00
Parse Page Layout (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114 0:28:28 0:00:00
Parse Table (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5/5 0:25:52 0:00:00
Parse Paragraphs (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114 0:25:56 0:00:00
Parse Formulas and Styles (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114 0:25:38 0:00:00
Translate Paragraphs (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1298/1298 0:27:29 0:00:00
Typesetting (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114 0:26:30 0:00:00
Add Fonts (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 736/736 0:26:17 0:00:00
Generate drawing instructions (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114/114 0:26:19 0:00:00
Subset font (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:00 0:00:00
Save PDF (5/5) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2 0:00:05 0:00:00
INFO INFO:babeldoc.translation_config:cleanup temp files: translation_config.py:246
C:\Users\lishe\AppData\Local\Temp\tmpxj47u6h1
INFO INFO:babeldoc.main:Total tokens: 198457 main.py:405
INFO INFO:babeldoc.main:Prompt tokens: 142956 main.py:406
INFO INFO:babeldoc.main:Completion tokens: 55501 main.py:407
INFO INFO:babeldoc.high_level:Waiting for translation to finish... high_level.py:323
INFO INFO:babeldoc.document_il.translator.translator:openai translate call translator.py:92
count: 5689
INFO INFO:babeldoc.document_il.translator.translator:openai translate cache translator.py:95
call count: 5449
Another issue: BabelDOC reports an error when the file suffix is uppercase (.PDF), hahaha.
In some special cases, BabelDOC cannot clean up its temporary files. If placed in the system's temporary file path, the system can help clean them up.
I will add an option later to allow you to put temporary files in the user's home directory.
Also, the issue with uppercase PDF suffix will be fixed later.
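One common way to handle the uppercase suffix is to compare the extension case-insensitively. The sketch below is only an illustration with a hypothetical `is_pdf` helper; it is not BabelDOC's actual input validation.

```python
from pathlib import Path


def is_pdf(path: str) -> bool:
    """Hypothetical check: accept .pdf, .PDF, .Pdf and other case variants."""
    return Path(path).suffix.lower() == ".pdf"


for name in ("paper.pdf", "paper.PDF", "notes.txt"):
    print(name, is_pdf(name))
```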
parser.add_argument(
    "--working-dir",
    default=None,
    help="Working directory for translation. If not set, use temp directory.",
)
This parameter has been added, but uppercase .PDF has not been handled yet; will do it later.
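As a rough illustration of how such an option could be consumed, the sketch below falls back to a fresh temp directory when --working-dir is not given, matching the help text above. `resolve_working_dir` is a hypothetical helper; how BabelDOC actually resolves the directory is not shown in this thread.

```python
import argparse
import tempfile
from pathlib import Path
from typing import Optional


def resolve_working_dir(working_dir: Optional[str]) -> Path:
    """Hypothetical helper: use the given directory, else a new temp directory."""
    if working_dir is None:
        return Path(tempfile.mkdtemp())
    path = Path(working_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)  # create it if it does not exist yet
    return path


parser = argparse.ArgumentParser()
parser.add_argument(
    "--working-dir",
    default=None,
    help="Working directory for translation. If not set, use temp directory.",
)

args = parser.parse_args([])  # no flag given, so the temp-directory fallback applies
print(resolve_working_dir(args.working_dir))
```

With a directory under the user's home, the intermediate part files stay out of the system temp path, where other tools may delete them.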
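Bro, I figured out what the problem is.
When merging the split part files, you try to merge xxxx.mono.pdf first. The problem is that my settings don't generate mono at all, only dual. The program can't find the mono file and simply aborts...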
[05/15/25 21:50:17] INFO INFO:babeldoc.document_il.backend.pdf_creater:PDF save with clean=False pdf_creater.py:748
completed successfully
INFO INFO:babeldoc.high_level:start merge results high_level.py:569
INFO INFO:babeldoc.high_level:Peak memory usage: 2084.80 MB high_level.py:369
ERROR ERROR:babeldoc.high_level:translate error: no such file: high_level.py:595
'New-distribution-paradigms-for-railway-interlocking\part_0_output\input.
part0.no_watermark.zh-CN.mono.pdf'
INFO INFO:babeldoc.progress_monitor:progress_monitor handle progress_monitor.py:247
translate_error: no such file:
'New-distribution-paradigms-for-railway-interlocking\part_0_output
input.part0.no_watermark.zh-CN.mono.pdf'
ERROR ERROR:babeldoc.main:Error: no such file: main.py:426
'New-distribution-paradigms-for-railway-interlocking\part_0_output\input.part0.
no_watermark.zh-CN.mono.pdf'
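The behavior described in this comment suggests that the merge step should only look for the output kinds that were actually enabled. The sketch below illustrates that idea under stated assumptions: `outputs_to_merge` and the `no_dual` flag are hypothetical, the `*.dual.pdf` naming is assumed by analogy with the `*.mono.pdf` name in the log, and none of this is BabelDOC's real merge code.

```python
import tempfile
from pathlib import Path


def outputs_to_merge(part_dirs, no_mono: bool, no_dual: bool):
    """Hypothetical helper: collect per-part PDFs only for enabled output kinds."""
    kinds = []
    if not no_mono:
        kinds.append("mono")
    if not no_dual:
        kinds.append("dual")

    selected = {}
    for kind in kinds:
        files = []
        for part_dir in part_dirs:
            matches = sorted(Path(part_dir).glob(f"*.{kind}.pdf"))
            if not matches:
                # A missing file is an error only for a kind that was requested.
                raise FileNotFoundError(f"no {kind} output in {part_dir}")
            files.extend(matches)
        selected[kind] = files
    return selected


with tempfile.TemporaryDirectory() as root:
    part_dir = Path(root) / "part_0_output"
    part_dir.mkdir()
    # Assumed file name for the dual output of part 0.
    (part_dir / "input.part0.no_watermark.zh-CN.dual.pdf").touch()
    result = outputs_to_merge([part_dir], no_mono=True, no_dual=False)
    print(sorted(result))  # mono is skipped because it was disabled
```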
I see. Will fix it later.
If I remember correctly, this was fixed in a recent version.