
Some letters of an English word at the end of the line are separated and moved to the next line.

Open THD0813 opened this issue 10 months ago • 12 comments

[image] [image] Attachment: 2024_compress.pdf

Some words at the end of lines, such as "room," "number," "corresponding," "black," "the," and "please," are split, with some of their letters moved to the next line.

THD0813 avatar Jan 14 '25 01:01 THD0813

We will introduce the KP algorithm later to fix the typesetting of Western-language text.

Byaidu avatar Jan 14 '25 02:01 Byaidu
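
For readers unfamiliar with the term: if "KP" refers to the Knuth-Plass line-breaking algorithm (an assumption; the thread does not spell it out), the core idea is to choose all break points of a paragraph together so that the total "badness" is minimal, and to break only between words. The sketch below is a heavily simplified illustration of that idea, not pdf2zh's code: it assumes every character has width 1, uses squared trailing space as the cost, and ignores hyphenation and stretchable glue. The function name `break_lines_optimal` is made up for this example.

```python
def break_lines_optimal(words, max_width):
    """Pick break points that minimise the sum of squared trailing space.

    Simplifications: each character has width 1, the last line is free,
    and a word longer than max_width is placed alone on its own line.
    """
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of laying out words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: index of the first word of the next line
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        width = -1  # cancels the space counted before the first word
        for j in range(i, n):
            width += 1 + len(words[j])
            if width > max_width and j > i:
                break  # words[i..j] no longer fit on one line
            slack = max(max_width - width, 0)
            cost = 0 if j == n - 1 else slack ** 2
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1
            if width > max_width:
                break  # a single overlong word: keep it alone on the line
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines


if __name__ == "__main__":
    for line in break_lines_optimal("aaa bb cc ddddd".split(), 6):
        print(line)
```

Because breaks are only ever placed between entries of `words`, no word can have its letters pushed onto the next line, which is the symptom reported in this issue.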

This is due to a flaw in the current line-breaking algorithm, and the new backend also temporarily has this issue. We have noticed this problem, but it will take some time to resolve. Please be patient. Additionally, for Chinese-to-English translation scenarios, we are also experimenting with dynamic line spacing and dynamic font sizes to optimize the results.

awwaawwa avatar Jan 14 '25 02:01 awwaawwa
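
To make the reported symptom concrete, here is a minimal sketch contrasting two wrapping strategies. It assumes, purely for illustration, that the flaw amounts to placing breaks by available width without regard to word boundaries; the thread does not describe the actual implementation, and both function names are hypothetical. Every character is treated as having width 1.

```python
def wrap_by_characters(text, max_width):
    """Break strictly every max_width characters, even in the middle of a word."""
    return [text[i:i + max_width] for i in range(0, len(text), max_width)]


def wrap_by_words(text, max_width):
    """Greedy wrap that only breaks at spaces; an overlong word gets its own line."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    sample = "please check the room number"
    print(wrap_by_characters(sample, 10))  # splits "check" and "room" across lines
    print(wrap_by_words(sample, 10))       # keeps every word intact
```

The word-based version never separates letters of a word, at the price of more ragged line ends.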

Another option is to shrink the font size of the translated text so that it doesn't overflow, like this: [image] [image]. What do you think?

xxnuo avatar Jan 15 '25 10:01 xxnuo
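
The shrink-to-fit idea suggested above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the average character width (0.5 em), the line height (1.2 em), the step size, and the function name `fit_font_size` are all made up for the example. Only the standard-library `textwrap.wrap` is used for wrapping.

```python
import textwrap


def fit_font_size(text, box_width, box_height, start_size=12.0, min_size=6.0,
                  step=0.5, char_width_em=0.5, line_height_em=1.2):
    """Return (font_size, lines) for the largest tried size that fits the box.

    Returns None if the text does not fit even at min_size.
    """
    size = start_size
    while size >= min_size:
        # Assumed average glyph width: char_width_em * font size.
        chars_per_line = max(int(box_width / (size * char_width_em)), 1)
        lines = textwrap.wrap(text, width=chars_per_line, break_long_words=False)
        fits_width = all(len(line) <= chars_per_line for line in lines)
        fits_height = len(lines) * size * line_height_em <= box_height
        if fits_width and fits_height:
            return size, lines
        size -= step
    return None


if __name__ == "__main__":
    print(fit_font_size("aa bb cc dd", box_width=30, box_height=15, start_size=10.0))
```

With `break_long_words=False`, words are never split; a size is accepted only when every wrapped line fits the box width and the stacked lines fit its height.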

New backend has been implemented

awwaawwa avatar Jan 15 '25 10:01 awwaawwa

> New backend has been implemented

Will it be merged into this repository?

xxnuo avatar Jan 15 '25 10:01 xxnuo

Our preliminary estimate is that we may attempt an experimental integration into pdf2zh tomorrow, but the new backend currently has numerous bugs and is not yet usable.

awwaawwa avatar Jan 15 '25 10:01 awwaawwa

The new backend is an independent project, currently designed to serve as a document translation backend. It rewrites almost all of pdf2zh's parsing, translation, and typesetting components, which involved a significant amount of work. Therefore, the related implementation code will not be merged into this repository. However, it was designed from the outset to serve as a backend for pdf2zh. For example, its translator is kept compatible with pdf2zh's translator API, which makes it relatively easy to integrate into pdf2zh.

awwaawwa avatar Jan 15 '25 10:01 awwaawwa

The new backend is not intended to be directly used by end-users, so it will not support many translators (currently supports Google, Bing, OpenAI). The supported translators are mainly for debugging convenience. The new backend will also not directly provide a web UI. We hope that end-users will directly use pdf2zh, and additional translator support will be implemented within pdf2zh itself.

awwaawwa avatar Jan 15 '25 10:01 awwaawwa

I see that a new backend has been implemented to address the issue of letter separation in English (#462). Is this fix already available in any version of pdf2zh? If not, is there a way to test or manually integrate it into the current version?

andygeek avatar Feb 27 '25 05:02 andygeek

I was wrong in my previous comment. The new backend has only implemented dynamic scaling for now. The KP algorithm has not been implemented yet. This fix will be available in pdf2zh 2.0.

BabelDOC can be used directly, with a simple CLI. https://github.com/funstory-ai/BabelDOC

awwaawwa avatar Feb 27 '25 05:02 awwaawwa

This also happens in Spanish, and quite often: words are cut.

Image

pdf2zh 2.0 does a bit better, but unfortunately it doesn't justify text (line ends aligned to both the left and right margins).

Image

igarca avatar Sep 28 '25 01:09 igarca

@igarca, please open a new issue in the 2.0 repo. :) Issues in this repo are no longer active.

hellofinch avatar Sep 28 '25 09:09 hellofinch