
Cumulative translation

[Open] hleft opened this issue 1 year ago · 18 comments

Wait until a given number of tokens has accumulated before starting the translation:

--accumulated_num num

The strange part: I don't know how to tell which part of the cumulative translation result corresponds to which original paragraph, so I use sep = "\n\n\n\n\n". With sep = "\n\n", the results are often translated onto the same line.

A possible bug: even with "\n\n\n\n\n" the model may still merge paragraphs onto the same line, and then the translation appears in the wrong place. This only affects the last num characters; after that it returns to normal.
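The scheme described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: paragraphs are joined with a long separator before sending, and the reply is split on the same separator to map results back to the source paragraphs. A short sep like "\n\n" is often collapsed by the model, which is why five newlines are used.

```python
# Sketch of the accumulation scheme: join source paragraphs with a
# long separator, then split the model's reply on the same separator.
SEP = "\n\n\n\n\n"

def join_for_translation(paragraphs):
    """Concatenate paragraphs into one request body."""
    return SEP.join(paragraphs)

def split_translation(reply, expected):
    """Split the reply back into paragraphs; fail loudly on mismatch."""
    parts = reply.split(SEP)
    if len(parts) != expected:
        # paragraph counts no longer line up -> retry or handle manually
        raise ValueError(f"expected {expected} paragraphs, got {len(parts)}")
    return parts
```

When the model collapses the separator, `split_translation` raises instead of silently shifting every subsequent paragraph, which matches the retry behavior discussed later in this thread.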

A nice side effect is that the model sees more context, so the prompt is more effective.

I had tried various prompts asking it not to translate people's names, to no avail; with context it works. (It works most of the time — in a few cases it still translates a name, but at least it translates it consistently to the same name.)

name2

left:

time OPENAI_API_SYS_MSG="You are an assistant who translates computer technology books but don't translate people's names, there is a blank line after each of your translation results" python3 "make_book.py" --book_name "$filepath" --openai_key "${openai_apikey_book}" --language "zh-hans" --accumulated_num 1500

right:

time python3 "make_book.py" --book_name "$filepath" --openai_key "${openai_apikey_book}" --language "zh-hans"

hleft avatar Mar 11 '23 13:03 hleft

Will take a look at this PR later.

yihong0618 avatar Mar 12 '23 06:03 yihong0618

I can take over the function part when I have time, that's OK.

yihong0618 avatar Mar 12 '23 06:03 yihong0618

@yihong0618 OK. I'm trying to find a better prompt so that the translation has the same number of paragraphs as the original, but there are still often problems.

hleft avatar Mar 12 '23 07:03 hleft

@hleft FYI, I want to keep the original prompt as the default because it costs the fewest tokens (cc @ConanChou). Most users don't care about getting the best possible quality; they just want something that helps them read. Users can DIY their own prompt, but we shouldn't change the default unless we find one that is better for both translation quality and token cost.

yihong0618 avatar Mar 12 '23 07:03 yihong0618

@yihong0618 I mean only modifying the default prompt when --accumulated_num is enabled, because without that change you get an error, not just a bad translation (the translated result appears in the wrong place).

hleft avatar Mar 12 '23 07:03 hleft

yep thanks~

yihong0618 avatar Mar 12 '23 07:03 yihong0618

Still testing. The main finding so far is that just modifying the prompt and retrying is not enough; some special cases must be handled manually, such as <sup> tags and "Source: xxx" links.

hleft avatar Mar 12 '23 10:03 hleft

A useful modification is to accumulate tokens instead of characters. For English to Simplified Chinese, if the max limit is 4096, I suggest --accumulated_num 1600.
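Accumulating by tokens rather than characters could look like the sketch below. The batching logic is illustrative (not the PR's code), and the token counter is pluggable — the real tool would use a proper tokenizer such as tiktoken; a whitespace word count stands in here so the sketch has no external dependency.

```python
# Sketch: group paragraphs into batches that stay under a token budget.
# count_tokens is pluggable; a real implementation would use e.g.
# tiktoken's encoder, here a naive whitespace split stands in.
def batch_by_tokens(paragraphs, budget, count_tokens=lambda s: len(s.split())):
    batch, used = [], 0
    for p in paragraphs:
        n = count_tokens(p)
        # close the current batch if adding this paragraph would exceed
        # the budget (a single oversized paragraph still gets its own batch)
        if batch and used + n > budget:
            yield batch
            batch, used = [], 0
        batch.append(p)
        used += n
    if batch:
        yield batch
```

With a 4096-token model limit, a budget of 1600 input tokens leaves headroom for the system message, the user prompt, and a translated output that may be longer than the input.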

hleft avatar Mar 13 '23 07:03 hleft

It works fine for me now. Using retry=15 to translate a 300-page epub, there were only three ordering errors, with problem sizes 2, 1, 1. The problem size means how far the translation appears from where it should appear: 1 means not much impact, and the size-2 problem was caused by the API — more retries could help.

hleft avatar Mar 14 '23 03:03 hleft

This PR has the following effects even when --accumulated_num is not used:

It writes the result file every time a file is translated, instead of waiting until after Ctrl-C. I think this makes more sense, as I often want to see the result while translating instead of waiting for the interrupt. (It collects the parts that don't need to be translated and writes them after translation, so the frequent writing doesn't hurt performance.)

The output additionally shows the translation time and the tokens consumed.
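The incremental-output behavior could be sketched like this. All names here are illustrative, not the PR's actual code; the real tool writes an epub, while this sketch abstracts the output behind a `write` callback.

```python
# Sketch: write the partially translated result after every chapter,
# so progress is visible without interrupting the run with Ctrl-C.
def translate_book(chapters, translate, write):
    done = []
    for chapter in chapters:
        done.append(translate(chapter))
        # flush everything translated so far; cheap relative to API calls
        write("\n".join(done))
```

Rewriting the accumulated result each time keeps the on-disk file valid at every point, at the cost of redundant writes — negligible next to the per-chapter API latency.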

I want to merge first and clean up / refactor later. What do you think? @yihong0618

hleft avatar Mar 14 '23 10:03 hleft

For me: I don't really want to break the original rules. What I'm thinking is to keep the old way and add the new ways as args.

And I'm a little busy with work... maybe I'll take a look tomorrow.

yihong0618 avatar Mar 14 '23 10:03 yihong0618

And about the token part, we have #106 now.

yihong0618 avatar Mar 14 '23 10:03 yihong0618

All function names should follow PEP 8: https://peps.python.org/pep-0008/

e.g. isFigure -> is_figure, and the same for the others.

yihong0618 avatar Mar 14 '23 10:03 yihong0618

@zstone12 do you have time help me to take a look?

yihong0618 avatar Mar 14 '23 11:03 yihong0618

@yihong0618

I did the renaming and restored the way the books are written. Now the only difference when not using --accumulated_num is that it outputs the token count and time.

I looked at https://github.com/yihong0618/bilingual_book_maker/pull/106, and there seems to be a problem with its logic.

OpenAI's token limit covers input + output, while https://github.com/yihong0618/bilingual_book_maker/pull/106 seems to count input only.

If you end up in this branch:

except Exception:
    if completion["choices"][0]["finish_reason"] != "length":

then the maximum length has been reached (a finish_reason of "length" means the output was truncated).

Because it is not possible to know before sending whether the 4096-token limit will be reached, I suggest --accumulated_num 1600: the translated content may come to around 2200 tokens, and together with the system message and user prompt tokens that almost reaches the limit.
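The truncation check discussed above amounts to inspecting `finish_reason`. This is a minimal sketch assuming the (legacy) OpenAI chat-completion response shape; the function name is illustrative.

```python
# Sketch: detect a truncated completion via finish_reason.
# "length" means the model stopped because input + output tokens
# hit the model's maximum context size.
def is_truncated(completion):
    return completion["choices"][0]["finish_reason"] == "length"
```

Since truncation can only be detected after the response arrives, the input budget (--accumulated_num) has to be conservative enough that input + expected output rarely hits the limit.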

hleft avatar Mar 14 '23 11:03 hleft

> @zstone12 do you have time help me to take a look?

@yihong0618 I am also a bit busy with work; I will take a look later when I have some time.

zstone12 avatar Mar 14 '23 12:03 zstone12

LGTM +1. Fantastic work!

zstone12 avatar Mar 15 '23 12:03 zstone12

Please add the relevant documentation for the --accumulated_num parameter to the README :D

zstone12 avatar Mar 15 '23 12:03 zstone12

Can not test... (screenshot attached)

yihong0618 avatar Mar 16 '23 12:03 yihong0618

@hleft can this be merged now?

yihong0618 avatar Mar 16 '23 13:03 yihong0618

@yihong0618 yes ready

hleft avatar Mar 16 '23 13:03 hleft

Please check the new CI; if it fails we need to fix it quickly.

yihong0618 avatar Mar 16 '23 13:03 yihong0618