bilingual_book_maker
Cumulative translation: `--accumulated_num num` waits until the given number of tokens has accumulated before starting the translation.
The tricky part: I don't know how to reliably map the cumulative translation result back to the original paragraphs, so I use `sep = "\n\n\n\n\n"`. With `sep = "\n\n"`, separate paragraphs are often translated into the same line.
A possible bug: even with `"\n\n\n\n\n"` as the separator, paragraphs may still be merged into one line, which shifts where the translations appear. This only affects the last `num` characters; after that it returns to normal.
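The accumulate-then-split scheme above can be sketched like this (a minimal illustration with hypothetical helper names, not the actual make_book.py code):

```python
SEP = "\n\n\n\n\n"  # wide separator so the model keeps paragraphs apart

def batch_paragraphs(paragraphs, accumulated_num, count_tokens):
    """Group paragraphs until roughly accumulated_num tokens pile up."""
    batch, used = [], 0
    for p in paragraphs:
        n = count_tokens(p)
        if batch and used + n > accumulated_num:
            yield batch
            batch, used = [], 0
        batch.append(p)
        used += n
    if batch:
        yield batch

def split_translation(translated, expected):
    """Map one cumulative translation back onto the original paragraphs.
    If the model merged paragraphs, the count mismatches and we can retry."""
    parts = [s.strip() for s in translated.split(SEP) if s.strip()]
    if len(parts) != expected:
        raise ValueError(f"expected {expected} paragraphs, got {len(parts)}")
    return parts
```

With a narrower separator such as `"\n\n"` the split is far more likely to fail, which is why the five-newline separator is used.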
A nice effect is that the model sees more context, which makes the prompt more effective. I had tried various prompts asking it not to translate people's names, to no avail; with context it works (most of the time — in a few cases it still translates a name, but at least consistently to the same name).
left:
time OPENAI_API_SYS_MSG="You are an assistant who translates computer technology books but don't translate people's names, there is a blank line after each of your translation results" python3 "make_book.py" --book_name "$filepath" --openai_key "${openai_apikey_book}" --language "zh-hans" --accumulated_num 1500
right:
time python3 "make_book.py" --book_name "$filepath" --openai_key "${openai_apikey_book}" --language "zh-hans"
I'll take a look at this PR later.
It's fine to leave the function part to me, when I have time.
@yihong0618 OK. I'm trying to find a better prompt so that it translates the same number of paragraphs as the original, but there are still often problems.
@hleft FYI, I want to keep the original prompt as the default because it costs the fewest tokens (cc @ConanChou). Most users don't care how good the translation is — they just want it to help them read. Users can DIY their prompt, but we shouldn't change the default unless we find one that is better both for translation and for token cost.
@yihong0618 I mean we only modify the default prompt when `--accumulated_num` is enabled, because if we don't, you get an actual error, not just a bad translation (the translated result appears in the wrong place).
yep thanks~
Still testing. The main finding so far is that just modifying the prompt and retrying isn't enough; some special cases must be handled manually, such as `<sup>` tags and lines like `Source: xxx link`.
A useful change is to accumulate tokens instead of characters. For English to Simplified Chinese, if the max limit is 4096, I'd suggest `--accumulated_num 1600`.
It works fine for me now. Using `retry=15` to translate a 300-page epub, there was an ordering error only three times, with problem sizes 2, 1, 1. Problem size means how far from its correct position a translation appears: 1 means little impact, and the size-2 case was an API problem that more retries could fix.
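The retry behaviour described above might look like this (a sketch with hypothetical names, not the PR's actual code):

```python
import time

def translate_with_retry(batch, translate_fn, split_fn, retries=15, delay=2):
    """Re-request when the translation does not split back into the same
    number of paragraphs as the original batch (an 'order problem')."""
    last_err = None
    for _ in range(retries):
        try:
            return split_fn(translate_fn(batch), len(batch))
        except ValueError as err:  # paragraph-count mismatch
            last_err = err
            time.sleep(delay)
    raise RuntimeError(f"giving up after {retries} retries: {last_err}")
```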
When not using `--accumulated_num`, this PR has the following effects:
It writes the result file each time a file is translated, instead of waiting until after Ctrl-C. I think this makes more sense, as I often want to see results while translating rather than waiting for the interrupt. (The parts that don't need translation are collected first and written after translation, so the frequent writes don't hurt performance.)
The output now also shows the translation time and the tokens consumed.
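The write-as-you-go behaviour can be sketched like this (hypothetical names; just an illustration of the idea):

```python
def translate_files(files, translate_file, out_path):
    """Write the accumulated result after every file instead of only
    when the run is interrupted, so partial output is always on disk."""
    results = []
    for f in files:
        results.append(translate_file(f))
        # one rewrite per file is cheap next to the API calls it follows
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write("\n".join(results))
```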
I want to merge first and clean up/refactor later — what do you think? @yihong0618
For me: I don't really want to break the original rules. What I'm thinking is to keep the old way and add args for the new ways.
And I'm a little busy with work... maybe I'll take a look tomorrow.
And for the token part, we have #106 now.
All function names should follow PEP 8 (https://peps.python.org/pep-0008/): `isFigure` -> `is_figure`, and the same for the others.
@zstone12 do you have time to help me take a look?
@yihong0618 I did the renaming and restored the way the books are written. Now the only difference when not using `--accumulated_num` is that it outputs the token count and time.
I looked at https://github.com/yihong0618/bilingual_book_maker/pull/106, and there seems to be a problem with its logic: OpenAI's token limit applies to input + output, while that PR seems to count input only.
If the request raises an exception, or `completion["choices"][0]["finish_reason"] == "length"`, then the maximum length was reached — it is not possible to know before sending whether the 4096-token limit will be hit.
For `--accumulated_num num` I would suggest 1600: the translated content may come to around 2200 tokens, and together with the system message and user prompt tokens that almost reaches the limit.
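To illustrate the input + output accounting above (the 1.4x expansion factor is my own assumption, back-derived from the ~1600 -> ~2200 numbers; `hit_length_limit` only inspects the standard completion fields):

```python
MODEL_LIMIT = 4096  # OpenAI's limit covers prompt tokens + completion tokens

def fits_budget(prompt_tokens, accumulated_num=1600, expansion=1.4):
    """Check that input plus the expected translated output stay under
    the shared limit; zh-hans output can run ~1.4x the English input."""
    expected_output = int(accumulated_num * expansion)
    return prompt_tokens + expected_output <= MODEL_LIMIT

def hit_length_limit(completion):
    """finish_reason == 'length' means the reply was cut off at max tokens."""
    return completion["choices"][0]["finish_reason"] == "length"
```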
> @zstone12 do you have time to help me take a look?

@yihong0618 I'm also a bit busy with work; I'll take a look later when I have some time.
LGTM +1. Fantastic work!
Please add the relevant content for the `--accumulated_num` parameter to the README :D
I can't test it...
@hleft can this be merged now?
@yihong0618 yes, it's ready
Please check the new CI; if it fails, we need to fix it quickly.