llama.cpp
Add llama.cpp docker support for non-latin languages
#1649
Thanks for this. Could you tell me if LC_ALL=C.UTF-8 also works? There may be a general solution that works for everyone rather than having a special case for LC_ALL and Chinese.
I ran more tests. In my environment (Win11 + WSL2 + Docker), Chinese works correctly after loading the model with any of the following settings:
LC_ALL=zh_CN.utf8
LC_ALL=C.utf8
LANG=zh_CN.utf8
LANG=C.utf8
Note that the value must name a locale that is actually present on the system, and it must match exactly. A misspelled or misformatted value results in an unknown character set being applied, and the model then fails to recognize Chinese.
After installing the Chinese language packs language-pack-zh-hans and language-pack-zh-hant, the following values are available:
# locale -a
C
C.utf8
POSIX
zh_CN.utf8
zh_HK.utf8
zh_SG.utf8
zh_TW.utf8
The same approach may also be worth trying for languages other than English and Chinese.
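The exact-match requirement described above can be sketched in Python. This is only an illustration and not part of the PR: `installed_locales` and `is_installed` are hypothetical helper names, and the sample list is copied from the `locale -a` output quoted in this comment. The comparison is deliberately strict; some C libraries may accept other spellings of the codeset, which this check does not model.

```python
import subprocess


def installed_locales():
    """Return the locale names printed by `locale -a` (POSIX systems only)."""
    result = subprocess.run(
        ["locale", "-a"], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def is_installed(value, available):
    """Exact, case-sensitive membership test against the installed names."""
    return value in available


# Sample list copied from the `locale -a` output quoted in this thread.
SAMPLE = ["C", "C.utf8", "POSIX", "zh_CN.utf8", "zh_HK.utf8", "zh_SG.utf8", "zh_TW.utf8"]
```

Inside the container, `is_installed("C.utf8", installed_locales())` would run the same check against the real list instead of the sample.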
I'd say we either add this to full.dockerfile, or we make a multilingual.dockerfile that includes all these languages. Either way we should also include this in the dockerfile:
ENV LC_ALL=C.utf8
However, I'm still not convinced we couldn't just do:
ENV LC_ALL=C.utf8
without the language packs, since it's just a console app.
@qingfengfenga Can you try only adding this to the dockerfile and see if Chinese works for you?
ENV LC_ALL=C.utf8
@DannyDaemonic
In my tests so far, as you suggested, the Chinese model works without any Chinese language packs installed, as long as the environment variable is set in the Dockerfile or when the docker container starts.
However, the problem described in #1649 appears in any of these cases: the default character set (POSIX) is in effect, the character set environment variable is set to a wrong value, or the variable is set but the terminal was not refreshed or the setting failed to apply.
(In containers the last case is the most common. The locale command reports the correct character set, yet the terminal has not been refreshed or the setting has not taken effect, which makes the problem confusing.)
One thing still has no reasonable explanation. In #1649 I tried changing the character set environment variable (LC_ALL=C.UTF-8) when starting the container, but the problem persisted. It was possibly caused by my test environment or by a mistake on my part.
Based on the tests so far, I think changing the default character set is enough to support Chinese models. Models in other languages need more testing.
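One way to see which locale a process would request, as opposed to what an interactive shell reports, is to apply the standard precedence of the locale variables to that process's own environment. The sketch below assumes the usual POSIX order for character handling (LC_ALL, then LC_CTYPE, then LANG) and treats an unset or empty chain as the POSIX default; `effective_ctype` is a hypothetical helper name, not something from llama.cpp.

```python
def effective_ctype(env):
    """Return the locale name that governs character handling for `env`.

    Precedence follows POSIX: LC_ALL overrides LC_CTYPE, which overrides LANG.
    Empty values count as unset. Falling back to "POSIX" is an assumption that
    matches the default described in this thread.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if value:
            return value
    return "POSIX"
```

Calling `effective_ctype(os.environ)` from a process started inside the container shows the name that process inherited. It reports only the requested name; whether the C library accepted it is a separate question.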
@qingfengfenga
I think the best solution is to add ENV LC_ALL=C.utf8 to both the full and lite dockerfile builds and not add any additional language packs at this time. This should support the languages I've tested and the Chinese you've tested.
As for the last issue, setting LC_ALL after the container starts: is that necessary if we use the ENV dockerfile modification? Also, you mention setting LC_ALL=C.UTF-8, but we've discovered that, at least as far as docker is concerned, we need LC_ALL=C.utf8 (the spelling is case-sensitive).
@DannyDaemonic Yes, I agree. Changing the default character set in the Dockerfile is enough, so no extra environment variable is needed when starting the Docker container unless there are special requirements.
As for how the character set name is written: different systems and language packs can provide different character set names and spellings. For the base image the Dockerfile currently uses (ubuntu:22.04), the value is LC_ALL=C.utf8. Keep this in mind if the base image changes.
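Because the spelling can differ between base images, a startup script could look up whichever spelling of the C UTF-8 locale the image actually provides rather than hard-coding one. This is a minimal sketch, assuming the list of names comes from `locale -a`; `normalize` and `pick_c_utf8` are hypothetical names introduced for illustration.

```python
def normalize(name):
    """Split 'lang.codeset' and fold the codeset to lowercase without hyphens."""
    lang, _, codeset = name.partition(".")
    return lang, codeset.replace("-", "").lower()


def pick_c_utf8(available):
    """Return the installed spelling of the C UTF-8 locale, or None if absent."""
    for name in available:
        if normalize(name) == ("C", "utf8"):
            return name
    return None
```

With the list quoted earlier in this thread, `pick_c_utf8` returns `C.utf8`; on an image that lists `C.UTF-8` instead, it returns that spelling unchanged.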