whisper.cpp
command app for Chinese
I use the examples/command app. If I define command.txt in Chinese, e.g.:
打开
关闭
In guided mode, it core dumps.
The output:
process_command_list: allowed commands [ tokens ]:
- 打开 = [ ]
- 关闭 = [ ]
- 加热 = [ ]
- 停止 = [ ]
process_command_list: prompt: 'select one from the available words: 打开, 关闭, 加热, 停止. selected word: '
process_command_list: tokens: [ 790 557 472 490 264 2435 2283 25 220 12467 18937 11 220 28053 8259 255 11 220 9990 23661 255 11 220 36135 30438 13 8209 1349 25 220 ]
The problem is that the token lists are empty. Why is this list empty?
- 打开 = [ ]
I suspected a UTF-8 splitting problem, so I made a change:
std::vector<std::string> split_utf8(std::string s) {
    std::vector<std::string> t;
    for (size_t i = 0; i < s.length();)
    {
        int cplen = 1;
        // For the checks below, see https://en.wikipedia.org/wiki/UTF-8#Description
        if ((s[i] & 0xf8) == 0xf0) // mask 11111000, lead byte 11110xxx
            cplen = 4;
        else if ((s[i] & 0xf0) == 0xe0) // lead byte 1110xxxx
            cplen = 3;
        else if ((s[i] & 0xe0) == 0xc0) // lead byte 110xxxxx
            cplen = 2;
        // Truncated sequence at the end of the string: fall back to one byte
        if ((i + cplen) > s.length())
            cplen = 1;
        t.push_back(s.substr(i, cplen));
        i += cplen;
    }
    return t;
}
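For reference, a minimal sketch of the same idea that returns the cumulative prefixes directly, cut only at code point boundaries. In UTF-8, "打开" is 6 bytes (two 3-byte characters), so a byte-by-byte prefix loop would produce partial characters. The names utf8_sequence_length and utf8_prefixes are hypothetical helpers, not part of whisper.cpp.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`. Returns 1 for ASCII and for bytes that are not a lead byte.
static size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    return 1;
}

// Hypothetical helper: cumulative prefixes of `s`, one per code point,
// so no prefix ever ends in the middle of a multi-byte character.
std::vector<std::string> utf8_prefixes(const std::string & s) {
    std::vector<std::string> prefixes;
    for (size_t i = 0; i < s.size();) {
        size_t len = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (i + len > s.size()) {
            len = 1; // truncated sequence: fall back to a single byte
        }
        i += len;
        prefixes.push_back(s.substr(0, i));
    }
    return prefixes;
}
```

Called on "打开", this yields two prefixes, "打" (3 bytes) and "打开" (6 bytes). Each one could then be prefixed with a space and passed to the tokenizer.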
In process_command_list, I made these changes:
for (const auto & cmd : allowed_commands) {
    whisper_token tokens[1024];
    allowed_tokens.emplace_back();
    std::vector<std::string> t = split_utf8(cmd);
    for (int l = 0; l < (int) t.size(); ++l) {
        // NOTE: very important to add the whitespace !
        // the reason is that the first decoded token starts with a whitespace too!
        std::string ss = std::string(" ");
        for (auto i = 0; i < l + 1; i++)
        {
            ss = ss + t[i];
        }
        const int n = whisper_tokenize(ctx, ss.c_str(), tokens, 1024);
        if (n < 0) {
            fprintf(stderr, "%s: error: failed to tokenize command '%s'\n", __func__, cmd.c_str());
            return 3;
        }
        if (n == 1) {
            allowed_tokens.back().push_back(tokens[0]);
        }
    }
    max_len = std::max(max_len, (int) cmd.size());
}
But the token lists for the Chinese commands are still empty.
Where is the problem?
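One observation from the log above: in the prompt tokens, each Chinese word appears to be preceded by token 220 (for example `220 12467 18937` where " 打开" would be). This suggests the leading space is tokenized on its own, in which case " 打" would yield at least two tokens and the `n == 1` check would never pass. This is an inference from the log, not something verified against the tokenizer. Whatever the cause, a guard like the sketch below could report commands that ended up with no tokens before guided mode uses them. It assumes the crash comes from indexing an empty token list. commands_without_tokens is a hypothetical helper, and `int` stands in for whisper_token.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of whisper.cpp): returns the commands whose
// token list is empty or missing, so the caller can print an error and exit
// instead of indexing into an empty vector later.
// Assumption: `int` stands in for whisper_token.
std::vector<std::string> commands_without_tokens(
        const std::vector<std::string> & allowed_commands,
        const std::vector<std::vector<int>> & allowed_tokens) {
    std::vector<std::string> missing;
    for (size_t i = 0; i < allowed_commands.size(); ++i) {
        if (i >= allowed_tokens.size() || allowed_tokens[i].empty()) {
            missing.push_back(allowed_commands[i]);
        }
    }
    return missing;
}
```

If the returned list is non-empty after process_command_list has filled allowed_tokens, the program could print the affected commands and return an error code rather than continue into guided mode.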
Which language have you selected? The default is English.
I use -l zh:
./command -m ./models/ggml-small.bin -l zh -t 8 -cmd ./examples/command/test.txt
Any progress?