
command app for Chinese

joyhope opened this issue 1 year ago · 4 comments

I use the examples/command app. If I define command.txt in Chinese, e.g.

打开
关闭

then in guided mode it core dumps.

The output:

process_command_list: allowed commands [ tokens ]:

  - 打开 = [ ]
  - 关闭 = [ ]
  - 加热 = [ ]
  - 停止 = [ ]

process_command_list: prompt: 'select one from the available words: 打开, 关闭, 加热, 停止. selected word: '
process_command_list: tokens: [ 790 557 472 490 264 2435 2283 25 220 12467 18937 11 220 28053 8259 255 11 220 9990 23661 255 11 220 36135 30438 13 8209 1349 25 220 ]

The problem is that the token lists are empty. Why?

  - 打开 = [ ]
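
For reference, something like this should show what whisper_tokenize does with one command on its own (a rough sketch, not the exact code I ran; the model path is just the one I load for the command app):

#include <cstdio>
#include "whisper.h"

// rough check: how many tokens does whisper_tokenize produce for one Chinese command?
int main() {
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context * ctx = whisper_init_from_file_with_params("./models/ggml-small.bin", cparams);
    if (ctx == nullptr) {
        return 1;
    }

    whisper_token tokens[32];
    // the command example prepends a whitespace before tokenizing
    const int n = whisper_tokenize(ctx, " 打开", tokens, 32);
    printf("n = %d\n", n); // for Chinese text this is typically > 1 (several BPE tokens per character)

    whisper_free(ctx);
    return 0;
}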

joyhope · Feb 21 '24 06:02

I guess it is a UTF-8 split problem, so I made a change:

#include <string>
#include <vector>

// split a UTF-8 string into individual code points
std::vector<std::string> split_utf8(std::string s) {
    std::vector<std::string> t;
    for (size_t i = 0; i < s.length();) {
        int cplen = 1;
        // the byte-length checks follow the encoding table at
        // https://en.wikipedia.org/wiki/UTF-8#Description
        if ((s[i] & 0xf8) == 0xf0)      // 11110xxx -> 4-byte sequence
            cplen = 4;
        else if ((s[i] & 0xf0) == 0xe0) // 1110xxxx -> 3-byte sequence
            cplen = 3;
        else if ((s[i] & 0xe0) == 0xc0) // 110xxxxx -> 2-byte sequence
            cplen = 2;
        if ((i + cplen) > s.length())   // truncated sequence at end of string
            cplen = 1;
        t.push_back(s.substr(i, cplen));
        i += cplen;
    }
    return t;
}
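
A quick sanity check of the splitter (illustrative only, assuming split_utf8 above is in the same file and the source is UTF-8 encoded):

#include <cstdio>

int main() {
    // "打开" is 6 bytes in UTF-8 and should split into 2 code points
    std::vector<std::string> parts = split_utf8("打开");
    printf("%zu code points\n", parts.size()); // expected: 2
    for (const auto & p : parts) {
        printf("  '%s' (%zu bytes)\n", p.c_str(), p.size());
    }
    return 0;
}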

In process_command_list, I made some changes:

    for (const auto & cmd : allowed_commands) {
        whisper_token tokens[1024];
        allowed_tokens.emplace_back();

        // split the command into UTF-8 code points instead of bytes
        std::vector<std::string> t = split_utf8(cmd);

        for (int l = 0; l < (int) t.size(); ++l) {
            // NOTE: very important to add the whitespace !
            //       the reason is that the first decoded token starts with a whitespace too!
            std::string ss = std::string(" ");
            for (int i = 0; i < l + 1; i++) {
                ss += t[i];
            }

            const int n = whisper_tokenize(ctx, ss.c_str(), tokens, 1024);
            if (n < 0) {
                fprintf(stderr, "%s: error: failed to tokenize command '%s'\n", __func__, cmd.c_str());
                return 3;
            }
            // only prefixes that map to exactly one token are kept
            if (n == 1) {
                allowed_tokens.back().push_back(tokens[0]);
            }
        }

        max_len = std::max(max_len, (int) cmd.size());
    }

but the Chinese token output is still empty.

Where is the problem?
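
For reference, one variation I considered (untested) is to keep every token the full command produces, instead of requiring each prefix to map to exactly one token, roughly like this inside the same loop over allowed_commands:

        // variation (untested): tokenize the whole command once and keep all of its tokens,
        // instead of keeping only character prefixes that map to exactly one token
        const int n = whisper_tokenize(ctx, (" " + cmd).c_str(), tokens, 1024);
        if (n < 0) {
            fprintf(stderr, "%s: error: failed to tokenize command '%s'\n", __func__, cmd.c_str());
            return 3;
        }
        for (int i = 0; i < n; ++i) {
            allowed_tokens.back().push_back(tokens[i]);
        }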

joyhope · Feb 21 '24 07:02

Which language have you selected? The default is English.

bobqianic · Feb 21 '24 20:02

I use -l zh:

./command -m ./models/ggml-small.bin -l zh -t 8 -cmd ./examples/command/test.txt

joyhope · Feb 22 '24 04:02

Any progress?

qiangxinglin · Mar 30 '24 13:03