
In stream mode, English words have no spaces after detokenization, and Chinese text is messed up

Open lucasjinreal opened this issue 2 years ago • 14 comments

[screenshot]

How can I resolve this problem?

lucasjinreal avatar Jun 01 '23 08:06 lucasjinreal

Hi @lucasjinreal. We need more information in order to assist you in resolving the issue.

May I ask which model you are using? Are you using it through the API or through Python?

peakji avatar Jun 01 '23 08:06 peakji

@peakji I think it's not related to the model. For the model I'm simply using LLaMA.

The reason is that decoding a single ID by itself can give a different result from decoding all the IDs of a sentence at once; the tokenizer behaves differently in the two cases.

For instance, for the IDs [34, 56, 656], the tokenizer would decode them together as: I love u

But if you decode them one by one, you get: Iloveu

It doesn't preserve the spaces, and Chinese characters come out even worse.

However, I'm not sure whether this is really the cause.

But the above are the problems I am actually seeing.

What do you think? (Mainly, simple words lose the spaces they have in the original, and Chinese text is decoded incorrectly.)
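The space-loss behavior described above can be illustrated without loading a real model. The sketch below uses a made-up three-entry vocabulary that mimics how SentencePiece-style tokenizers (like LLaMA's) mark a leading space with the "▁" piece prefix; the `VOCAB` table and `decode` helper are hypothetical, not part of any library:

```python
# Toy illustration of why decoding token IDs one at a time loses spaces.
# SentencePiece-style tokenizers encode a leading space as the "▁" marker;
# decoding a full sequence restores the spaces, but decoding each token in
# isolation strips its leading marker, so the space between words is lost.
# The vocabulary below is made up purely for illustration.
VOCAB = {34: "\u2581I", 56: "\u2581love", 656: "\u2581u"}

def decode(ids):
    """Mimic SentencePiece decoding: join pieces, turn '▁' into spaces,
    and strip the sequence's single leading space."""
    text = "".join(VOCAB[i] for i in ids)
    return text.replace("\u2581", " ").lstrip(" ")

ids = [34, 56, 656]

# Decoding the whole sequence keeps the inner spaces:
print(decode(ids))                        # -> "I love u"

# Decoding one ID at a time strips each piece's leading space marker:
print("".join(decode([i]) for i in ids))  # -> "Iloveu"
```

A stateful stream detokenizer works around this by tracking what has already been emitted, so each new chunk is the difference between two full decodes rather than an isolated per-token decode.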

lucasjinreal avatar Jun 01 '23 09:06 lucasjinreal

Or maybe something is missing inside your StreamTokenizer? (e.g. it ignores some IDs). Can you try decoding the IDs one by one and printing them?

outputs = []
for oid in output_ids:
    # Decode each generated ID in isolation and print it immediately.
    word = tokenizer.decode(oid[0])
    print(word, end='')
    outputs.append(word)
print()
outputs = ''.join(outputs)

Never mind, I was wrong about that.

lucasjinreal avatar Jun 01 '23 09:06 lucasjinreal

The reason is that decoding a single ID by itself can give a different result from decoding all the IDs of a sentence at once.

StreamTokenizer is specifically designed to handle this properly.

There is an example of the LLaMA tokenizer in the test case, which also includes Chinese characters:

https://github.com/hyperonym/basaran/blob/master/tests/test_tokenizer.py#L48
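The Chinese-specific corruption has a second cause worth noting: a single CJK character is several UTF-8 bytes, and those bytes can be split across tokens, so naive per-token decoding emits replacement characters. The sketch below shows the buffering idea using Python's standard-library incremental decoder; this is only an illustration of the principle, not basaran's actual StreamTokenizer code:

```python
# Minimal sketch of the buffering needed for streaming CJK output:
# multi-byte UTF-8 characters may be split across token boundaries,
# so bytes must be held back until they form complete characters.
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# "你好" is 6 bytes (你 = E4 BD A0, 好 = E5 A5 BD); pretend a model
# emits them in arbitrary chunks, as token boundaries might split them.
chunks = [b"\xe4", b"\xbd\xa0\xe5\xa5", b"\xbd"]

out = []
for chunk in chunks:
    # decode() returns only the complete characters so far and
    # silently buffers any trailing partial byte sequence.
    out.append(decoder.decode(chunk))
print("".join(out))  # -> "你好"
```

Decoding each chunk independently with `bytes.decode("utf-8", errors="replace")` would instead produce "�" garbage, which matches the symptom in the screenshots.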

peakji avatar Jun 01 '23 10:06 peakji

@peakji Thanks. I am just using the tokenizer from StreamModel, and the Chinese decoding errors still exist.

[screenshot]

And I still cannot get the spaces between English words.

I think the output stream has some problems. How can I combine the model and tokenizer and print the correct words in the terminal?

lucasjinreal avatar Jun 01 '23 10:06 lucasjinreal

Here's a simple example for using Basaran as a Python library: https://github.com/hyperonym/basaran/blob/master/examples/basaran-python-library/main.py

peakji avatar Jun 02 '23 00:06 peakji

I got no spaces, and the Chinese was wrong as well (try print(word, end='')).

I don't want a line break after every word, and I don't want unexpected spaces in non-English text.

lucasjinreal avatar Jun 02 '23 03:06 lucasjinreal

Could you please provide some example code for us to reproduce the issue?

The output in your first screenshot is apparently not from StreamTokenizer.

peakji avatar Jun 02 '23 03:06 peakji

@peakji The second one is. I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.

Could you provide a demo that prints the values correctly without line breaks? (i.e., correctly printing the words one by one)

lucasjinreal avatar Jun 02 '23 05:06 lucasjinreal

I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.

You shouldn't use model.tokenizer directly because it's not a stateful StreamTokenizer but a stateless Huggingface tokenizer.

The correct way could be either:

a. Call the model directly without the need for manual detokenization: https://github.com/hyperonym/basaran/blob/5ef5ef006b2acd59d0512409e41e693b142aef66/examples/basaran-python-library/main.py#L8-L9

b. Create an instance of StreamTokenizer and use that instead: https://github.com/hyperonym/basaran/blob/5ef5ef006b2acd59d0512409e41e693b142aef66/tests/test_tokenizer.py#L54-L61

peakji avatar Jun 02 '23 05:06 peakji

@peakji Thank you! I have solved the first problem.

The English seems OK now, but the Chinese is still not OK:

[screenshot]

Some Chinese characters are OK, but some still come out with a weird encoding.

lucasjinreal avatar Jun 02 '23 06:06 lucasjinreal

Some \n characters which are actually needed seem to be trimmed:

[screenshot]

lucasjinreal avatar Jun 02 '23 06:06 lucasjinreal

I resolved the \n issue, but the Chinese clearly doesn't always work:

[screenshot]

Please test this more thoroughly!

lucasjinreal avatar Jun 02 '23 06:06 lucasjinreal

We need more information to assist you in resolving the issue. These screenshots alone don't provide much valuable information.

Could you please provide the code you are testing for us to reproduce?

peakji avatar Jun 02 '23 09:06 peakji