
In stream mode, English words have no spaces after detokenization, and Chinese text is messed up

Open lucasjinreal opened this issue 2 years ago • 14 comments

[screenshot]

How can I resolve this problem?

lucasjinreal avatar Jun 01 '23 08:06 lucasjinreal

Hi @lucasjinreal. We need more information in order to assist you in resolving the issue.

May I ask which model you are using? Are you using it through the API or through Python?

peakji avatar Jun 01 '23 08:06 peakji

@peakji I think it's not related to the model. For the model I'm simply using LLaMA.

The reason is that decoding a single ID by itself can give a different result from decoding all the IDs of a sentence at once; the tokenizer behaves differently in the two cases.

For instance, for the IDs [34, 56, 656], the tokenizer would decode them together as: I love u

But if you decode them one by one, you get: Iloveu

It doesn't preserve the spaces, and Chinese characters come out even worse.

However, I'm not sure whether this is really the cause.

But the above are the problems I am actually seeing.

What do you think? (Mainly, simple words lose the spaces they have in the original, and Chinese text is decoded incorrectly.)
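The space-loss behavior described above can be illustrated without loading a real model. The sketch below uses a made-up three-entry vocabulary that mimics how SentencePiece-style tokenizers (like LLaMA's) mark a leading space with the "▁" piece prefix; the `VOCAB` table and `decode` helper are hypothetical, not part of any library:

```python
# Toy illustration of why decoding token IDs one at a time loses spaces.
# SentencePiece-style tokenizers encode a leading space as the "▁" marker;
# decoding a full sequence restores the spaces, but decoding each token in
# isolation strips its leading marker, so the space between words is lost.
# The vocabulary below is made up purely for illustration.
VOCAB = {34: "\u2581I", 56: "\u2581love", 656: "\u2581u"}

def decode(ids):
    """Mimic SentencePiece decoding: join pieces, turn '▁' into spaces,
    and strip the sequence's single leading space."""
    text = "".join(VOCAB[i] for i in ids)
    return text.replace("\u2581", " ").lstrip(" ")

ids = [34, 56, 656]

# Decoding the whole sequence keeps the inner spaces:
print(decode(ids))                        # -> "I love u"

# Decoding one ID at a time strips each piece's leading space marker:
print("".join(decode([i]) for i in ids))  # -> "Iloveu"
```

A stateful stream detokenizer works around this by tracking what has already been emitted, so each new chunk is the difference between two full decodes rather than an isolated per-token decode.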

lucasjinreal avatar Jun 01 '23 09:06 lucasjinreal

Or maybe something is missing inside your StreamTokenizer? (e.g. it ignores some IDs). Can you try decoding the IDs one by one and printing them?

outputs = []
for oid in output_ids:
    # Decode each generated ID in isolation and print it immediately.
    word = tokenizer.decode(oid[0])
    print(word, end='')
    outputs.append(word)
print()
outputs = ''.join(outputs)

Never mind, I was wrong about that.

lucasjinreal avatar Jun 01 '23 09:06 lucasjinreal

The reason is that decoding a single ID by itself can give a different result from decoding all the IDs of a sentence at once.

StreamTokenizer is specifically designed to handle this properly.

There is an example of the LLaMA tokenizer in the test case, which also includes Chinese characters:

https://github.com/hyperonym/basaran/blob/master/tests/test_tokenizer.py#L48
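The Chinese-specific corruption has a second cause worth noting: a single CJK character is several UTF-8 bytes, and those bytes can be split across tokens, so naive per-token decoding emits replacement characters. The sketch below shows the buffering idea using Python's standard-library incremental decoder; this is only an illustration of the principle, not basaran's actual StreamTokenizer code:

```python
# Minimal sketch of the buffering needed for streaming CJK output:
# multi-byte UTF-8 characters may be split across token boundaries,
# so bytes must be held back until they form complete characters.
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# "你好" is 6 bytes (你 = E4 BD A0, 好 = E5 A5 BD); pretend a model
# emits them in arbitrary chunks, as token boundaries might split them.
chunks = [b"\xe4", b"\xbd\xa0\xe5\xa5", b"\xbd"]

out = []
for chunk in chunks:
    # decode() returns only the complete characters so far and
    # silently buffers any trailing partial byte sequence.
    out.append(decoder.decode(chunk))
print("".join(out))  # -> "你好"
```

Decoding each chunk independently with `bytes.decode("utf-8", errors="replace")` would instead produce "�" garbage, which matches the symptom in the screenshots.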

peakji avatar Jun 01 '23 10:06 peakji

@peakji Thanks. I am just using the tokenizer from StreamModel, and the Chinese decoding errors still exist.

[screenshot]

And I still cannot get the spaces between English words.

I think the output stream has some problems. How can I combine the model and tokenizer and print the correct words in the terminal?

lucasjinreal avatar Jun 01 '23 10:06 lucasjinreal

Here's a simple example for using Basaran as a Python library: https://github.com/hyperonym/basaran/blob/master/examples/basaran-python-library/main.py

peakji avatar Jun 02 '23 00:06 peakji

I got no spaces, and the Chinese was wrong as well (try print(word, end='')).

I don't want a line break after every word, and I don't want unexpected spaces in non-English text.

lucasjinreal avatar Jun 02 '23 03:06 lucasjinreal

Could you please provide some example code for us to reproduce the issue?

The output in your first screenshot is apparently not from StreamTokenizer.

peakji avatar Jun 02 '23 03:06 peakji

@peakji The second one is. I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.

Could you provide a demo that prints the values correctly without line breaks? (i.e., correctly printing the words one by one)

lucasjinreal avatar Jun 02 '23 05:06 lucasjinreal

I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.

You shouldn't use model.tokenizer directly because it's not a stateful StreamTokenizer but a stateless Huggingface tokenizer.

The correct way could be either:

a. Call the model directly without the need for manual detokenization: https://github.com/hyperonym/basaran/blob/5ef5ef006b2acd59d0512409e41e693b142aef66/examples/basaran-python-library/main.py#L8-L9

b. Create an instance of StreamTokenizer and use that instead: https://github.com/hyperonym/basaran/blob/5ef5ef006b2acd59d0512409e41e693b142aef66/tests/test_tokenizer.py#L54-L61

peakji avatar Jun 02 '23 05:06 peakji

@peakji Thank you! I have solved the first problem.

The English seems OK now, but the Chinese is still not OK:

[screenshot]

Some Chinese characters are OK, but some still come out with a weird encoding.

lucasjinreal avatar Jun 02 '23 06:06 lucasjinreal

Some \n characters which are actually needed seem to be trimmed:

[screenshot]

lucasjinreal avatar Jun 02 '23 06:06 lucasjinreal

I resolved the \n issue, but the Chinese clearly doesn't always work:

[screenshot]

Please test this more thoroughly!

lucasjinreal avatar Jun 02 '23 06:06 lucasjinreal

We need more information to assist you in resolving the issue. These screenshots alone don't provide much valuable information.

Could you please provide the code you are testing for us to reproduce?

peakji avatar Jun 02 '23 09:06 peakji