
Token ids generated instead of translation

Open ahmedoumar opened this issue 2 years ago • 6 comments

Hey there, I hope you're doing fine. When I run turj.translate, it returns the token ids instead of the actual translation (see the output below):

2022-07-07 10:41:43 | INFO | turjuman.translate | Using beam search
tensor([[ 0, 6538, 2, 76, 6380, 1]])

ahmedoumar, Jul 07 '22 11:07

Hi Ahmed, could you please provide us with more details, such as your input sentence and a screenshot? Thanks.

elmadany, Jul 07 '22 17:07

[Screenshot from 2022-07-07 12-02-16]

As you can see, turj.translate returns the output ids instead of the translation. I solved this by using the tokenizer to decode the ids back to text: tokenizer.decode(target, skip_special_tokens=True, clean_up_tokenization_spaces=True)

ahmedoumar, Jul 07 '22 17:07

To integrate Turjuman with your Python code, take a look at this notebook: https://colab.research.google.com/github/UBC-NLP/turjuman/blob/main/examples/Integrate_turjuman_with_your_code.ipynb Thanks.

elmadany, Jul 07 '22 17:07

When you run that notebook, you get only the target ids, as shown in the screenshot.

ahmedoumar, Jul 07 '22 17:07

Thanks, Ahmed. We will check this soon.

elmadany, Jul 07 '22 17:07

Quick fix: result = torj.tokenizer.batch_decode(target, skip_special_tokens=True)

kabapy, Sep 11 '22 19:09
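
The workaround described in this thread (decode the returned ids with the tokenizer, skipping special tokens) can be sketched roughly as follows. This is only an illustration under assumptions: decode_translations is a hypothetical helper, not part of Turjuman, and ToyTokenizer with its made-up vocabulary is a stand-in so the sketch runs offline. With the real library you would pass the tokenizer object mentioned above (torj.tokenizer), which the thread reports as exposing batch_decode.

```python
from typing import Any, List


def decode_translations(tokenizer: Any, target: Any) -> List[str]:
    """Turn generated token ids into text (hypothetical helper).

    Assumes `tokenizer` exposes a Hugging Face style `batch_decode`, and
    that `target` is the id tensor (or nested list) returned by translate.
    """
    # Tensors expose .tolist(); plain nested lists pass through unchanged.
    ids = target.tolist() if hasattr(target, "tolist") else target
    texts = tokenizer.batch_decode(ids, skip_special_tokens=True)
    return [text.strip() for text in texts]


class ToyTokenizer:
    """Stand-in tokenizer with an invented vocabulary; NOT Turjuman's."""

    vocab = {0: "<s>", 1: "</s>", 10: "good", 11: "morning"}  # made-up ids
    special_ids = {0, 1}

    def batch_decode(self, sequences, skip_special_tokens=False):
        decoded = []
        for seq in sequences:
            # Drop special tokens only when asked to, then join the rest.
            pieces = [
                self.vocab[i]
                for i in seq
                if not (skip_special_tokens and i in self.special_ids)
            ]
            decoded.append(" ".join(pieces))
        return decoded


if __name__ == "__main__":
    print(decode_translations(ToyTokenizer(), [[0, 10, 11, 1]]))
```

Using batch_decode rather than decode keeps the helper working when several sentences are translated at once, since it returns one string per row of ids.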