Decoder method for huggingface tokenizer

Open mehmetcalikus opened this issue 2 years ago • 7 comments

Hello,

I use the encoding and tokenization parts of the Hugging Face tokenizer class, and they work exactly the same as in Python. But when using a BERT model as a language model in a word suggestion and spelling correction service, I need the tokenizer's decode method. I could not find a decode method for the Hugging Face tokenizer. Do you have any suggestions for how to decode with the Hugging Face tokenizer?

Thanks in advance.

mehmetcalikus avatar Jul 25 '22 10:07 mehmetcalikus

You don't really need a decode function. When you call tokenizer.encode(), an Encoding object is returned, and you can call encoding.getTokens() to get the tokens back.

Then you call tokenizer.buildSentence() to get the final result.

frankfliu avatar Jul 25 '22 21:07 frankfliu

The problem with the method you mentioned is that I need to know the text beforehand in order to use it, and I do not know what the text is. I'm using BERT's language model head for word suggestion rather than masked-token prediction.

I need the text itself to create the Encoding object you mentioned: Encoding encoding = tokenizer.encode("Some strings"). But I don't know what "Some strings" is. I am trying to recover the tokens from the prediction the model gives me, which is a list of integer token ids. So your method doesn't work for me.

I need something similar to tokenizer.decode(token_ids) in Python. Do you have any suggestions for doing this in DJL? I searched the library extensively but couldn't find it.
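
For readers unfamiliar with what decoding from ids involves, here is a minimal plain-Java sketch of the idea: look up each id in the vocabulary, optionally drop special tokens, and glue WordPiece continuation pieces (prefixed with "##") onto the previous piece. It is not DJL's or Hugging Face's implementation. The class name, the tiny vocabulary, and the special-token set are hypothetical and chosen only for illustration.

```java
import java.util.Map;
import java.util.Set;

public class WordPieceDecodeSketch {

    // Assumption: special tokens follow the usual BERT naming convention.
    static final Set<String> SPECIAL_TOKENS = Set.of("[PAD]", "[CLS]", "[SEP]", "[MASK]");

    /** Turns token ids back into text using an id-to-token vocabulary. */
    static String decode(long[] ids, Map<Long, String> idToToken, boolean skipSpecialTokens) {
        StringBuilder sb = new StringBuilder();
        for (long id : ids) {
            // Ids missing from the vocabulary are rendered as [UNK].
            String token = idToToken.getOrDefault(id, "[UNK]");
            if (skipSpecialTokens && SPECIAL_TOKENS.contains(token)) {
                continue;
            }
            if (token.startsWith("##")) {
                // Continuation piece: append to the previous piece without a space.
                sb.append(token.substring(2));
            } else {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(token);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical six-entry vocabulary; a real BERT vocabulary has tens of thousands.
        Map<Long, String> vocab = Map.of(
                0L, "[PAD]", 1L, "[CLS]", 2L, "[SEP]",
                3L, "spell", 4L, "##ing", 5L, "check");
        long[] predictedIds = {1L, 3L, 4L, 5L, 2L};
        System.out.println(decode(predictedIds, vocab, true)); // prints "spelling check"
    }
}
```

A real tokenizer also handles details this sketch ignores, such as cleaning up spaces around punctuation, so a library-provided decode is preferable when one is available.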

mehmetcalikus avatar Jul 28 '22 11:07 mehmetcalikus

@mehmetcalikus I agree, we should expose this API. We are working on exposing more tokenizer APIs:

  1. decode
  2. setPadding
  3. setTruncation

frankfliu avatar Jul 28 '22 17:07 frankfliu

Hi @mehmetcalikus - I'm currently working on exposing the decode and batch_decode functions in Java. I will keep this thread updated.

siddvenk avatar Jul 28 '22 17:07 siddvenk

@mehmetcalikus The tokenizer.decode(...) method has been added to DJL. You should be able to pull it from the latest SNAPSHOT.

Let me know if you have any issues using it.

siddvenk avatar Aug 01 '22 16:08 siddvenk

Hi @siddvenk, first of all, thanks for your quick reply.

I updated the version of the ai.djl package I use to the one in here. In addition, I added the Hugging Face tokenizers dependency to the pom.xml of my Spring project, as below.

<dependency>
	<groupId>ai.djl.huggingface</groupId>
	<artifactId>tokenizers</artifactId>
	<version>0.18.0-SNAPSHOT</version>
</dependency>

But the decode method is still not available in the library. How can I solve this problem?

mehmetcalikus avatar Aug 03 '22 12:08 mehmetcalikus

The changes I made are only available in version 0.19.0-SNAPSHOT. Try updating to that version and let me know if that works.

siddvenk avatar Aug 03 '22 15:08 siddvenk

Feel free to reopen the issue if you still have questions.

frankfliu avatar Aug 22 '22 21:08 frankfliu