djl
Decoder method for huggingface tokenizer
Hello,
I use the encoding and tokenization parts of the HuggingFace tokenizer class, and they work exactly the same as in Python. But when using a BERT model as a language model in a word suggestion and spelling correction service, I need the tokenizer's decode method. So far I have not been able to find a decode method for the HuggingFace tokenizer. Do you have any suggestions for how to decode with the HuggingFace tokenizer?
Thanks in advance.
You don't really need a decode function. When you call tokenizer.encode(), an Encoding object is returned, and you can call encoding.getTokens() to get the tokens back.
You can then call tokenizer.buildSentence() to get the final result.
The problem with the method you mentioned is that I need to know the text beforehand in order to use it, and I do not know what the text is. I'm using BERT's language model head for word suggestion instead of masked-token prediction.
To create the Encoding object you describe, I need the text itself: Encoding encoding = tokenizer.encode("Some strings"). But I don't know what "Some strings" is. I am trying to recover the tokens from the prediction the model gives me, which is a list of integer token IDs. So your method doesn't work for me.
I need something with logic similar to tokenizer.decode(token_ids) in Python. Do you have any suggestions for doing this in DJL? I searched the library extensively but couldn't find it.
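To make the request concrete, here is a minimal Java sketch of what decoding from token IDs involves for a WordPiece-style vocabulary: look up each ID, optionally drop special tokens, and glue "##" continuation pieces onto the previous word. It is only an illustration under stated assumptions, not DJL's or Hugging Face's implementation. The class name, the six-entry vocabulary, and its ID values are all hypothetical, and real decoders also handle details such as punctuation spacing.

```java
import java.util.Map;

/** Toy illustration of turning predicted token IDs back into text. */
class WordPieceDecodeSketch {

    // Hypothetical miniature vocabulary (assumption); a real BERT vocabulary
    // is loaded from the tokenizer files and has tens of thousands of entries.
    static final Map<Long, String> ID_TO_TOKEN = Map.of(
            0L, "[CLS]",
            1L, "[SEP]",
            2L, "hello",
            3L, "token",
            4L, "##izer",
            5L, "world");

    /**
     * Maps each ID to its token and joins the tokens into a string.
     * Tokens starting with "##" are appended to the previous word without a space.
     * IDs missing from the vocabulary become "[UNK]", which counts as special here.
     */
    static String decode(long[] ids, boolean skipSpecialTokens) {
        StringBuilder text = new StringBuilder();
        for (long id : ids) {
            String token = ID_TO_TOKEN.getOrDefault(id, "[UNK]");
            // Simplifying assumption: any token of the form [XYZ] is special.
            boolean special = token.startsWith("[") && token.endsWith("]");
            if (special && skipSpecialTokens) {
                continue;
            }
            if (token.startsWith("##")) {
                // Continuation piece: strip the marker and attach to the previous word.
                text.append(token.substring(2));
            } else {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(token);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the integer list produced by the model's language model head.
        long[] predicted = {0, 2, 3, 4, 5, 1};
        System.out.println(decode(predicted, true));  // prints "hello tokenizer world"
        System.out.println(decode(predicted, false)); // prints "[CLS] hello tokenizer world [SEP]"
    }
}
```

The key point is that this path needs only the IDs and the vocabulary, whereas the encode-then-buildSentence route requires the original text as input.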
@mehmetcalikus I agree, we should expose this API. We are working on exposing more of the tokenizer API:
- decode
- setPadding
- setTruncation
Hi @mehmetcalikus - I'm currently working on exposing the decode and batch_decode functions in Java. I will keep this thread updated.
@mehmetcalikus The tokenizer.decode(...) method has been added to DJL. You should be able to pull it from the latest SNAPSHOT.
Let me know if you have any issues using it.
Hi @siddvenk firstly thanks for your quick reply,
I updated the version of the ai.djl package I use to the one linked here. In addition, I added the HuggingFace tokenizers dependency to the pom.xml of my Spring project, as shown below.
<dependency>
    <groupId>ai.djl.huggingface</groupId>
    <artifactId>tokenizers</artifactId>
    <version>0.18.0-SNAPSHOT</version>
</dependency>
But the decode method is still not available in the library. How can I solve this problem?
The changes I made are only available in version 0.19.0-SNAPSHOT. Try updating to that version and let me know if that works.
Feel free to reopen the issue if you still have questions.