`char_to_token` in `keras_nlp.tokenizers.Tokenizer`

Open chenmoneygithub opened this issue 2 years ago • 9 comments

`char_to_token` is a method that converts a character index into the index of the token that contains it. See HuggingFace's `char_to_token` method for reference.

This is useful in span classification tasks, such as SQuAD, where we need to know the indices of the start token and end token of the answer span. For example, given:

  • context: "Suggest an idea for this project. If this doesn’t look right, choose a different type."
  • answer: "an idea"

Then we have the following character indices (1-based, inclusive):

  • start_index: 9
  • end_index: 15

Assuming a basic whitespace tokenizer, in token space (also 1-based) we have:

  • start_index: 2
  • end_index: 3

We need an API to do this conversion.
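To make the conversion concrete, here is a minimal sketch in plain Python, assuming a whitespace tokenizer. The helper names `whitespace_token_spans` and `char_span_to_token_span` are hypothetical and not part of `keras_nlp`. Note that the sketch uses 0-based indices with an exclusive character end, so its results are shifted by one relative to the 1-based numbers above.

```python
import re


def whitespace_token_spans(text):
    """Return (start, end) character offsets of each whitespace-delimited token.

    Offsets are 0-based and the end offset is exclusive.
    """
    return [match.span() for match in re.finditer(r"\S+", text)]


def char_span_to_token_span(text, char_start, char_end):
    """Map a character span to the (first, last) token indices it overlaps.

    `char_start` is 0-based and `char_end` is exclusive. The returned token
    indices are 0-based and inclusive; (None, None) means no token overlaps.
    """
    first_token = last_token = None
    for index, (tok_start, tok_end) in enumerate(whitespace_token_spans(text)):
        # A token overlaps the span if the two half-open intervals intersect.
        if tok_start < char_end and tok_end > char_start:
            if first_token is None:
                first_token = index
            last_token = index
    return first_token, last_token


context = (
    "Suggest an idea for this project. "
    "If this doesn’t look right, choose a different type."
)
answer = "an idea"

char_start = context.find(answer)      # 0-based start of the answer
char_end = char_start + len(answer)    # exclusive end of the answer
token_span = char_span_to_token_span(context, char_start, char_end)
```

For a subword tokenizer, the same idea would need the per-token character offsets produced by that tokenizer instead of the regex above.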

A side note: `char_to_token` is a strange name; let's think about what a better one could be.

chenmoneygithub avatar Aug 07 '22 00:08 chenmoneygithub

@chenmoneygithub So the API would take in the context, the answer, and a splitting scheme? And the task would be to find the subarray of the split context tokens that matches the split answer tokens?

aflah02 avatar Aug 08 '22 07:08 aflah02

Yes, the goal is to know which token is the start token and which one is the end token.

chenmoneygithub avatar Aug 08 '22 18:08 chenmoneygithub

I can work on this if no one else is taking this up!

aflah02 avatar Aug 08 '22 18:08 aflah02

As the issue has gone stale, can I take this up?

shivance avatar Feb 25 '23 18:02 shivance

@shivance, I just realised that this is something @TheAthleticCoder is working on (because he is working on the SQuAD example). Can you confirm, @TheAthleticCoder?

abheesht17 avatar Feb 26 '23 07:02 abheesht17

Yes! I am working on the SQuAD example. It is similar to this.

TheAthleticCoder avatar Feb 26 '23 08:02 TheAthleticCoder

Ah, okay. Thanks for confirming!

abheesht17 avatar Feb 27 '23 12:02 abheesht17

@chenmoneygithub Link: https://colab.research.google.com/drive/100XEO_quSz1meTDSAhojLNPoHdGTFnyO?usp=sharing Taking a context and an answer as inputs, and using the SQuAD dataset as an example, I return the start and end indices as lists covering every occurrence of the answer in the context. If the answer does not occur in the context, -1 is appended to the list of indices.

If this approach seems alright, we can adapt it and modify it further. Do let me know!
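For readers without access to the notebook, a rough sketch of this kind of approach might look like the following. It is not the code from the linked Colab: the function name `find_answer_token_spans` is hypothetical, and it assumes whitespace splitting, token-level matching, and 0-based inclusive indices.

```python
def find_answer_token_spans(context, answer):
    """Find every occurrence of the answer's tokens inside the context's tokens.

    Returns two parallel lists of 0-based token indices: the start and the
    (inclusive) end of each match. If there is no match, both lists hold -1.
    """
    context_tokens = context.split()
    answer_tokens = answer.split()
    width = len(answer_tokens)

    starts, ends = [], []
    if width > 0:
        # Slide a window of the answer's length across the context tokens.
        for i in range(len(context_tokens) - width + 1):
            if context_tokens[i:i + width] == answer_tokens:
                starts.append(i)
                ends.append(i + width - 1)

    if not starts:
        # Sentinel for "answer not found", as described in the comment above.
        starts.append(-1)
        ends.append(-1)
    return starts, ends


starts, ends = find_answer_token_spans("the cat saw the cat", "the cat")
missing = find_answer_token_spans("the cat saw the cat", "a dog")
```

One limitation of exact token matching with whitespace splitting is that punctuation stays attached to words, so an answer such as "project" would not match the context token "project.".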

TheAthleticCoder avatar Mar 12 '23 17:03 TheAthleticCoder

@chenmoneygithub a gentle reminder for review

TheAthleticCoder avatar Mar 17 '23 15:03 TheAthleticCoder