keras-nlp
`char_to_token` in `keras_nlp.tokenizers.Tokenizer`
`char_to_token` is a method that converts a character index into the corresponding token index. See the HuggingFace method here
This is useful in span classification tasks, such as SQuAD, where we need to know the indices of the start and end tokens of the answer span. For example, given:
- context: "Suggest an idea for this project. If this doesn’t look right, choose a different type."
- answer: "an idea"
Then we have the following character indices:
- start_index: 9
- end_index: 15
Assume we use a basic whitespace tokenizer; then, in token space, we have:
- start_index: 2
- end_index: 3
We need an API to do such conversion.
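As a rough illustration of the conversion being requested, here is a minimal sketch under the whitespace-tokenization assumption above. The function name mirrors the HuggingFace method but is not part of keras-nlp; note that this sketch uses 0-based indices, whereas the example above appears to count from 1.

```python
# Hypothetical sketch, not keras-nlp API: map a character index to the
# index of the token containing it, assuming whitespace tokenization.

def char_to_token(text, char_index):
    """Return the 0-based index of the token containing `char_index`,
    or None if that character is whitespace (inside no token)."""
    token_index = -1
    in_token = False
    for i, ch in enumerate(text):
        if ch.isspace():
            in_token = False
        elif not in_token:
            # A new token starts at this character.
            token_index += 1
            in_token = True
        if i == char_index:
            return None if ch.isspace() else token_index
    return None  # char_index is past the end of the text


context = "Suggest an idea for this project."
# "an" starts at character 8 (0-based) and is token 1 (0-based).
print(char_to_token(context, 8))   # token index of "an"
print(char_to_token(context, 11))  # token index of "idea"
```

A real implementation would hang off the tokenizer's offset information rather than re-scanning the string, but the contract is the same: character position in, token position out.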
A side note: `char_to_token` is a strange name; let's think about what a better one could be.
@chenmoneygithub So the API would take in the context, the answer, and a splitting scheme? And the task would be to find the matching subarray of the split answer tokens within the split context tokens?
Yes, the goal is to know which token is the start token and which is the end token.
I can work on this if no one else is taking this up!
As the issue has gone stale, can I take this up?
@shivance, I just realised that this is something @TheAthleticCoder is working on (because he is working on the SQuAD example). Can you confirm, @TheAthleticCoder?
Yes! I am working on the SQuAD example. It is similar to this.
Ah, okay. Thanks for confirming!
@chenmoneygithub Link: https://colab.research.google.com/drive/100XEO_quSz1meTDSAhojLNPoHdGTFnyO?usp=sharing Taking a context and an answer as inputs, and using the SQuAD dataset as an example, I return the start and end indices as a list covering every occurrence of the answer in the context. If the answer does not actually occur in the context, I append -1 to the list of indices.
If this approach seems alright, we can adapt it and modify it further. Do let me know!
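For readers following along, the approach described above can be sketched roughly as follows. The function name and the whitespace tokenization are illustrative assumptions, not the linked Colab's actual code; it returns all matching token spans, or `[-1]` when the answer never occurs.

```python
# Hedged sketch of the described approach: find every token span in the
# context whose tokens equal the answer's tokens; [-1] if none match.

def find_answer_token_spans(context, answer):
    context_tokens = context.split()
    answer_tokens = answer.split()
    n, m = len(context_tokens), len(answer_tokens)
    spans = []
    # Slide a window of m tokens over the context and compare.
    for start in range(n - m + 1):
        if context_tokens[start:start + m] == answer_tokens:
            spans.append((start, start + m - 1))  # inclusive end index
    return spans if spans else [-1]


context = "Suggest an idea for this project."
print(find_answer_token_spans(context, "an idea"))  # [(1, 2)]
```

Returning every occurrence (rather than just the first) matters for SQuAD-style data, where the answer string can appear more than once and only one occurrence is the labeled span.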
@chenmoneygithub a gentle reminder for review