
Add Token Classification, Text Summarisation, QA Examples

Open abheesht17 opened this issue 3 years ago • 3 comments

We can add a few examples:

  • Token Classification with BERT. **Dataset:** CoNLL 2003. **What's different?** Here, we have to classify every word into its NER type. However, since BERT tokenises text into subwords, we will have more tokens than labels (the number of labels equals the number of words). So, we have to assign to each subword the label of the word which spawned it. I think this will give users a good overview of how tokenisation in BERT is done.
  • Question Answering with BERT. **Dataset:** SQuAD or CoQA. **What's different?** Here, we have to assign to every token the probability of it being the starting token and the ending token of the answer span.
  • Text Summarisation (Abstractive). **Dataset:** CNN/Daily Mail. **What's different?** Not much; it is pretty similar to the NMT example which is already present.
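
The label-alignment step for token classification can be sketched roughly as follows. This is a minimal illustration, not KerasNLP code; `toy_tokenize` is a hypothetical stand-in for a real WordPiece tokenizer, which splits words into subwords in the same general way:

```python
def align_labels(words, labels, tokenize):
    """Assign each subword the label of the word that spawned it."""
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        subwords = tokenize(word)
        tokens.extend(subwords)
        # Every subword inherits its parent word's label, so the
        # label sequence stays aligned with the token sequence.
        token_labels.extend([label] * len(subwords))
    return tokens, token_labels

# Toy WordPiece-style tokenizer (hypothetical, for illustration only):
# long words are chopped into 4-character pieces prefixed with "##".
def toy_tokenize(word):
    if len(word) <= 4:
        return [word]
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]

words = ["John", "visited", "Washington"]
labels = ["B-PER", "O", "B-LOC"]
tokens, token_labels = align_labels(words, labels, toy_tokenize)
# "visited" and "Washington" split into multiple subwords, so we end
# up with 6 token labels for 3 word labels.
```

A real example would also need to handle special tokens such as `[CLS]` and `[SEP]` (typically labelled with an ignore index), but the per-word expansion above is the core idea.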
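
For the QA task, span extraction from the per-token start/end scores might look like this sketch. The logits here are made-up numbers, and the brute-force search is just for illustration (real implementations usually restrict to top-k candidates):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_answer_len=15):
    """Pick the (start, end) pair maximising P(start) * P(end),
    subject to start <= end and a maximum answer length."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_answer_len, len(p_end))):
            score = ps * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Hypothetical logits over a 6-token context: the model is most
# confident the answer starts at token 1 and ends at token 3.
start_logits = [0.1, 2.5, 0.3, 0.2, 0.1, 0.0]
end_logits = [0.0, 0.2, 0.1, 3.0, 0.2, 0.1]
span = best_span(start_logits, end_logits)
```

In training, these start/end positions are just two softmax classification heads over the token sequence; the decoding step above only matters at inference time.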

Let me know which of these tasks are worth adding, and I will add them. I understand this is a low-priority task, but it's taking me a while to understand the tokenizer code in the SentencePiece library, so I can take this up in the meantime.

abheesht17 avatar Mar 20 '22 16:03 abheesht17

These are things we would like to have, but are not things we will work on right now. Before this, we need to figure out our desired story for pretraining with the library (actively under discussion).

mattdangerw avatar Mar 21 '22 18:03 mattdangerw

Cool, makes sense 👍🏼

abheesht17 avatar Mar 21 '22 18:03 abheesht17

Token Classification: #754
SQuAD: #741

How about text summarisation with BART? I think the BART preprocessor is still WIP. cc: @mattdangerw

kanpuriyanawab avatar Feb 25 '23 19:02 kanpuriyanawab