Add Token Classification, Text Summarisation, QA Examples
We can add a few examples:
- **Token Classification with BERT.** **Dataset:** CoNLL 2003. **What's different?** Here, we have to classify every word into its NER type. However, since BERT tokenises text into subwords, we will have more tokens than labels (the number of labels equals the number of words). So, we have to assign every subword the label of the word that produced it (see the sketch after this list). I think this will give users a good overview of how tokenisation in BERT works.
- **Question Answering with BERT.** **Dataset:** SQuAD or CoQA. **What's different?** Here, we have to assign to every token the probability of it being the starting token and the ending token of the answer span (see the second sketch below).
- **Text Summarisation (Abstractive).** **Dataset:** CNN/Daily Mail. **What's different?** Not much; it is pretty similar to the NMT example which is already present.
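For the token-classification case, here is a minimal sketch of the label-alignment step, assuming a WordPiece-style tokenizer with a `tokenize(word)` method. The `ToyTokenizer` below is a hypothetical stand-in for illustration, not the real BERT tokenizer:

```python
def align_labels(words, word_labels, tokenizer):
    """Repeat each word's NER label over every subword it produces."""
    tokens, token_labels = [], []
    for word, label in zip(words, word_labels):
        subwords = tokenizer.tokenize(word)  # e.g. "flying" -> ["fly", "##ing"]
        tokens.extend(subwords)
        token_labels.extend([label] * len(subwords))
    return tokens, token_labels


class ToyTokenizer:
    """Toy splitter standing in for WordPiece, purely for illustration."""

    def tokenize(self, word):
        return [word[:4]] + ["##" + word[i : i + 4] for i in range(4, len(word), 4)]


words = ["John", "visited", "Paris"]
labels = ["B-PER", "O", "B-LOC"]
print(align_labels(words, labels, ToyTokenizer()))
# (['John', 'visi', '##ted', 'Pari', '##s'], ['B-PER', 'O', 'O', 'B-LOC', 'B-LOC'])
```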
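And for the QA case, a rough sketch of the span-prediction head, assuming we already have BERT's per-token sequence output; the shapes are placeholders I picked, not anything the library prescribes:

```python
from tensorflow import keras

seq_len, hidden_dim = 384, 768  # placeholder sizes, not from this issue
token_features = keras.Input(shape=(seq_len, hidden_dim))  # BERT sequence output

# One score per token for "answer starts here" and one for "answer ends here".
start_logits = keras.layers.Flatten()(keras.layers.Dense(1)(token_features))
end_logits = keras.layers.Flatten()(keras.layers.Dense(1)(token_features))

# Softmax over the sequence axis turns the scores into per-token
# probabilities of being the start / end of the answer span.
start_probs = keras.layers.Softmax()(start_logits)
end_probs = keras.layers.Softmax()(end_logits)

qa_head = keras.Model(token_features, [start_probs, end_probs])
qa_head.summary()
```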
Let me know which of these tasks are worth adding, and I will add them. I understand this is a low-priority task, but it's taking me a while to understand the tokenizer code in the SentencePiece library, so I can take this up in the meantime.
These are things we would like to have, but are not things we will work on right now. Before this, we need to figure out our desired story for pretraining with the library (actively under discussion).
Cool, makes sense 👍🏼
Token Classification: #754
SQuAD: #741
How about text summarisation with BART? I think the BART preprocessor is still WIP. cc: @mattdangerw