
Add a SQuAD example

Open mattdangerw opened this issue 2 years ago • 12 comments

This is a two-part issue, which will be a large time investment.

First, we would like to build a SQuAD evaluation example in /examples/squad_benchmark, based on our existing example in /examples/glue_benchmark. Second, we should publish an example on keras.io showing how to do SQuAD evaluation on a backbone.

We can start with writing the example in this repo.

Steps:

  • [ ] Add a squad.py file.
    • [ ] Load the SQuAD dataset via TFDS (rough sketch below).
    • [ ] Run SQuAD evaluation on a BERT backbone.
  • [ ] Add a README.md describing how to run the example.
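
For illustration, a rough sketch of those pieces, assuming the TFDS `squad` dataset name and the `bert_base_en_uncased` KerasNLP preset (a span-prediction head producing start/end logits would still need to be added on top of the backbone):

```python
import tensorflow_datasets as tfds
import keras_nlp

# Load the SQuAD v1.1 train and validation splits from TFDS.
squad_train, squad_val = tfds.load("squad", split=["train", "validation"])

# A pretrained BERT backbone from KerasNLP; extractive QA additionally needs
# a dense head over the context tokens to predict answer start/end positions.
backbone = keras_nlp.models.BertBackbone.from_preset("bert_base_en_uncased")
```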

mattdangerw avatar Feb 10 '23 18:02 mattdangerw

@TheAthleticCoder, would you like to take up this issue?

abheesht17 avatar Feb 10 '23 18:02 abheesht17

Yes! I would like to take up the issue

TheAthleticCoder avatar Feb 10 '23 19:02 TheAthleticCoder

@TheAthleticCoder thanks! Let us know if you have questions, this is a significant piece of work.

One resource is the original BERT squad script -> https://github.com/google-research/bert/blob/master/run_squad.py

There is a lot of input preprocessing we will need to do, as shown in that script; a rough sketch of the sliding-window part is below.
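
A big chunk of that preprocessing is the "doc stride" trick: long contexts are split into overlapping windows so every token appears in at least one feature. A simplified plain-Python sketch of that windowing (names and defaults are illustrative, not copied from run_squad.py):

```python
def make_doc_spans(num_tokens, max_tokens_per_span=384, doc_stride=128):
    """Return (start, length) pairs of overlapping windows covering a long context."""
    spans = []
    start = 0
    while start < num_tokens:
        length = min(max_tokens_per_span, num_tokens - start)
        spans.append((start, length))
        if start + length >= num_tokens:
            break
        start += doc_stride
    return spans

# e.g. make_doc_spans(1000) -> [(0, 384), (128, 384), (256, 384), ...]
```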

mattdangerw avatar Feb 10 '23 19:02 mattdangerw

Hey, so I was using these references and noticed that since this is span-based labelling, I will need to handle character-to-token offsets, as seen here: https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py#L386 and https://github.com/google-research/bert/blob/master/run_squad.py#L242. Should I do this using TensorFlow ops, or can I use standard Python objects along with tf.py_function?
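
For reference, the offset bookkeeping itself is just a search over per-token character spans. A hypothetical pure-Python helper, assuming a `token_offsets` list of `(char_start, char_end)` pairs that the tokenizer would have to provide:

```python
def char_span_to_token_span(token_offsets, answer_start, answer_end):
    """Map a character-level answer span to token-level start/end indices."""
    start_token, end_token = None, None
    for i, (tok_start, tok_end) in enumerate(token_offsets):
        # First token whose character span contains the answer's first character.
        if start_token is None and tok_start <= answer_start < tok_end:
            start_token = i
        # Last token whose character span contains the answer's final character.
        if tok_start < answer_end <= tok_end:
            end_token = i
    return start_token, end_token

# Tokens covering chars [(0, 3), (4, 9), (10, 16)] and an answer spanning
# chars [4, 9) map to token span (1, 1).
print(char_span_to_token_span([(0, 3), (4, 9), (10, 16)], 4, 9))  # (1, 1)
```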

TheAthleticCoder avatar Feb 20 '23 11:02 TheAthleticCoder

I think for now we can forgo worrying about TensorFlow ops. Let's focus on a solution that is concise and readable.

Probably for now we can either:

  • Compute the preprocessed dataset in pure Python, then convert it to a tf.data.Dataset before training (a rough sketch is below).
  • Compute the preprocessed dataset with tf.data and tf.py_function, then call dataset.cache() on the dataset before calling fit().

Either seems fine! I would go with whatever is most clear and readable for now.

Eventually, we should have a solution for offsets that is tf op friendly and baked into our library, but I think it makes sense to do that as a follow up. We can use this example to inform our API design down the road.
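
A minimal sketch of the first option, assuming the Python preprocessing has already produced packed token ids plus start/end token labels (the feature names and the 384 sequence length are placeholders, not the final squad.py layout):

```python
import numpy as np
import tensorflow as tf

# Option 1: do tokenization and offset mapping in pure Python, collect plain
# arrays, then wrap them in a tf.data.Dataset for fit().
num_examples, seq_length = 8, 384  # placeholder sizes
token_ids = np.zeros((num_examples, seq_length), dtype="int32")    # packed question + context
padding_mask = np.ones((num_examples, seq_length), dtype="int32")
start_labels = np.zeros((num_examples,), dtype="int32")            # answer start token index
end_labels = np.zeros((num_examples,), dtype="int32")              # answer end token index

train_ds = (
    tf.data.Dataset.from_tensor_slices(
        ({"token_ids": token_ids, "padding_mask": padding_mask},
         (start_labels, end_labels))
    )
    .shuffle(2048)
    .batch(4)
)
# model.fit(train_ds) can then consume the fully preprocessed data.
```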

mattdangerw avatar Feb 22 '23 21:02 mattdangerw

Hey! I would like to take this issue up

jayam30 avatar Feb 23 '23 15:02 jayam30

Hey! I have handled the dataset part. Please check it out here: SQuAD

If there are any changes to be made, do let me know. cc: @abheesht17 @mattdangerw

TheAthleticCoder avatar Mar 09 '23 21:03 TheAthleticCoder

@TheAthleticCoder are you still planning to work on it?

shivance avatar Jul 29 '23 14:07 shivance

@shivance Hey, no, I don't think I'll be able to find time to do it :( You can take it up 👍🏻

TheAthleticCoder avatar Jul 29 '23 19:07 TheAthleticCoder

Seems like an example already exists here - Keras Examples - Text Extraction with BERT

pri1311 avatar Aug 03 '23 23:08 pri1311

@abheesht17 can you assign this issue to me? if there is no one assigned.

abuelnasr0 avatar Aug 03 '23 23:08 abuelnasr0

Sure, @abuelnasr0. Assigned it to you, have fun!

@pri1311 - thanks for the pointer, will be extremely helpful for @abuelnasr0 when he tries using KerasNLP blocks for writing the example!

abheesht17 avatar Aug 04 '23 16:08 abheesht17