Allow preparing a GLUE submission from BERT finetuning script
To actually run a GLUE submission (including tasks with private labels), we need to output a number of .tsv files, one per GLUE task. See the first item here: https://gluebenchmark.com/faq
We should allow our BERT finetuning script to do this file writing for us. We should add an output argument for a predictions file, run predictions on the test set, and use csv to write the correctly formatted output as described in the GLUE FAQ.
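As a rough sketch of the writing step (the helper and argument names below are placeholders, and the exact per-task label formatting would still need to follow the GLUE FAQ), something like:

```python
import csv

# Placeholder sketch: `test_preds` would be the predicted class index for each
# test example, and `idx_to_label` an optional mapping back to the task's
# original label strings. Neither name exists in the script yet.
def write_glue_submission_tsv(output_path, test_preds, idx_to_label=None):
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["index", "prediction"])
        for index, pred in enumerate(test_preds):
            label = idx_to_label[pred] if idx_to_label else pred
            writer.writerow([index, label])
```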
@Stealth-py would you like to take a look at this too? It might make sense to work on this in tandem with https://github.com/keras-team/keras-nlp/issues/114
Yeah, I can work on this!
@Stealth-py thanks!
Sorry for the delay on this issue. Still a bit wrapped up with college stuff, will try to finish this ASAP!
No worries at all! Figured this would be sequential after #114
Thanks for all the work here.
Hey @mattdangerw, I think I spotted an issue when running the glue/mnli task with the glue_finetuning script.
I was trying to see what kind of predictions the model returns, so I ran the mnli task, but evaluation failed with an error about the loss function being used. Not sure why this happened, so I tried switching to another loss function (binary_crossentropy), which seemed to work just fine, but training would have taken a long time so I had to force-close it.
The error:
Node: 'sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits' Received a label value of -1 which is outside the valid range of [0, 3). Label values: -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 [[{{node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}}]] [Op:__inference_test_function_12582]
@Stealth-py hey! A few things here...
First, just FYI, we had a big bug in our finetuning script where we would output logits to a loss that was expecting softmax probs :). So the loss was not really going anywhere reasonable. @chenmoneygithub has a fix which should be in shortly https://github.com/keras-team/keras-nlp/pull/176/files
The error you are seeing was just discussed a bit in https://github.com/keras-team/keras-nlp/issues/175. Basically, those -1 labels are expected. The test sets of these GLUE datasets have private labels, meaning the only way we can really evaluate test set performance is to prepare a GLUE submission with our predicted values for those labels.
Preparing that submission is what we can do on this issue, but we probably also want to clean up the line here somehow. It really only makes sense to call evaluate when the labels for your given task are not private.
Ah, yeah, that should be it. Makes sense; it had me wondering why it seemed to work with binary crossentropy instead. I looked at that thread after making the comment xD, which is when I realized.
That's true, yeah. So basically we get the predictions on the test dataset using model.predict(), and then call model.evaluate() only if there aren't any -1 labels in the test dataset? Like, there aren't any -1 label values in glue/mrpc, so we can call evaluate, but that's not the case for datasets like glue/sst2 and glue/mnli. So iterating through the dataset and checking whether there are any -1 labels should do it, I suppose.
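For reference, a minimal version of that check (assuming the test split is a tf.data.Dataset of (features, label) batches, as with the TFDS GLUE data) could look like:

```python
import tensorflow as tf

def has_private_labels(test_ds):
    """Return True if any label in the dataset is -1, i.e. withheld."""
    for _, labels in test_ds:
        if tf.reduce_any(tf.equal(labels, -1)):
            return True
    return False
```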
Yeah, maybe we do something like this...
- Remove the `do_evaluation` flag. If the test data has real labels, run evaluate automatically; if not, skip it.
- Add an `output_tsv_file` flag. For all tasks, if set, generate a tsv in the correct format for a GLUE submission (rough sketch below).
- Add a shell script blurb to the readme, showing how to iterate over every task, create a tsv, and then zip the tsvs to submit on the GLUE home page.
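Roughly, the resulting flow could look something like this sketch (all the names here are placeholders tying together the snippets above, not code that exists in the script):

```python
# `output_tsv_file` is the flag proposed above; `load_test_split`,
# `has_private_labels`, and `write_glue_submission_tsv` are hypothetical
# helpers like the ones sketched earlier in this thread.
test_ds = load_test_split(task_name)

if not has_private_labels(test_ds):
    # Labels are real, so evaluation is meaningful.
    model.evaluate(test_ds)

if output_tsv_file:
    # Turn the model's output scores into class indices.
    preds = model.predict(test_ds).argmax(axis=-1)
    write_glue_submission_tsv(output_tsv_file, preds)
```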
Does that make sense to you?
Yeah. And, for the evaluation part, I think we can just hard-code the datasets where the labels are private. I suppose that would be better than iterating through each dataset every time.
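Something like this minimal sketch, for example (sst2 and mnli are just the tasks mentioned above; the full membership would need to be checked against the TFDS splits):

```python
# Hypothetical hard-coded set; exact membership should be verified against
# the actual TFDS GLUE test splits.
TASKS_WITH_PRIVATE_TEST_LABELS = {
    "sst2",
    "mnli",
    # ...other tasks whose TFDS test split only contains -1 labels
}

def should_run_evaluate(task_name):
    """Only call evaluate when the task's test labels are publicly available."""
    return task_name not in TASKS_WITH_PRIVATE_TEST_LABELS
```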