Allow preparing a GLUE submission from BERT finetuning script
To actually run a GLUE submission (including tasks with private labels), we need to output a number of .tsv files, one per GLUE task. See the first item here: https://gluebenchmark.com/faq
We should allow our BERT finetuning script to do this file writing for us. We should add an output argument for a predictions file, run predictions on the test set, and use csv to write the correctly formatted output as described in the GLUE FAQ.
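As a rough sketch of the writing step (the helper and argument names below are placeholders, and the exact per-task label formatting would still need to follow the GLUE FAQ), something like:

```python
import csv

# Placeholder sketch: `test_preds` would be the predicted class index for each
# test example, and `idx_to_label` an optional mapping back to the task's
# original label strings. Neither name exists in the script yet.
def write_glue_submission_tsv(output_path, test_preds, idx_to_label=None):
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["index", "prediction"])
        for index, pred in enumerate(test_preds):
            label = idx_to_label[pred] if idx_to_label else pred
            writer.writerow([index, label])
```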
@Stealth-py would you like to take a look at this too? It might make sense to work on this in tandem with https://github.com/keras-team/keras-nlp/issues/114
Yeah, I can work on this!
@Stealth-py thanks!
Sorry for the delay on this issue. Still a bit wrapped up with college stuff, will try to finish this ASAP!
No worries at all! Figured this would be sequential after #114
Thanks for all the work here.
Hey @mattdangerw, I think I spotted an issue when running the glue/mnli task with the glue_finetuning script.
I was trying to see what kind of predictions the model returns, so I ran the mnli task, but evaluation failed with an error about the loss function being used. Not sure why this happened, so I tried switching to another loss function (binary_crossentropy), which seemed to work just fine, but training would have taken a long time so I had to force-close it.
The error:
Node: 'sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits' Received a label value of -1 which is outside the valid range of [0, 3). Label values: -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 [[{{node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}}]] [Op:__inference_test_function_12582]
@Stealth-py hey! A few things here...
First, just FYI, we had a big bug in our finetuning script where we would output logits to a loss that was expecting softmax probs :). So the loss was not really going anywhere reasonable. @chenmoneygithub has a fix which should be in shortly https://github.com/keras-team/keras-nlp/pull/176/files
The error you are seeing was just discussed a bit in https://github.com/keras-team/keras-nlp/issues/175. Basically, those -1 labels are expected. The test sets of these GLUE datasets have private labels, meaning the only way we can really evaluate test set performance is to prepare a GLUE submission with our predicted values for those labels.
Preparing that submission is what we can do on this issue, but we probably also want to clean up the line here somehow. It really only makes sense to call evaluate when the labels for your given task are not private.
Ah, yeah, that should be it. Makes sense; it had me wondering why it seemed to work with binary crossentropy instead. I looked at that thread after making the comment xD, which is when I realized.
That's true, yeah. So basically we get the predictions on the test dataset using model.predict(), and then call model.evaluate() only if there aren't any -1 labels in the test dataset? Like, there aren't any -1 label values in glue/mrpc, so we can call evaluate, but that's not the case for datasets like glue/sst2 and glue/mnli. So iterating through the dataset and checking whether there are any -1 labels should do it, I suppose.
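For reference, a minimal version of that check (assuming the test split is a tf.data.Dataset of (features, label) batches, as with the TFDS GLUE data) could look like:

```python
import tensorflow as tf

def has_private_labels(test_ds):
    """Return True if any label in the dataset is -1, i.e. withheld."""
    for _, labels in test_ds:
        if tf.reduce_any(tf.equal(labels, -1)):
            return True
    return False
```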
Yeah, maybe we do something like this...
- Remove the `do_evaluation` flag. If the test data has real labels, run evaluate automatically; if not, skip it.
- Add an `output_tsv_file` flag. For all tasks, if set, generate a tsv in the correct format for a GLUE submission (rough sketch below).
- Add a shell script blurb to the readme, showing how to iterate over every task, create a tsv, and then zip the tsvs to submit on the GLUE home page.
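Roughly, the resulting flow could look something like this sketch (all the names here are placeholders tying together the snippets above, not code that exists in the script):

```python
# `output_tsv_file` is the flag proposed above; `load_test_split`,
# `has_private_labels`, and `write_glue_submission_tsv` are hypothetical
# helpers like the ones sketched earlier in this thread.
test_ds = load_test_split(task_name)

if not has_private_labels(test_ds):
    # Labels are real, so evaluation is meaningful.
    model.evaluate(test_ds)

if output_tsv_file:
    # Turn the model's output scores into class indices.
    preds = model.predict(test_ds).argmax(axis=-1)
    write_glue_submission_tsv(output_tsv_file, preds)
```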
Does that make sense to you?
Yeah. And, for the evaluation part, I think we can just hard-code the datasets where the labels are private. I suppose that would be better than iterating through each dataset every time.
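Something like this minimal sketch, for example (sst2 and mnli are just the tasks mentioned above; the full membership would need to be checked against the TFDS splits):

```python
# Hypothetical hard-coded set; exact membership should be verified against
# the actual TFDS GLUE test splits.
TASKS_WITH_PRIVATE_TEST_LABELS = {
    "sst2",
    "mnli",
    # ...other tasks whose TFDS test split only contains -1 labels
}

def should_run_evaluate(task_name):
    """Only call evaluate when the task's test labels are publicly available."""
    return task_name not in TASKS_WITH_PRIVATE_TEST_LABELS
```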