
Add NLP-specific metrics

Open abheesht17 opened this issue 3 years ago • 16 comments

@mattdangerw and the keras-nlp team:

For standard classification metrics (AUC, F1, Precision, Recall, Accuracy, etc.), keras.metrics can be used. But there are several NLP-specific metrics that could be implemented here; that is, we can expose native APIs for them.

I would like to take this up. I can start with the popular ones first and open PRs. Let me know if this is something the team is looking to add!

I've listed a few metrics (this list is, by no means, comprehensive):

  • Perplexity

  • ROUGE (paper). A pretty standard metric for text generation; we can implement all variations: ROUGE-N, ROUGE-L, ROUGE-W, etc.

  • BLEU (paper). Another standard text-generation metric. Note: we can also implement SacreBLEU.

  • BERTScore (paper, code)

  • BLEURT (paper, code)

  • chrF and chrF++ (character n-gram F-score) (paper, code)

  • COMET (paper, code)

  • Character Error Rate, Word Error Rate, etc. (paper)

  • Pearson Coefficient and Spearman Coefficient. keras.metrics does not seem to have these two. They are not NLP-specific metrics, so implementing them in Keras may make more sense than implementing them here.
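For concreteness, perplexity (the first item above) is just the exponential of the average negative log-likelihood per token. A minimal pure-Python sketch, where the function name is illustrative and not a proposed API:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean log-probability per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is, on average, "choosing uniformly among 4 options" per token.
print(perplexity([math.log(0.25)] * 10))
```

In a Keras metric this would presumably be computed from the running cross-entropy rather than raw probabilities, but the math is the same.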

Thank you!

abheesht17 avatar Mar 09 '22 04:03 abheesht17

@abheesht17 nice to meet you and thanks for opening this! Metrics are definitely something we are interested in building in keras-nlp, and we would be very happy to have you contribute.

We don't have any metrics to date, so I think it might help to identify one of these to start with that we feel is particularly well-defined and popular (perhaps BLEU?).

From there it would be helpful to iterate on a design before a PR. A good way to do this might be to create a colab showing a possible interface and use cases (including an end-to-end training example). Then we can all discuss the interface, and convert to a PR from there.

Does that sound good to you?

cc @fchollet @chenmoneygithub for thoughts

mattdangerw avatar Mar 10 '22 06:03 mattdangerw

@mattdangerw, sounds good to me! I'll share a Colab notebook soon 👍🏼

abheesht17 avatar Mar 10 '22 07:03 abheesht17

@abheesht17 I'd suggest adding perplexity as well, since it's one of the trickier metrics to use. In my experience, different implementations across existing libraries often give inconsistent results that can vary by orders of magnitude.

aflah02 avatar Mar 13 '22 12:03 aflah02

@aflah02, good point. Will do!

abheesht17 avatar Mar 13 '22 13:03 abheesht17

@mattdangerw , here is the notebook for perplexity: https://colab.research.google.com/drive/1BH1lTw_qLK6671oWaoU15IrUKSQfd6oE?usp=sharing.

I'm experiencing some difficulty with the BLEU score, hence I shared the perplexity notebook first. Please go through it and let me know if you have any suggestions!

Difficulty with the BLEU score: is there any data structure in TensorFlow that resembles a dictionary with tensors as keys? I want a dictionary with n-grams as keys and the frequency of each n-gram as the value. I can't use a Python dictionary, both because AutoGraph doesn't allow it (I believe) and because tensors are unhashable, so they can't be used as keys. I tried MutableHashTable, but it does not accept tensors as keys. The alternative is to store the tensor reference as the key, but then the lookup operation becomes O(n).
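As a plain-Python reference for what the graph-mode code ultimately needs to compute: tuples of tokens are hashable, so in eager mode the clipped n-gram counts at the heart of BLEU's modified precision are straightforward with collections.Counter. This is only a sketch of the computation, not a graph-compatible implementation, and the function names are illustrative:

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Tuples of tokens are hashable, so they work directly as Counter keys.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(reference, candidate, n):
    ref_counts = ngram_counts(reference, n)
    cand_counts = ngram_counts(candidate, n)
    # Clip each candidate n-gram's count by its count in the reference,
    # so repeating a reference word doesn't inflate the score.
    overlap = sum(min(c, ref_counts[ng]) for ng, c in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0
```

For example, with reference "the cat is on the mat" and candidate "the the the the the the the", the unigram modified precision is 2/7, since "the" appears only twice in the reference.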

I'll get started with ROUGE, meanwhile.

abheesht17 avatar Mar 14 '22 09:03 abheesht17

Thanks very much! I will carve out some time to look over this soon!

One useful resource on perplexity, if you haven't seen it, is the Hugging Face guide: https://huggingface.co/docs/transformers/perplexity

I don't think the notebook you shared will calculate a correct perplexity for a fixed window, but I'm not sure that's feasible as a live metric during training anyway.

mattdangerw avatar Mar 17 '22 01:03 mattdangerw

Yeah, correct. It won't work for a fixed window. If we want a fixed window, we could add a separate function and ask the user to pass the model and the tokenised input of the whole corpus (or at least reasonably sized chunks of it, if the corpus is too big to fit in memory).
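For reference, the strided fixed-window evaluation described in the Hugging Face guide boils down to chunking the corpus so that each token is scored exactly once, with earlier tokens in each window serving only as context. A rough index-only sketch (window and stride values are illustrative):

```python
def sliding_windows(num_tokens, window, stride):
    """Yield (start, end, num_scored) spans over a token sequence.

    Windows overlap by (window - stride) tokens of context, but only the
    tokens not covered by the previous window are actually scored, so
    every token contributes to the loss exactly once.
    """
    prev_end = 0
    for start in range(0, num_tokens, stride):
        end = min(start + window, num_tokens)
        yield start, end, end - prev_end
        prev_end = end
        if end == num_tokens:
            break

# E.g. 10 tokens, window 4, stride 2: spans (0,4), (2,6), (4,8), (6,10),
# scoring 4 + 2 + 2 + 2 = 10 tokens in total.
print(list(sliding_windows(10, 4, 2)))
```

The model-calling and loss-averaging parts would sit on top of this, which is why it fits a standalone evaluation function better than a live training metric.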

I checked a few scripts. Most of them just take the exponent of the loss at the end of every epoch or at the end of training: https://github.com/huggingface/transformers/search?p=2&q=perplexity

So, let me know what the best course of action is.

abheesht17 avatar Mar 17 '22 03:03 abheesht17

Also, let me know if there is a way to circumvent the difficulty I am currently facing with BLEU Score.

abheesht17 avatar Mar 17 '22 03:03 abheesht17

@mattdangerw, here are rough implementations of ROUGE-L: https://colab.research.google.com/drive/1xchVi4DsG_2gfi8FmO-g9GZS6-7CIFnz?usp=sharing and ROUGE-N: https://colab.research.google.com/drive/1jurZQeHH760TyOHkjjae-eqkjpC7qSTc?usp=sharing.

I checked that the code works in graph mode by adding @tf.function, so ideally it should work during training as well. I'll add a training example tomorrow (probably the NMT example in the repo) to verify again.

Let me know if there are any changes required. Thanks!
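For reference, the core of ROUGE-L is a longest-common-subsequence dynamic program over the two token sequences. This is an eager pure-Python sketch of the sentence-level recurrence, not the notebook's graph-mode code:

```python
def lcs_length(ref, hyp):
    """Classic O(len(ref) * len(hyp)) LCS dynamic program."""
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, hypothesis):
    """Sentence-level ROUGE-L F1 from LCS precision and recall."""
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hypothesis)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For example, with reference "the cat sat on the mat" and hypothesis "the cat on the mat", the LCS has length 5, giving precision 1.0, recall 5/6, and F1 = 10/11.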

abheesht17 avatar Mar 20 '22 18:03 abheesht17

The only issue is that the lookup operation is O(n): when I add an n-gram to a TensorArray, I have to iterate over all its elements to check whether the n-gram is already present. I couldn't find a hash-map-like data structure in TensorFlow that takes tensors as keys.

abheesht17 avatar Mar 21 '22 17:03 abheesht17

Added rough training examples for ROUGE-L and ROUGE-N.

abheesht17 avatar Mar 21 '22 17:03 abheesht17

Hey, @mattdangerw! Any thoughts on this? These are the three notebooks (rough implementations):

Perplexity: https://colab.research.google.com/drive/1BH1lTw_qLK6671oWaoU15IrUKSQfd6oE?usp=sharing
ROUGE-L: https://colab.research.google.com/drive/1xchVi4DsG_2gfi8FmO-g9GZS6-7CIFnz?usp=sharing
ROUGE-N: https://colab.research.google.com/drive/1jurZQeHH760TyOHkjjae-eqkjpC7qSTc?usp=sharing

Thanks!

abheesht17 avatar Mar 23 '22 14:03 abheesht17

Thanks! Sorry was actually out sick this week :/ so catching up now.

Let's open up separate issues for perplexity, ROUGE, and BLEU, as ideally those are all components we would like. Tracking everything on this issue will get tricky.

I think perplexity is the correct one to start with.

mattdangerw avatar Mar 25 '22 19:03 mattdangerw

No issues, @mattdangerw! Take care :)

abheesht17 avatar Mar 25 '22 19:03 abheesht17

OK issues split out. I think that list of three is probably a good set to work through for now, but once we work through those we can open up issues for further metrics.

Let's focus on perplexity as a first example metric.

mattdangerw avatar Mar 25 '22 21:03 mattdangerw

Great! Thank you, @mattdangerw!

abheesht17 avatar Mar 25 '22 21:03 abheesht17