Add NLP-specific metrics
@mattdangerw and the keras-nlp team:
For standard classification metrics (AUC, F1, Precision, Recall, Accuracy, etc.), `keras.metrics` can be used. But there are several NLP-specific metrics which could be implemented here, i.e., we could expose native APIs for them.
I would like to take this up. I can start with the popular ones first and open PRs. Let me know if this is something the team is looking to add!
I've listed a few metrics (this list is, by no means, comprehensive):
- Perplexity
- ROUGE (paper): a pretty standard metric for text generation. We can implement all variations: ROUGE-N, ROUGE-L, ROUGE-W, etc.
- BLEU (paper): another standard text generation metric. Note: we can also implement SacreBLEU.
- Character Error Rate, Word Error Rate, etc. (paper); a rough sketch using `tf.edit_distance` follows this list.
- Pearson Coefficient and Spearman Coefficient: looks like `keras.metrics` does not have these two. They are not NLP-specific metrics, so implementing them in Keras itself may be better than implementing them here.
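For CER/WER specifically, TensorFlow already has `tf.edit_distance` (Levenshtein distance over `SparseTensor`s), which could be a starting point. A rough sketch, not a final interface; the function name is mine, and it assumes word ids start at 1, since `tf.sparse.from_dense` drops zeros as implicit padding:

```python
import tensorflow as tf

def word_error_rate(truth_ids, hyp_ids):
    # truth_ids / hyp_ids: 1-D int64 tensors of word ids for one example.
    # Ids are assumed to start at 1; from_dense treats zeros as missing.
    truth = tf.sparse.from_dense(tf.expand_dims(truth_ids, 0))
    hyp = tf.sparse.from_dense(tf.expand_dims(hyp_ids, 0))
    # normalize=True divides the edit distance by the truth length.
    return tf.edit_distance(hyp, truth, normalize=True)[0]

truth = tf.constant([1, 2, 3, 4], tf.int64)
hyp = tf.constant([1, 2, 5, 4], tf.int64)
print(word_error_rate(truth, hyp))  # ~0.25: one substitution over 4 words
```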
Thank you!
@abheesht17 nice to meet you and thanks for opening this! Metrics are definitely something we are interested in building in keras-nlp, and we would be very happy to have you contribute.
We don't have any metrics to date, so I think it might help to start with one of these that we feel is particularly well-defined and popular (perhaps BLEU?).
From there it would be helpful to iterate on a design before a PR. A good way to do this might be to create a colab showing a possible interface and use cases (including an end-to-end training example). Then we can all discuss the interface, and convert to a PR from there.
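For example, something along these lines could work as a starting point. Purely illustrative, just to seed the discussion; the class name, arguments, and the exp-of-mean-cross-entropy formulation are all up for debate:

```python
import tensorflow as tf

class Perplexity(tf.keras.metrics.Metric):
    """Illustrative sketch: perplexity as exp of the running mean cross-entropy."""

    def __init__(self, from_logits=False, name="perplexity", **kwargs):
        super().__init__(name=name, **kwargs)
        self.from_logits = from_logits
        # Track the running mean per-token cross-entropy; perplexity is its exp.
        self._mean_loss = tf.keras.metrics.Mean(name="mean_crossentropy")

    def update_state(self, y_true, y_pred, sample_weight=None):
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            y_true, y_pred, from_logits=self.from_logits
        )
        self._mean_loss.update_state(loss, sample_weight=sample_weight)

    def result(self):
        return tf.exp(self._mean_loss.result())

    def reset_state(self):
        self._mean_loss.reset_state()
```

The nice property of a `keras.metrics.Metric` subclass is that it would drop straight into `model.compile(metrics=[...])` for the end-to-end training example.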
Does that sound good to you?
cc @fchollet @chenmoneygithub for thoughts
@mattdangerw, sounds good to me! I'll share a Colab notebook soon 👍🏼
@abheesht17 I'd suggest adding perplexity as well, as it's one of the trickier metrics to use. In my experience, it often gives inconsistent results that vary by orders of magnitude across different libraries' implementations.
@aflah02, good point. Will do!
@mattdangerw , here is the notebook for perplexity: https://colab.research.google.com/drive/1BH1lTw_qLK6671oWaoU15IrUKSQfd6oE?usp=sharing.
I'm having some difficulty with the BLEU score, hence I shared the perplexity notebook first. Please go through it and let me know if you have any suggestions!
Difficulty with the BLEU score: in TensorFlow, is there any data structure that resembles a dictionary with tensors as keys? I want a mapping with n-grams as keys and the frequency of each n-gram as values. I can't use a Python dictionary, both because AutoGraph doesn't allow it (I believe) and because tensors are unhashable, so they can't be keys. I tried using MutableHashTable, but it does not accept tensors as keys either. The alternative is to store the tensor reference as the key, but the lookup operation becomes O(n) that way.
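One workaround I'm considering: join each n-gram's tokens into a single string, so `tf.unique_with_counts` can do the counting with no dictionary at all. A rough sketch (the function name is mine, and it assumes the tokens don't contain the separator):

```python
import tensorflow as tf

@tf.function
def ngram_counts(tokens, n):
    # tokens: a 1-D string tensor, e.g. ["the", "cat", "sat", "on", "the", "cat"].
    # Join each n-gram into one string: ["the cat", "cat sat", ...].
    ngrams = tf.strings.ngrams(tokens, ngram_width=n, separator=" ")
    # Unique n-grams and their frequencies, all inside the graph.
    unique_ngrams, _, counts = tf.unique_with_counts(ngrams)
    return unique_ngrams, counts
```

The clipped counts for BLEU could then come from matching `unique_ngrams` between the candidate and the references, rather than from dictionary lookups.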
I'll get started with ROUGE, meanwhile.
Thanks very much! I will carve out some time to look over this soon!
One useful resource on perplexity, if you haven't seen it: https://huggingface.co/docs/transformers/perplexity
I don't think the notebook you are sharing will calculate a correct perplexity for a fixed window. But I'm not sure that is feasible as a live metric during training.
Yeah, correct, it won't work for a fixed window. If we want a fixed window, we could make a separate function and ask the user to pass the model and the tokenised input (of the whole corpus, or at least reasonably sized chunks of it, if the corpus is too big to fit in memory?).
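Something like this, perhaps. A very rough sketch: `model` is assumed to be a callable returning per-token logits, all names are placeholders, and it uses non-overlapping windows rather than the sliding window from the HF guide:

```python
import math
import tensorflow as tf

def fixed_window_perplexity(model, token_ids, window=512):
    # token_ids: 1-D int tensor holding the tokenised corpus.
    total_nll, total_tokens = 0.0, 0
    n = int(tf.size(token_ids))
    for start in range(0, n - 1, window):
        ids = token_ids[start : start + window + 1]
        inputs = tf.expand_dims(ids[:-1], 0)   # (1, win)
        targets = tf.expand_dims(ids[1:], 0)   # (1, win), next-token labels
        logits = model(inputs)                 # (1, win, vocab)
        nll = tf.keras.losses.sparse_categorical_crossentropy(
            targets, logits, from_logits=True
        )
        total_nll += float(tf.reduce_sum(nll))
        total_tokens += int(tf.size(nll))
    # Perplexity is exp of the mean per-token negative log-likelihood.
    return math.exp(total_nll / total_tokens)
```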
I checked a few scripts. Most of them just exponentiate the loss at the end of every epoch / end of training: https://github.com/huggingface/transformers/search?p=2&q=perplexity
So, let me know what the best course of action is.
Also, let me know if there is a way to circumvent the difficulty I'm currently facing with the BLEU score.
@mattdangerw, here are rough implementations of ROUGE-L: https://colab.research.google.com/drive/1xchVi4DsG_2gfi8FmO-g9GZS6-7CIFnz?usp=sharing and ROUGE-N: https://colab.research.google.com/drive/1jurZQeHH760TyOHkjjae-eqkjpC7qSTc?usp=sharing.
I checked that the code works in graph mode by adding `@tf.function`, so ideally it should work during training as well. I'll add a training example tomorrow (probably the NMT example in the repo) to verify.
Let me know if there are any changes required. Thanks!
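For reference, here is the computation the ROUGE-L notebook implements, as a plain-Python sketch (helper names are mine); the hard part is doing the same thing in graph mode:

```python
def lcs_length(ref, hyp):
    # Classic dynamic-programming longest-common-subsequence length.
    m, n = len(ref), len(hyp)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

def rouge_l(ref, hyp, beta=1.0):
    # ref / hyp: lists of tokens. Precision and recall over the LCS,
    # combined into the F-beta score from the ROUGE paper.
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return ((1 + beta**2) * precision * recall) / (recall + beta**2 * precision)

print(rouge_l("the cat sat on the mat".split(),
              "the cat sat on a mat".split()))  # LCS = 5 -> F1 ~ 0.833
```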
The only issue, I would say, is that the lookup operation is O(n): when I add an n-gram to a TensorArray, I have to iterate over all elements already in it to check whether the n-gram is present. I couldn't find a hash-map-like data structure in TensorFlow that accepts tensors as keys.
Added rough training examples for ROUGE-L and ROUGE-N.
Hey, @mattdangerw! Any thoughts on this? These are the three notebooks (rough implementations):
- Perplexity: https://colab.research.google.com/drive/1BH1lTw_qLK6671oWaoU15IrUKSQfd6oE?usp=sharing
- ROUGE-L: https://colab.research.google.com/drive/1xchVi4DsG_2gfi8FmO-g9GZS6-7CIFnz?usp=sharing
- ROUGE-N: https://colab.research.google.com/drive/1jurZQeHH760TyOHkjjae-eqkjpC7qSTc?usp=sharing
Thanks!
Thanks! Sorry, I was actually out sick this week :/ so I'm catching up now.
Let's open up separate issues for perplexity, ROUGE, and BLEU, as those are all components we would ideally like to have. Tracking everything on this one issue will get tricky.
I think perplexity is the correct one to start with.
No issues, @mattdangerw! Take care :)
OK, issues split out. I think that list of three is probably a good set to work through for now; once we get through those, we can open up issues for further metrics.
Let's focus on perplexity as a first example metric.
Great! Thank you, @mattdangerw!