Bug fix CNN/DM and XSum initialization
This PR builds on the changes made in PR #16, where a bug related to the CNNDM variable in summac/benchmark.py was identified.
Specifically, when the CNNDM variable is undefined, lines 44 and 54 can raise a NameError: name 'CNNDM' is not defined:
https://github.com/tingofurro/summac/blob/9e4f35722b635402c6fd5a1399d987bc80b45b43/summac/benchmark.py#L41-L61
The fix proposed by @forrestbao effectively resolve this issu, but during testing I noticed some performance concerns.>
To address this, I propose an alternative fix that leverages class variables. Through this new approach:
- The CNN/DM and XSum datasets are only loaded when an instance of the
SummaCBenchmarkclass is created. - The datasets are loaded only once and reused across all instances of the class.
In addition, the GDrive link for the SummEval dataset provided in lines 271-276 (apparently from the 4/19/2020 update on the README.md file of the original repo https://github.com/Yale-LILY/SummEval) is broken.
https://github.com/tingofurro/summac/blob/9e4f35722b635402c6fd5a1399d987bc80b45b43/summac/benchmark.py#L271-L276
I replaced it with a valid GCS bucket link that can be found on the README.md file of the same repo under the "Human annotations" header (https://storage.googleapis.com/sfr-summarization-repo-research/model_annotations.aligned.jsonl)
Changes
- Fixed
NameErrorby modified dataset loading in theSummaCBenchmarkclass. - Replaced broken GDrive link with a working GCS bucket link, for the SummEval dataset.