summac icon indicating copy to clipboard operation
summac copied to clipboard

Bug fix CNN/DM and XSum initialization

Open alejorba opened this issue 1 year ago • 0 comments

This PR builds on the changes made in PR #16, where a bug related to the CNNDM variable in summac/benchmark.py was identified.

Specifically, when the CNNDM variable is undefined, lines 44 and 54 can raise a NameError: name 'CNNDM' is not defined: https://github.com/tingofurro/summac/blob/9e4f35722b635402c6fd5a1399d987bc80b45b43/summac/benchmark.py#L41-L61

The fix proposed by @forrestbao effectively resolve this issu, but during testing I noticed some performance concerns.>

To address this, I propose an alternative fix that leverages class variables. Through this new approach:

  • The CNN/DM and XSum datasets are only loaded when an instance of the SummaCBenchmark class is created.
  • The datasets are loaded only once and reused across all instances of the class.

In addition, the GDrive link for the SummEval dataset provided in lines 271-276 (apparently from the 4/19/2020 update on the README.md file of the original repo https://github.com/Yale-LILY/SummEval) is broken. https://github.com/tingofurro/summac/blob/9e4f35722b635402c6fd5a1399d987bc80b45b43/summac/benchmark.py#L271-L276

I replaced it with a valid GCS bucket link that can be found on the README.md file of the same repo under the "Human annotations" header (https://storage.googleapis.com/sfr-summarization-repo-research/model_annotations.aligned.jsonl)

Changes

  • Fixed NameError by modified dataset loading in the SummaCBenchmark class.
  • Replaced broken GDrive link with a working GCS bucket link, for the SummEval dataset.

alejorba avatar Dec 06 '24 16:12 alejorba