
Benchmarking

Open kylebgorman opened this issue 2 years ago • 2 comments

We should add a benchmarking suite. I have reserved a separate repo, CUNY-CL/yoyodyne-benchmarks, for this.

Here is a list of shared tasks (and related papers) from which we can pull data:

The benchmark itself consists of two tables.

  • A "KPI" table, per dataset/language. E.g., "SIGMORPHON 2021 g2p Bulgarian".
  • A "study" table, per dataset/language/architecture. E.g., "Transducer ensemble on SIGMORPHON 2021 g2p Bulgarian".

A single script should compute all the KPI statistics and dump them out as a TSV; a sketch follows the list below. This table should include:

  • training set size
  • dev set size
  • test set size
  • average input string length
  • average output string length
  • whether it has features
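
A minimal sketch of such a KPI script, assuming two- or three-column TSV splits (source, target, and an optional features column) under the hypothetical file names train.tsv, dev.tsv, and test.tsv; string lengths are counted in Unicode characters here:

```python
import csv
import statistics

# Hypothetical per-split files; adjust paths and column order as needed.
SPLITS = {"train": "train.tsv", "dev": "dev.tsv", "test": "test.tsv"}


def read_split(path):
    """Yields (source, target, optional features) rows from a TSV file."""
    with open(path, encoding="utf-8") as source:
        yield from csv.reader(source, delimiter="\t")


def kpi_row(dataset, language):
    """Computes the KPI statistics for one dataset/language pair."""
    splits = {name: list(read_split(path)) for name, path in SPLITS.items()}
    train = splits["train"]
    return {
        "dataset": dataset,
        "language": language,
        "train_size": len(train),
        "dev_size": len(splits["dev"]),
        "test_size": len(splits["test"]),
        # Lengths in characters, averaged over the training set.
        "avg_input_length": statistics.mean(len(row[0]) for row in train),
        "avg_output_length": statistics.mean(len(row[1]) for row in train),
        "has_features": any(len(row) > 2 for row in train),
    }


def main():
    # One call per dataset/language pair; a single example is shown.
    rows = [kpi_row("SIGMORPHON 2021 g2p", "Bulgarian")]
    with open("kpi.tsv", "w", encoding="utf-8") as sink:
        writer = csv.DictWriter(sink, fieldnames=rows[0].keys(), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    main()
```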

While one could imagine a single script which performs all studies, this is probably not wise. Rather, these should be grouped into separate scripts based on their functionality (though it may make sense for there to be multiple studies per study script; e.g., we could have one script per dataset/language pair). The results can be dumped out in some structured format (JSON), and a separate script can then aggregate the non-ragged portions of all the JSON study reports into a single TSV; see the sketches after the list below. This table should include:

  • dataset
  • language
  • model type
  • GPU models (e.g., torch.cuda.get_device_name(number))
  • wall clock time during training
  • wall clock time during inference
  • development accuracy of best model
  • test accuracy of best model
  • hyperparameters for best model (this is the ragged part)
  • model size, in KB
  • model size, in # of trainable parameters
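
A few of the fields above (the GPU models and the two model-size measures) can be read directly off PyTorch; here is a minimal sketch, where model and checkpoint_path are hypothetical names for the trained torch.nn.Module and its saved checkpoint file, and the on-disk checkpoint size stands in for the KB measure:

```python
import os

import torch


def hardware_and_size_fields(model: torch.nn.Module, checkpoint_path: str) -> dict:
    """Collects the GPU and model-size fields of a study report."""
    return {
        # Names of all visible CUDA devices, comma-separated.
        "gpu_models": ", ".join(
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ),
        # On-disk size of the saved checkpoint, in KB.
        "model_size_kb": os.path.getsize(checkpoint_path) // 1024,
        # Number of trainable parameters.
        "model_size_parameters": sum(
            p.numel() for p in model.parameters() if p.requires_grad
        ),
    }
```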

Then, a separate script is used to aggregate the non-ragged portions of the extant study observations.
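
A minimal sketch of that aggregation step, assuming each study script has dumped one JSON report per study under a hypothetical studies/ directory, with top-level keys matching the (also hypothetical) snake_case field names in the script:

```python
import csv
import glob
import json

# The non-ragged fields shared by every study report (hypothetical key names);
# the ragged "hyperparameters" entry is deliberately excluded.
FIELDNAMES = [
    "dataset",
    "language",
    "model_type",
    "gpu_models",
    "train_wall_clock",
    "inference_wall_clock",
    "dev_accuracy",
    "test_accuracy",
    "model_size_kb",
    "model_size_parameters",
]


def main():
    with open("studies.tsv", "w", encoding="utf-8") as sink:
        writer = csv.DictWriter(
            sink, fieldnames=FIELDNAMES, delimiter="\t", extrasaction="ignore"
        )
        writer.writeheader()
        # One row per JSON study report.
        for path in sorted(glob.glob("studies/*.json")):
            with open(path, encoding="utf-8") as source:
                writer.writerow(json.load(source))


if __name__ == "__main__":
    main()
```

The extrasaction="ignore" argument is what lets each JSON report carry its own ragged hyperparameter block without breaking the shared TSV schema.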

Studies should include:

  • the worst of 5 randomly initialized models
  • the best of 5 randomly initialized models
  • the median of 5 randomly initialized models
  • a voting ensemble of 5 randomly initialized models (a voting sketch follows this list)
  • possibly: heterogeneous ensembles of different architectures
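
For the voting ensemble, a minimal sketch, assuming each of the 5 models has already written its predictions (one string per line, all aligned to the same test set) to the hypothetical files predictions_0.txt through predictions_4.txt; ties fall to whichever prediction is encountered first, but any deterministic tie-breaking rule would do:

```python
import collections
import contextlib

# Hypothetical per-model prediction files, one predicted string per line.
PREDICTION_PATHS = [f"predictions_{i}.txt" for i in range(5)]


def main():
    with contextlib.ExitStack() as stack:
        files = [
            stack.enter_context(open(path, encoding="utf-8"))
            for path in PREDICTION_PATHS
        ]
        with open("ensemble.txt", "w", encoding="utf-8") as sink:
            # Reads the five files in parallel, one test example at a time.
            for predictions in zip(*files):
                votes = collections.Counter(p.rstrip("\n") for p in predictions)
                # Majority vote; ties go to the first prediction encountered.
                print(votes.most_common(1)[0][0], file=sink)


if __name__ == "__main__":
    main()
```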

Putting this all together should make it easy for us to win relevant shared tasks. ;)

This is related to #5, as the refactoring there should make this much easier. This is also related to #15; we may want to use the sweeping interface for the benchmarks.

kylebgorman • Dec 09 '22 19:12

Had this thought the other night: what about the Google normalization tasks for English and Russian? (Not that we don't have enough already...)

bonham79 • Dec 09 '22 22:12

Our way of doing that (e.g., in Zhang et al. 2019 and earlier papers) was way more constrained than generalized sequence-to-sequence learning, so I think we’d basically have to implement an alternative “task”, possibly with multiple layers of prediction, and this seems like a big lift to me.

kylebgorman • Dec 09 '22 23:12