Benchmarking
We should add a benchmarking suite. I have reserved a separate repo, CUNY-CL/yoyodyne-benchmarks for this.
Here is a list of shared tasks (and related papers) from which we can pull data:
- Merhav & Ash (2018) transliteration
- Other transliteration tasks:
- SIGMORPHON 2016 inflection
- SIGMORPHON 2017 inflection
- SIGMORPHON 2018 inflection
- SIGMORPHON 2020 g2p
- SIGMORPHON 2021 g2p
- New York-Boulder abstractness data
The benchmark itself consists of two tables.
- A "KPI" table, per dataset/language. E.g., "SIGMORPHON 2021 g2p Bulgarian".
- A "study" table, per dataset/language/architecture. E.g., "Transducer ensemble on SIGMORPHON 2021 g2p Bulgarian".
A single script should compute all KPI statistics and dump them out as a TSV (a minimal sketch follows the list below). This table should include:
- training set size
- dev set size
- test set size
- average input string length
- average output string length
- whether it has features
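Here is a minimal sketch of such a KPI script, assuming each split is a two- or three-column TSV (source, target, optional features); the file names and output path are placeholders, and the average lengths are computed over the training split:

```python
"""Computes KPI statistics for one dataset/language and dumps them as TSV.

Placeholder paths; the real script would loop over every dataset/language.
"""

import csv
import statistics


def split_stats(path):
    """Returns size, average string lengths, and a features flag for one split."""
    sources, targets, has_features = [], [], False
    with open(path, encoding="utf-8") as source:
        for row in csv.reader(source, delimiter="\t"):
            sources.append(row[0])
            targets.append(row[1])
            has_features = has_features or (len(row) > 2 and bool(row[2]))
    return {
        "size": len(sources),
        "avg_input_length": statistics.mean(len(s) for s in sources),
        "avg_output_length": statistics.mean(len(t) for t in targets),
        "has_features": has_features,
    }


def main():
    splits = {
        split: f"sigmorphon2021_g2p_bul_{split}.tsv"
        for split in ("train", "dev", "test")
    }
    stats = {split: split_stats(path) for split, path in splits.items()}
    with open("kpi.tsv", "w", encoding="utf-8", newline="") as sink:
        writer = csv.writer(sink, delimiter="\t")
        writer.writerow(
            ["dataset", "language", "train_size", "dev_size", "test_size",
             "avg_input_length", "avg_output_length", "has_features"]
        )
        # Average lengths are taken over the training split here.
        writer.writerow(
            ["SIGMORPHON 2021 g2p", "bul",
             stats["train"]["size"], stats["dev"]["size"], stats["test"]["size"],
             round(stats["train"]["avg_input_length"], 2),
             round(stats["train"]["avg_output_length"], 2),
             stats["train"]["has_features"]]
        )


if __name__ == "__main__":
    main()
```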
While one could imagine a single script which performs all studies, this is probably not wise. Rather, these should be grouped into separate scripts based on their functionality (though it may make sense for there to be multiple studies per study script; e.g., we could have one script per dataset/language pair). The results can be dumped out in some structured format (JSON), and a separate script can be used to aggregate the non-ragged portions of all the JSON study reports into a single TSV (a sketch of this split follows the field list below). This table should include:
- dataset
- language
- model type
- GPU model(s) (e.g., `torch.cuda.get_device_name(number)`)
- wall clock time during training
- wall clock time during inference
- development accuracy of best model
- test accuracy of best model
- hyperparameters for best model (this is the ragged part)
- model size, in KB
- model size, in # of trainable parameters
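A rough sketch of how the report/aggregation split could work (field names, paths, and values below are illustrative placeholders, not a fixed schema):

```python
"""Sketch of the study report/aggregation split.

Field names, paths, and all numbers below are illustrative placeholders,
not real results or a fixed schema.
"""

import csv
import glob
import json

import torch

# The flat columns every study shares; `hyperparameters` is the ragged part
# and stays in the per-study JSON reports only.
FLAT_FIELDS = [
    "dataset",
    "language",
    "model_type",
    "gpu_model",
    "train_wall_clock_seconds",
    "inference_wall_clock_seconds",
    "dev_accuracy",
    "test_accuracy",
    "model_size_kb",
    "num_trainable_parameters",
]


def write_report(path, report):
    """Dumps one study's results (flat fields plus ragged hyperparameters)."""
    with open(path, "w", encoding="utf-8") as sink:
        json.dump(report, sink, indent=2)


def aggregate(pattern, out_path):
    """Collects the non-ragged portion of every report into one TSV."""
    with open(out_path, "w", encoding="utf-8", newline="") as sink:
        writer = csv.DictWriter(
            sink, fieldnames=FLAT_FIELDS, delimiter="\t", extrasaction="ignore"
        )
        writer.writeheader()
        for report_path in sorted(glob.glob(pattern)):
            with open(report_path, encoding="utf-8") as source:
                writer.writerow(json.load(source))


if __name__ == "__main__":
    # One hypothetical study; every value is a placeholder.
    write_report(
        "sigmorphon2021_g2p_bul_transducer_report.json",
        {
            "dataset": "SIGMORPHON 2021 g2p",
            "language": "Bulgarian",
            "model_type": "transducer",
            "gpu_model": (
                torch.cuda.get_device_name(0)
                if torch.cuda.is_available()
                else "cpu"
            ),
            "train_wall_clock_seconds": 1234.5,
            "inference_wall_clock_seconds": 6.7,
            "dev_accuracy": 0.0,
            "test_accuracy": 0.0,
            "model_size_kb": 0,
            "num_trainable_parameters": 0,
            "hyperparameters": {"embedding_size": 128, "dropout": 0.2},
        },
    )
    aggregate("*_report.json", "studies.tsv")
```

Keeping the hyperparameters in the JSON reports only avoids having to pad or flatten the ragged columns in the TSV.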
Studies should include:
- the worst of 5 randomly initialized models
- the best of 5 randomly initialized models
- the median of 5 randomly initialized models
- a voting ensemble of 5 randomly initialized models (see the sketch after this list)
- possibly: heterogeneous ensembles of different architectures
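A minimal sketch of how a study script might reduce the 5 seeded runs, assuming each run yields a dev accuracy and per-example predicted strings (training and decoding themselves are out of scope here):

```python
"""Sketch of per-study aggregation over 5 randomly seeded runs."""

import statistics
from collections import Counter


def seed_summary(dev_accuracies):
    """Worst, best, and median dev accuracy over the seeded runs."""
    return {
        "worst": min(dev_accuracies),
        "best": max(dev_accuracies),
        "median": statistics.median(dev_accuracies),
    }


def vote(predictions_per_model):
    """Majority-vote ensemble: for each example, the most common prediction
    across models (ties go to the first-seen prediction)."""
    return [
        Counter(candidates).most_common(1)[0][0]
        for candidates in zip(*predictions_per_model)
    ]


# Toy illustration with made-up values.
print(seed_summary([0.90, 0.94, 0.91, 0.93, 0.92]))
print(
    vote(
        [
            ["kat", "dog"],
            ["kat", "dok"],
            ["cat", "dog"],
            ["kat", "dog"],
            ["kat", "dog"],
        ]
    )
)  # -> ['kat', 'dog']
```

Voting over full output strings keeps the ensemble model-agnostic, so the same function would also cover the heterogeneous case.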
Putting this all together should make it easy for us to win relevant shared tasks. ;)
This is related to #5, as the refactoring there should make this much easier. This is also related to #15; we may want to use the sweeping interface for the benchmarks.
Had this thought the other night: what about the Google normalization tasks for English and Russian? (Not that we don't have enough already...)
Our way of doing that (e.g., in Zhang et al. 2019 and earlier papers) was way more constrained than generalized sequence-to-sequence learning, so I think we’d basically have to implement an alternative “task”, possibly with multiple layers of prediction, and this seems like a big lift to me.