
Adjust OpusTrainer settings in CI

Open eu9ene opened this issue 1 year ago • 0 comments

OpusTrainer buffers a large amount of data before training starts. That suits real training runs, but our CI dataset is tiny, so it ends up reading the dataset many times and preprocessing takes a very long time. We should adjust settings such as --chunk-size, --batch-size and --workers for CI.
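As a rough illustration of the proposed change, the sketch below assembles an opustrainer-train command line with reduced buffering. The flag names come from the help output in this issue. The helper name `build_ci_trainer_command`, the config filename `config.ci.yml`, and the values 10, 10 and 1 are placeholders chosen for illustration, not tested recommendations. The trailing trainer arguments are left out.

```python
import shlex


def build_ci_trainer_command(config_path, batch_size=10, chunk_size=10, workers=1):
    """Build an opustrainer-train argv list with small buffering settings.

    The default values are illustrative assumptions for a tiny CI dataset,
    not measured or recommended settings.
    """
    return [
        "opustrainer-train",
        "--config", config_path,
        "--batch-size", str(batch_size),
        "--chunk-size", str(chunk_size),
        "--workers", str(workers),
    ]


# Hypothetical config path; print the command so it can be inspected or pasted.
cmd = build_ci_trainer_command("config.ci.yml")
print(shlex.join(cmd))
```

Keeping the values as function parameters makes it easy to try a few combinations in CI and compare preprocessing time in the task logs before settling on anything.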

Example student task from CI: https://firefox-ci-tc.services.mozilla.com/tasks/AfSTqkqEQ5-vtmxJjkS3ZA/runs/0/logs/public/logs/live.log

Full help:

usage: opustrainer-train [-h] --config CONFIG [--state STATE] [--sync] [--temporary-directory TEMPORARY_DIRECTORY] [--do-not-resume] [--no-shuffle]
                         [--batch-size BATCH_SIZE] [--chunk-size CHUNK_SIZE] [--workers WORKERS] [--log-level LOG_LEVEL] [--log-file LOG_FILE]
                         ...

Feeds marian tsv data for training.

positional arguments:
  trainer               Trainer program that gets fed the input. If empty it is read from config.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG, -c CONFIG
                        YML configuration input.
  --state STATE, -s STATE
                        YML state file, defaults to ${CONFIG}.state.
  --sync                Do not shuffle async
  --temporary-directory TEMPORARY_DIRECTORY, -T TEMPORARY_DIRECTORY
                        Temporary dir, used for shuffling and tracking state
  --do-not-resume, -d   Do not resume from the previous training state
  --no-shuffle, -n      Do not shuffle, for debugging
  --batch-size BATCH_SIZE, -b BATCH_SIZE
                        Batch size
  --chunk-size CHUNK_SIZE, -B CHUNK_SIZE
                        Chunk size of batches fed to modifiers
  --workers WORKERS, -j WORKERS
                        Number of workers
  --log-level LOG_LEVEL
                        Set log level. Available levels: DEBUG, INFO, WARNING, ERROR, CRITICAL. Default is INFO
  --log-file LOG_FILE, -l LOG_FILE
                        Target location for logging. Always logs to stderr and optionally to a file.

eu9ene, Feb 16 '24 17:02