OpusCleaner
Workflow
Sorry, bad title; I need to jot down some notes.
Empty-train workflow, long version (maybe you can skip steps?)
1. Select datasets
2. Download each dataset
3. Generate samples
4. Select filters for each dataset
5. Select a category for each dataset
6. Run filters on each dataset (highly parallel)
7. Combine and deduplicate datasets (parallel per category, maybe, #41)
8. Create trainer.py configuration using categories and deduplicated files from previous step
9. Generate alignments for placeholders, for training guided alignment(?) (parallel)
10. Run trainer.py to train model
We need at least a workflow manager to manage steps 4..7, maybe 8.
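
To make the ordering question concrete, here is a rough sketch of the workflow above as a dependency graph, using Python's standard `graphlib` (3.9+). The step names and the dependency edges are my own assumptions read off the list, not anything OpusCleaner or trainer.py defines; in particular, whether category selection and alignment generation really depend on the steps shown is a guess.

```python
from graphlib import TopologicalSorter

# Hypothetical step names, one per item in the workflow list.
# Each step maps to the set of steps it is ASSUMED to depend on.
STEPS = {
    "select_datasets": set(),
    "download": {"select_datasets"},
    "generate_samples": {"download"},
    "select_filters": {"generate_samples"},
    "select_category": {"select_datasets"},
    "run_filters": {"select_filters"},
    "combine_and_deduplicate": {"run_filters", "select_category"},
    "create_trainer_config": {"combine_and_deduplicate"},
    "generate_alignments": {"combine_and_deduplicate"},
    "train": {"create_trainer_config", "generate_alignments"},
}


def execution_waves(steps):
    """Group steps into waves; steps in the same wave could run in parallel."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(STEPS), start=1):
        print(number, ", ".join(wave))
```

This only covers ordering between steps. The per-dataset and per-category parallelism mentioned in the list would still need a real workflow manager on top.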