OpusCleaner
Support monolingual datasets
OpusCleaner doesn't support monolingual data out of the box. Some filters support it, but the interface does not. Re #139.
Steps:
- [ ] Check that all filters support monolingual data. Maybe related to #130 as it talks about treating datasets more as tables with columns.
- [ ] Data discovery (`datasets.py`) needs to identify datasets + language code from filename.
- [ ] Data download (`download.py`) needs to handle the monolingual `$DATASET.$LANG.gz` file as is. No extraction necessary.
- [ ] Interface needs to support datasets with one column
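
For the data discovery step, here is a rough sketch of how the dataset name and language code could be pulled out of a `$DATASET.$LANG.gz` filename. This is not the existing `datasets.py` code: the function name `parse_monolingual_filename`, the `MonolingualDataset` tuple, and the assumption that the language code is two or three lowercase letters are all hypothetical and only for illustration.

```python
import re
from typing import NamedTuple, Optional


class MonolingualDataset(NamedTuple):
    """Hypothetical record for a discovered monolingual dataset."""
    name: str
    lang: str


# Assumption: files are named $DATASET.$LANG.gz and $LANG is a 2-3 letter
# lowercase code. The dataset name itself may contain dots.
_MONO_PATTERN = re.compile(r"^(?P<name>.+)\.(?P<lang>[a-z]{2,3})\.gz$")


def parse_monolingual_filename(filename: str) -> Optional[MonolingualDataset]:
    """Return (name, lang) for a matching filename, or None otherwise."""
    match = _MONO_PATTERN.match(filename)
    if match is None:
        return None
    return MonolingualDataset(name=match["name"], lang=match["lang"])
```

With this pattern, `parse_monolingual_filename("news-2020.en.gz")` yields the name `news-2020` and the language `en`, while a filename without a language component returns `None`. Real data may need a broader language-code pattern (for example script or region suffixes), so treat the regex as a starting point.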