
Support monolingual datasets

Open jelmervdl opened this issue 2 years ago • 0 comments

OpusCleaner doesn't support monolingual data out of the box. Some filters support it, but the interface does not. Re #139.

Steps:

  • [ ] Check that all filters support monolingual data. This may be related to #130, which discusses treating datasets more as tables with columns.
  • [ ] Data discovery (datasets.py) needs to identify monolingual datasets and their language code from the filename.
  • [ ] Data download (download.py) needs to handle the monolingual $DATASET.$LANG.gz file as is. No extraction is necessary.
  • [ ] Interface needs to support datasets with one column.
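
To make the discovery and download steps concrete, here is a rough sketch of how a `$DATASET.$LANG.gz` filename could be split into a dataset name and language code, and how the file could be read as a one-column table without extracting it. The names `MonoDataset`, `parse_mono_filename`, and `read_single_column` are made up for illustration and are not part of OpusCleaner's datasets.py or download.py; the sketch also assumes the language code is the last dot-separated part before `.gz`.

```python
import gzip
from pathlib import Path
from typing import Iterator, NamedTuple, Optional, Tuple


class MonoDataset(NamedTuple):
    """Hypothetical record for a monolingual dataset found on disk."""
    name: str
    lang: str


def parse_mono_filename(path: str) -> Optional[MonoDataset]:
    """Split `$DATASET.$LANG.gz` into (dataset, lang); None if it doesn't match.

    Assumption: the language code is the last dot-separated part before `.gz`,
    so dataset names may themselves contain dots.
    """
    filename = Path(path).name
    if not filename.endswith(".gz"):
        return None
    stem = filename[: -len(".gz")]
    name, sep, lang = stem.rpartition(".")
    if not sep or not name or not lang:
        return None
    return MonoDataset(name, lang)


def read_single_column(path: str) -> Iterator[Tuple[str]]:
    """Yield each line of the gzipped file as a one-column row, read in place."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield (line.rstrip("\n"),)
```

Yielding one-element tuples keeps monolingual rows in the same shape as multi-column rows, which would fit the "datasets as tables with columns" idea mentioned in the first step.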

jelmervdl, Jan 07 '24 23:01