
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

58 OpusCleaner issues, sorted by recently updated

Sometimes when I am viewing changes, the GUI shows that the whole file has changed. I think this happens with the fix-quotes filter (but it may happen with others). It shows...

I'm checking out this rule and found some entries that were discarded which seemed valid to me. Mostly, punctuation seems to be getting in the way. | English | Spanish...

bug

At the moment there is no boundary between the selected filters and the available ones. Also, some filters only need to be used once (like "remove whitespace"); maybe when they are selected, they...

enhancement

I am trying to understand the intended workflow for OpusCleaner. Suppose I want to build some MT systems. I fire up OpusCleaner, download some data, apply cleaning rules until I...

enhancement

If you set `DATA_PATH` to something other than the default, then you can download data successfully, but it does not show up on the data listing page. This is because...

bug
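The `DATA_PATH` issue above describes a mismatch between where data is downloaded and where the listing looks for it. A minimal sketch of that bug pattern follows; the function names, the default path, and the fix are all illustrative assumptions, not OpusCleaner's actual code:

```python
import os

# Illustrative default, not OpusCleaner's real constant.
DEFAULT_DATA_PATH = "data/train-parts"

def download_target() -> str:
    # The download step honours the DATA_PATH override.
    return os.environ.get("DATA_PATH", DEFAULT_DATA_PATH)

def listing_source_buggy() -> str:
    # The listing step ignores the override, so downloaded
    # files never show up on the data listing page.
    return DEFAULT_DATA_PATH

def listing_source_fixed() -> str:
    # The fix: read the same override in both places.
    return os.environ.get("DATA_PATH", DEFAULT_DATA_PATH)

os.environ["DATA_PATH"] = "/tmp/custom-data"
assert download_target() != listing_source_buggy()   # the reported mismatch
assert download_target() == listing_source_fixed()   # paths agree after the fix
```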

How to reproduce:
1. Install `requirements-all.txt`
2. Load a corpus and navigate to filters
3. Attempt to add the "detokenizer" rule

The following error is produced:
```
Usage: sacremoses [OPTIONS] COMMAND [ARGS]...
```
...

bug
component:filter

Often adding a filter fails because no default arguments are passed to the filter function. For example, many filters require a source and target language, and this could be set...

enhancement
help wanted
good first issue
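One way the default-arguments idea could work is to pre-fill required filter arguments (such as source and target language) from the dataset's metadata, letting explicit user input override them. This is a hypothetical sketch; the argument names and the helper are illustrative, not OpusCleaner's API:

```python
def with_defaults(filter_args: dict, dataset_langs=("en", "es")) -> dict:
    """Merge dataset-derived defaults under user-supplied filter arguments."""
    src, trg = dataset_langs
    defaults = {"source-language": src, "target-language": trg}
    # User-supplied arguments take precedence over the inferred defaults.
    return {**defaults, **filter_args}

# Adding a filter with no arguments no longer fails for lack of defaults:
assert with_defaults({}) == {"source-language": "en", "target-language": "es"}
# Explicit arguments still win:
assert with_defaults({"target-language": "de"})["target-language"] == "de"
```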

OpusCleaner doesn't support monolingual data out of the box. Some filters support it, but the interface does not. Re #139. Steps:
- [ ] Check that all filters support monolingual...

enhancement
help wanted

I don't see that I can cancel it. Even stopping and starting opuscleaner leaves it in the dimmed-out state. I'm assuming I'll have to go in and try to...

It would be nice to see how much time is left on these bigger datasets.