
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

58 OpusCleaner issues, sorted by recently updated

Sometimes when I am viewing changes, the GUI shows that the whole file has changed. I think this happens with the fix-quotes filter (but it may happen with others). It shows...

I'm checking out this rule and found some entries that were discarded which seemed valid to me. Mostly, punctuation seems to be getting in the way. | English | Spanish...

bug

At the moment there is no boundary between the selected filters and the available ones. Also, some filters only need to be used once (like "remove whitespace"); maybe when they are selected, they...

enhancement

I am trying to understand the intended workflow for OpusCleaner. Suppose I want to build some MT systems. I fire up OpusCleaner, download some data, apply cleaning rules until I...

enhancement

If you set `DATA_PATH` to something other than the default, then you can download data successfully, but it does not show up on the data listing page. This is because...

bug
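The `DATA_PATH` issue above describes a mismatch between where data is downloaded and where the listing looks for it. A minimal sketch of that bug pattern follows; the function names, the default path, and the fix are all illustrative assumptions, not OpusCleaner's actual code:

```python
import os

# Illustrative default, not OpusCleaner's real constant.
DEFAULT_DATA_PATH = "data/train-parts"

def download_target() -> str:
    # The download step honours the DATA_PATH override.
    return os.environ.get("DATA_PATH", DEFAULT_DATA_PATH)

def listing_source_buggy() -> str:
    # The listing step ignores the override, so downloaded
    # files never show up on the data listing page.
    return DEFAULT_DATA_PATH

def listing_source_fixed() -> str:
    # The fix: read the same override in both places.
    return os.environ.get("DATA_PATH", DEFAULT_DATA_PATH)

os.environ["DATA_PATH"] = "/tmp/custom-data"
assert download_target() != listing_source_buggy()   # the reported mismatch
assert download_target() == listing_source_fixed()   # paths agree after the fix
```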

How to reproduce:
1. Install `requirements-all.txt`
2. Load a corpus and navigate to filters
3. Attempt to add the "detokenizer" rule

The following error is produced:
```
Usage: sacremoses [OPTIONS] COMMAND [ARGS]...
```
...

bug
component:filter

Often adding a filter fails because no default arguments are passed to the filter function. For example, many filters require a source and target language, and this could be set...

enhancement
help wanted
good first issue
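One way the default-arguments idea could work is to pre-fill required filter arguments (such as source and target language) from the dataset's metadata, letting explicit user input override them. This is a hypothetical sketch; the argument names and the helper are illustrative, not OpusCleaner's API:

```python
def with_defaults(filter_args: dict, dataset_langs=("en", "es")) -> dict:
    """Merge dataset-derived defaults under user-supplied filter arguments."""
    src, trg = dataset_langs
    defaults = {"source-language": src, "target-language": trg}
    # User-supplied arguments take precedence over the inferred defaults.
    return {**defaults, **filter_args}

# Adding a filter with no arguments no longer fails for lack of defaults:
assert with_defaults({}) == {"source-language": "en", "target-language": "es"}
# Explicit arguments still win:
assert with_defaults({"target-language": "de"})["target-language"] == "de"
```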

OpusCleaner doesn't support monolingual data out of the box. Some filters support it, but the interface does not. Re #139. Steps:
- [ ] Check that all filters support monolingual...

enhancement
help wanted

I don't see that I can cancel it. Even stopping and starting opuscleaner leaves it in the dimmed-out state. I'm assuming I'll have to go in and try to...

It would be nice to see how much time is left on these bigger datasets.