
Automatically derive filters based on a clean sample provided by the user.

Open PinzhenChen opened this issue 1 year ago • 5 comments

In practice, I would have a large, noisy training set and a small clean sample that is representative of the downstream task (e.g. WMT validation sets).

It is still difficult for me to decide on the values for the filters. For example, should I choose a source_word_ratio of 0.4 or 0.5, especially if I do not speak both languages? There are many filters and values to search over. The choice is largely empirical, and it is hard to attribute a change in the final system's BLEU/COMET to a specific value change.

If I provide a small clean dataset that is sufficiently representative of the test domain, can the tool run automatically and derive some rules/values for me? Maybe the tool could search for and return the filter values that are as "extreme" (strict) as possible yet do not lead to the provided clean data being filtered out?
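
To make the idea concrete, here is a minimal sketch of what such a search could look like for a single score-based filter. It is not OpusCleaner code: `word_ratio` is a hypothetical stand-in for a length-ratio filter (it may not match how source_word_ratio is actually computed), and `derive_min_threshold` and its `keep` parameter are names made up for illustration. The sketch scores every clean pair and returns the strictest minimum threshold that still keeps the requested fraction of the clean sample.

```python
import math
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def word_ratio(pair: Pair) -> float:
    """Hypothetical score: shorter word count divided by longer word count."""
    src, trg = pair
    a, b = len(src.split()), len(trg.split())
    if max(a, b) == 0:
        return 0.0
    return min(a, b) / max(a, b)


def derive_min_threshold(
    clean: List[Pair],
    score: Callable[[Pair], float],
    keep: float = 1.0,
) -> float:
    """Return the largest t such that at least `keep` of the clean pairs have score >= t."""
    if not clean:
        raise ValueError("clean sample must not be empty")
    if not 0.0 < keep <= 1.0:
        raise ValueError("keep must be in (0, 1]")
    # Sort from best to worst score, then take the score of the last pair we insist on keeping.
    scores = sorted((score(p) for p in clean), reverse=True)
    n_keep = max(1, math.ceil(keep * len(scores)))
    return scores[n_keep - 1]


clean_sample = [
    ("a b c d", "w x y z"),
    ("a b c", "w x y z"),
    ("a b", "w x y z"),
]
strict = derive_min_threshold(clean_sample, word_ratio)           # keeps every clean pair
looser = derive_min_threshold(clean_sample, word_ratio, keep=0.5)  # tolerates losing some
```

Setting `keep` below 1.0 would let the search ignore a few outliers in the clean sample, which might matter if the clean data is not perfectly clean. Filters with a maximum rather than a minimum threshold would need the mirrored computation.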

PinzhenChen, Jan 17 '24 15:01