
Opusfilter backing

Open jelmervdl opened this issue 2 years ago • 1 comments

Experimental branch that uses OpusFilter for all of the tools and processing.

Ideally we'd add compatibility for our external filters (the .json files) as well because I like that extensibility a lot. But OpusFilter already comes with some useful ones. And implementing our own filters in Python is also doable.

I'm also going to use this Pull Request as a little notepad for things I find in OpusFilter that I need to write down somewhere so I can have someone else look at whether it makes sense.

Notes on OpusFilter

These notes come from just reading the source, not from actually trying it, so I might be wrong.

RegExpPreprocessor

Does the RegExpPreprocessor work? It seems to compile lang_patterns twice: https://github.com/Helsinki-NLP/OpusFilter/blob/9f6636960a21a673f80308e8bd36216cdb144caa/opusfilter/preprocessors.py#L93-L98
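For reference, a small check of what Python's `re` module does when a pattern is compiled twice. This is only a sketch of the standard-library behaviour, not of OpusFilter's code, so whether the linked lines are actually a problem still depends on what is passed in there (for example, whether flags are involved or whether the nesting of lang_patterns changes between the two passes).

```python
import re

compiled = re.compile(r"\s+")

# Passing an already compiled pattern back into re.compile (without flags)
# hands back the very same object, so the second compile is redundant
# rather than harmful.
again = re.compile(compiled)
same_object = again is compiled

# With flags, however, compiling a compiled pattern raises ValueError.
try:
    re.compile(compiled, re.IGNORECASE)
    flags_rejected = False
except ValueError:
    flags_rejected = True

print(same_object, flags_rejected)
```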

Filter pipeline implementation

FilterABC has a base filter implementation that pretty naïvely calls self.score with a single pair: https://github.com/Helsinki-NLP/OpusFilter/blob/9f6636960a21a673f80308e8bd36216cdb144caa/opusfilter/__init__.py#L50-L54

It looks as if that naïve implementation is called in the pipeline: https://github.com/Helsinki-NLP/OpusFilter/blob/9f6636960a21a673f80308e8bd36216cdb144caa/opusfilter/pipeline.py#L94-L98

… which all in all feels wrong, given how much attention goes into proper chunking in the steps before it, and given that all of the filter implementations are generators. None of the actual filters make use of batching, but I’d say that would be useful once you add filters like LASER, where scoring many pairs at once is much cheaper than scoring them one by one.
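To make the batching idea concrete, here is a minimal sketch of a filter that scores a whole chunk at once and still exposes a generator. Everything in it (`chunked`, `BatchedLengthRatioFilter`, `score_batch`) is hypothetical and made up for illustration; it is not OpusFilter's API, and the length-ratio score is just a cheap stand-in for something expensive like a LASER model.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def chunked(pairs: Iterable[Pair], size: int) -> Iterator[List[Pair]]:
    """Yield lists of at most `size` pairs from any iterable."""
    it = iter(pairs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


class BatchedLengthRatioFilter:
    """Hypothetical filter (not part of OpusFilter) that scores per batch."""

    def __init__(self, max_ratio: float = 2.0) -> None:
        self.max_ratio = max_ratio

    def score_batch(self, batch: List[Pair]) -> List[float]:
        # A model-based filter would run one forward pass for the whole
        # batch here; this stand-in just computes word-count ratios.
        scores = []
        for src, tgt in batch:
            a, b = len(src.split()), len(tgt.split())
            scores.append(max(a, b) / max(1, min(a, b)))
        return scores

    def accept(self, score: float) -> bool:
        return score <= self.max_ratio

    def filter(self, pairs: Iterable[Pair], batch_size: int = 2) -> Iterator[Pair]:
        # Still a generator, but score is called once per chunk, not per pair.
        for batch in chunked(pairs, batch_size):
            for pair, score in zip(batch, self.score_batch(batch)):
                if self.accept(score):
                    yield pair


pairs = [("a b", "c d"), ("a", "b c d e"), ("x y z", "u v")]
kept = list(BatchedLengthRatioFilter().filter(pairs))
print(kept)
```

A per-pair `score` could still be offered on top of this by wrapping a single pair in a one-element batch, so existing filters would not need to change.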

Separation of preprocessors and filters and intermediate output files

It is useful that OpusFilter can read file formats as part of processing steps, but the downside is that each step has to name its input and output files. When mixing processing and filtering steps, this forces you to write intermediate data to disk. Maybe empty-train should have a stricter distinction between filtering and preprocessing. But from a user perspective… is that what you’d want? Say I’d like to filter out the obvious trash first, then preprocess the remainder to be as good as possible, and then use the expensive filters to filter out the lower-quality stuff.
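One way to picture the alternative: if both filters and preprocessors are just functions from a stream of pairs to a stream of pairs, they can be interleaved freely and nothing has to touch the disk in between. The sketch below assumes exactly that; the step names and `run_pipeline` are hypothetical and belong to neither OpusFilter nor empty-train.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]
Step = Callable[[Iterable[Pair]], Iterator[Pair]]


def drop_empty(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Cheap filter: throw away the obvious trash first."""
    for src, tgt in pairs:
        if src.strip() and tgt.strip():
            yield src, tgt


def normalise_whitespace(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Preprocessor: clean up what survived the cheap filter."""
    for src, tgt in pairs:
        yield " ".join(src.split()), " ".join(tgt.split())


def drop_identical(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Stand-in for an expensive filter that runs last."""
    for src, tgt in pairs:
        if src != tgt:
            yield src, tgt


def run_pipeline(pairs: Iterable[Pair], steps: List[Step]) -> Iterator[Pair]:
    # Each step wraps the previous generator, so pairs flow through
    # lazily and no intermediate file is written.
    stream: Iterator[Pair] = iter(pairs)
    for step in steps:
        stream = step(stream)
    return stream


data = [("  hello   world ", "hallo  wereld"), ("", "x"), ("same ", " same")]
result = list(run_pipeline(data, [drop_empty, normalise_whitespace, drop_identical]))
print(result)
```

The trade-off is that you lose the intermediate files as checkpoints, so a step that wants to resume or inspect partial output would have to write them explicitly.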

jelmervdl · Oct 18 '22 10:10