OpusCleaner issues

Opusfilter backing

1

Experimental branch that uses OpusFilter for all of the tools and processing. Ideally we'd add compatibility for our external filters (the .json files) as well because I like that extensibility...

jelmervdl

Filter parameter type for languages

Many filters need to know the languages of the two columns. It could be useful if those can be filled in automatically. The interface knows the language of the two...

jelmervdl

enhancement

component:ui

Partial execution of filter pipeline

In the category _lessons from Paracrawl_: It is sometimes very useful to split the actual filtering pipeline into a couple of steps that are then executed on different hardware. For...

jelmervdl

enhancement

help wanted

component:execution

Add parameter value hint in the json

9

https://github.com/jelmervdl/empty-train/blob/5edf6e20b7fe381ca87d4abe9aa4fcf1985b63ef/main.py#L31 We should optionally have a list of possible values, or something like a datatype hint (e.g. int)

XapaJIaMnu

enhancement

component:ui
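
One way to picture the hint described above is a parameter spec that carries an optional datatype and an optional list of allowed values. The sketch below is only an illustration: the field names `type` and `allowed_values` and the helper `coerce_parameter` are hypothetical and are not taken from the actual filter JSON schema.

```python
from typing import Any, Dict

# Hypothetical mapping from a datatype hint in the JSON to a Python constructor.
_CASTS = {"int": int, "float": float, "str": str}


def coerce_parameter(spec: Dict[str, Any], raw: str) -> Any:
    """Convert a raw string to the hinted type and check it against allowed values."""
    # Fall back to plain strings when no datatype hint is given.
    cast = _CASTS[spec.get("type", "str")]
    value = cast(raw)  # raises ValueError if the input does not parse
    allowed = spec.get("allowed_values")
    if allowed is not None and value not in allowed:
        raise ValueError(f"{value!r} is not one of {allowed!r}")
    return value
```

A UI could use the same spec to render a number input for `int` hints or a dropdown when `allowed_values` is present.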

Chinese Traditional <-> Simplified

3

The last time I worked with this, I used [OpenCC](https://pypi.org/project/OpenCC/). It is much more up to date and seems to have an active community. Last release from hanziconv is...

ZJaume

component:filter

Corpus error detections

2

Some corpora are contaminated with XML/HTML/CSV and other types of artefacts. We should have a category of filters that simply runs `grep -n` over the full dataset for these, and...

XapaJIaMnu

enhancement
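
As a rough sketch of the `grep -n`-style check described above, the snippet below scans lines for a couple of markup artefacts and reports line numbers. The pattern set and the function name `find_artefacts` are hypothetical; they are not part of OpusCleaner, and a real filter would likely need a broader set of patterns.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical patterns for common markup artefacts; extend as needed.
ARTEFACT_PATTERNS = {
    "html_tag": re.compile(r"</?[A-Za-z][A-Za-z0-9]*(\s[^<>]*)?>"),
    "html_entity": re.compile(r"&(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);"),
}


def find_artefacts(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, pattern_name, line) for every matching pattern, like `grep -n`."""
    for number, line in enumerate(lines, start=1):
        for name, pattern in ARTEFACT_PATTERNS.items():
            if pattern.search(line):
                yield number, name, line.rstrip("\n")
```

Because it only reports matches and does not drop lines, a check like this could be run over a whole corpus to decide which cleaning filters are needed.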

Build alphabet support from CLDR data

In PR #157 I added additional alphabet support. This information is provided by professional translators in the CLDR data: https://github.com/unicode-org/cldr-json/blob/0876ec40e13d54c0a6b6456392802d4de7e059cb/cldr-json/cldr-misc-full/main/sl/characters.json It would be nice to consume that JSON and automate...

gregtatum

Possible unpredicted behaviour

> https://github.com/facebookresearch/fastText has been archived by the owner on Mar 19, 2024. It is now read-only. `fastText` uses `numpy` ([link](https://github.com/facebookresearch/fastText/blob/1142dc4c4ecbc19cc16eee5cdd28472e689267e6/setup.py#L197)). `numpy` recently updated to `2.0.0`. Also, other dependencies can be...

rggdmonk
