Interfacing Mammoth and OpusFilter
In the long run, rather than maintaining our own custom transforms for data cleaning (as suggested by #13), it would be better to leave this to a relevant third-party tool, such as OpusFilter.
I'm mostly opening this issue for discussion and as a long-term project rather than expecting this to be handled soon.
As far as I can tell, one would need:
- [ ] a contribution to OpusFilter to allow it to read and write through piping (or selecting some other tool that allows cleanup through piping)
- [ ] a transform to handle passing the data from our iterators to the third party software for cleaning and back into the system:
  - [ ] booting a pipe in the `warmup` method
  - [ ] implementing `apply` through e.g. a `popen.read`
  - [ ] a graceful closure when training ends.
- [ ] (optionally) a deprecation of the current filters if they are no longer needed or are redundant.
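To make the transform items above more concrete, here is a rough sketch of what the pipe-based approach could look like. It is only an illustration under several assumptions: `PipedFilterTransform`, `STAND_IN_FILTER` and `close` are hypothetical names, the `warm_up`/`apply` signatures are simplified and do not claim to match Mammoth's actual transform interface, and the child process is a small stand-in script rather than OpusFilter (which, as noted above, would first need to support piping).

```python
import subprocess
import sys

# Hypothetical stand-in for a third-party cleaner: it reads one sentence per
# line on stdin and answers "keep" or "drop" on stdout. This is NOT OpusFilter.
STAND_IN_FILTER = r'''
import sys
for line in sys.stdin:
    n_tokens = len(line.split())
    print("keep" if 0 < n_tokens <= 5 else "drop", flush=True)
'''


class PipedFilterTransform:
    """Sketch of a transform that delegates filtering to a child process."""

    def __init__(self, command):
        self.command = command
        self.process = None

    def warm_up(self):
        # Boot the pipe once, before any data is processed.
        self.process = subprocess.Popen(
            self.command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered, so each request is sent promptly
        )

    def apply(self, sentence):
        # Send one sentence, then block until the child returns its verdict.
        # Newlines are flattened so one request is always exactly one line.
        self.process.stdin.write(sentence.replace("\n", " ") + "\n")
        self.process.stdin.flush()
        verdict = self.process.stdout.readline().strip()
        return sentence if verdict == "keep" else None

    def close(self):
        # Graceful closure: closing stdin signals EOF, so the child can exit.
        if self.process is not None:
            self.process.stdin.close()
            self.process.wait(timeout=10)
            self.process.stdout.close()
            self.process = None
```

With this sketch, one would build the transform with something like `PipedFilterTransform([sys.executable, "-u", "-c", STAND_IN_FILTER])`, call `warm_up()` once, call `apply()` per example (dropping examples for which it returns `None`), and call `close()` when training ends. The one-line-in, one-line-out protocol is an assumption chosen to keep the exchange free of deadlocks; a real tool may need batching or a different framing.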
Note that there are several challenges ahead: in particular,
- one will have to find a way around sub-process creation in multi-node settings
- it's unclear how much effort would be required to implement the required behavior on the third party side (OpusFilter or other).