Interfacing Mammoth and OpusFilter

Open TimotheeMickus opened this issue 2 years ago • 0 comments

In the long run, rather than having our custom transforms for data cleaning (as suggested by #13), it would be better to leave it to a relevant third party, such as OpusFilter.

I'm mostly opening this issue for discussion and as a long-term project rather than expecting this to be handled soon.

As far as I can tell, one would need:

  • [ ] a contribution to OpusFilter to allow it to read and write through piping (or selecting some other tool that allows cleanup through piping)
  • [ ] a transform to handle passing the data from our iterators to the third party software for cleaning and back into the system:
    • [ ] booting a pipe in the warmup method
    • [ ] implementing `apply` through e.g. a `popen.read`
    • [ ] a graceful closure when training ends.
  • [ ] (optionally) a deprecation of current filters if they are no longer needed or redundant.

Note that there are several challenges ahead: in particular,

  • one will have to find a way around sub-process creation in multi-node settings
  • it's unclear how much effort would be required to implement the required behavior on the third party side (OpusFilter or other).

TimotheeMickus · Sep 22 '23 11:09