xnmt
Reduce intermediary files during preprocessing
It would be nice to have a way to reduce the number of intermediary files used during preprocessing, either by:
- Having an option to remove the output files of any preprocessing task (before training begins)
- Specifying only one input/output file pair and having all preprocessing take place inside xnmt without writing intermediate results to disk (see the sketch after the example below)
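To make the first option concrete, here is a rough sketch (all names here are made up for illustration and are not existing xnmt code) of a cleanup step that deletes a preprocessing task's outputs once the final files are in place, before training starts:

```python
import os

def cleanup_intermediates(task_outputs: list, keep: set) -> None:
    # delete every preprocessing output that training does not actually read
    for path in task_outputs:
        if path not in keep and os.path.exists(path):
            os.remove(path)
```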
For instance, in one of my experiments I need to:
- filter out empty sentences
- lowercase
- tokenize
- filter out sentences that are too long (sometimes necessary after tokenization)
This creates 2-6 files each time ([train+dev+test]*2). For big corpora this is a bit wasteful.
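To illustrate the second option, here is a minimal sketch (again hypothetical, not xnmt's actual preprocessing API) of streaming these four steps through generators so that only the final file per split is written to disk:

```python
from typing import Iterable, Iterator

def filter_empty(lines: Iterable[str]) -> Iterator[str]:
    # drop empty or whitespace-only sentences
    return (l for l in lines if l.strip())

def lowercase(lines: Iterable[str]) -> Iterator[str]:
    return (l.lower() for l in lines)

def tokenize(lines: Iterable[str]) -> Iterator[str]:
    # stand-in for a real tokenizer; here it only normalizes whitespace
    return (" ".join(l.split()) for l in lines)

def filter_too_long(lines: Iterable[str], max_len: int = 80) -> Iterator[str]:
    # drop sentences longer than max_len tokens (useful after tokenization)
    return (l for l in lines if len(l.split()) <= max_len)

def preprocess(in_path: str, out_path: str) -> None:
    # stream the input through all steps; only the final result touches disk
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        lines: Iterator[str] = (l.rstrip("\n") for l in fin)
        for step in (filter_empty, lowercase, tokenize, filter_too_long):
            lines = step(lines)
        for line in lines:
            fout.write(line + "\n")

# usage with placeholder file names:
#   for split in ("train", "dev", "test"):
#       preprocess(f"{split}.raw.src", f"{split}.preproc.src")
```

Each call would read one raw file and write one preprocessed file, instead of one extra file per step and per split.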
I think this is a good idea, but the one thing that makes it complicated is that preprocessing can currently avoid re-generating files when they already exist, which saves time across multiple runs. If the files are deleted, they'll be re-generated. I wonder if there's a good design for handling this without adding too much complexity?
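One possible design (purely a sketch, not anything implemented in xnmt; all names are hypothetical) would be to cache a small fingerprint of the preprocessing spec and its inputs next to the final output, so the intermediate files can be deleted while repeated runs with an unchanged spec are still skipped:

```python
import hashlib
import json
import os

def fingerprint(spec: dict, input_paths: list) -> str:
    # hash the preprocessing spec plus the raw inputs' modification times
    h = hashlib.sha256(json.dumps(spec, sort_keys=True).encode("utf-8"))
    for path in sorted(input_paths):
        h.update(path.encode("utf-8"))
        h.update(str(os.path.getmtime(path)).encode("utf-8"))
    return h.hexdigest()

def up_to_date(spec: dict, input_paths: list, out_path: str) -> bool:
    # skip preprocessing if the final output and its stamp file still match
    stamp = out_path + ".stamp"
    if not (os.path.exists(out_path) and os.path.exists(stamp)):
        return False
    with open(stamp) as f:
        return f.read().strip() == fingerprint(spec, input_paths)

def write_stamp(spec: dict, input_paths: list, out_path: str) -> None:
    # record the fingerprint next to the final output after a successful run
    with open(out_path + ".stamp", "w") as f:
        f.write(fingerprint(spec, input_paths))
```

That would only re-run the pipeline when the spec or the raw inputs change, at the cost of one small stamp file per final output.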