
Reduce intermediary files during preprocessing

Open pmichel31415 opened this issue 7 years ago • 1 comments

It would be nice to have a way to reduce the number of intermediary files used during preprocessing by either

  1. Having an option to remove the output files for any preprocessing task (before training begins)
  2. Specifying only one input/output file pair and having all preprocessing take place in xnmt without writing intermediate results to disk

For instance, in one of my experiments I need to:

  1. filter out empty sentences
  2. lowercase
  3. tokenize
  4. filter out sentences that are too long (necessary after tokenization sometimes)

This creates 2-6 files each time ([train+dev+test]*2). For big corpora this is a bit wasteful.
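As a rough illustration of option 2, the four steps above could be chained as generators over (source, target) sentence pairs, so nothing is written to disk between steps. This is only a sketch: none of these function names exist in xnmt, and the whitespace tokenizer is a placeholder for a real one.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical types for this sketch; not part of xnmt.
Pair = Tuple[str, str]
Step = Callable[[Iterable[Pair]], Iterator[Pair]]


def drop_empty(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Drop a pair if either side is empty or whitespace only."""
    for src, trg in pairs:
        if src.strip() and trg.strip():
            yield src, trg


def lowercase(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Lowercase both sides of every pair."""
    for src, trg in pairs:
        yield src.lower(), trg.lower()


def tokenize(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Placeholder tokenizer: split on whitespace and re-join with single spaces."""
    for src, trg in pairs:
        yield " ".join(src.split()), " ".join(trg.split())


def drop_long(max_len: int) -> Step:
    """Build a step that drops pairs where either side exceeds max_len tokens."""
    def step(pairs: Iterable[Pair]) -> Iterator[Pair]:
        for src, trg in pairs:
            if len(src.split()) <= max_len and len(trg.split()) <= max_len:
                yield src, trg
    return step


def run_pipeline(pairs: Iterable[Pair], steps: List[Step]) -> Iterable[Pair]:
    """Chain the steps lazily; each pair flows through all steps in memory."""
    for step in steps:
        pairs = step(pairs)
    return pairs
```

With this shape, only the final result would need to be written out, one file per side for each of train, dev and test.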

pmichel31415 avatar Apr 25 '18 02:04 pmichel31415

I think this is a good idea. The one complication is that preprocessing can avoid re-generating files when they already exist, which saves time on multiple runs. If the files are deleted, they'll be re-generated. I wonder if there's a good design for handling this without adding too much complexity?
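One possible direction, sketched here purely as an assumption and not as anything xnmt does: keep only the final output plus a small stamp file recording a hash of the input and of the preprocessing spec. A re-run is then skipped when the stamp still matches, even though the intermediate files are gone. The `.stamp` suffix, the function names and the spec format are all hypothetical.

```python
import hashlib
import json
import os


def fingerprint(input_path: str, spec: dict) -> str:
    """Hash the preprocessing spec together with the raw input file contents."""
    h = hashlib.sha256()
    h.update(json.dumps(spec, sort_keys=True).encode("utf-8"))
    with open(input_path, "rb") as f:
        # Read in 1 MiB chunks so large corpora are not loaded into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_rerun(input_path: str, output_path: str, spec: dict) -> bool:
    """True if the final output or its stamp is missing, or the stamp is stale."""
    stamp_path = output_path + ".stamp"  # hypothetical naming convention
    if not (os.path.exists(output_path) and os.path.exists(stamp_path)):
        return True
    with open(stamp_path, encoding="utf-8") as f:
        return f.read().strip() != fingerprint(input_path, spec)


def write_stamp(input_path: str, output_path: str, spec: dict) -> None:
    """Record the fingerprint after the final output has been written."""
    with open(output_path + ".stamp", "w", encoding="utf-8") as f:
        f.write(fingerprint(input_path, spec))
```

Hashing the input still costs one pass over the corpus on every run, so whether this is cheap enough would depend on the corpus size.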

neubig avatar Jun 22 '18 11:06 neubig