xnmt
Reduce intermediary files during preprocessing
It would be nice to have a way to reduce the number of intermediary files used during preprocessing, either by:
- Having an option to remove the output files of any preprocessing task (before training begins)
- Specifying only one input/output file pair and having all preprocessing take place inside xnmt without writing intermediate results to disk (see the sketch after the example below)
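To make the first option concrete, here is a rough sketch (all names here are made up for illustration and are not existing xnmt code) of a cleanup step that deletes a preprocessing task's outputs once the final files are in place, before training starts:

```python
import os

def cleanup_intermediates(task_outputs: list, keep: set) -> None:
    # delete every preprocessing output that training does not actually read
    for path in task_outputs:
        if path not in keep and os.path.exists(path):
            os.remove(path)
```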
For instance, in one of my experiments I need to:
- filter out empty sentences
- lowercase
- tokenize
- filter out sentences that are too long (sometimes necessary after tokenization)
This creates 2-6 files each time ([train+dev+test]*2). For big corpora this is a bit wasteful.
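To illustrate the second option, here is a minimal sketch (again hypothetical, not xnmt's actual preprocessing API) of streaming these four steps through generators so that only the final file per split is written to disk:

```python
from typing import Iterable, Iterator

def filter_empty(lines: Iterable[str]) -> Iterator[str]:
    # drop empty or whitespace-only sentences
    return (l for l in lines if l.strip())

def lowercase(lines: Iterable[str]) -> Iterator[str]:
    return (l.lower() for l in lines)

def tokenize(lines: Iterable[str]) -> Iterator[str]:
    # stand-in for a real tokenizer; here it only normalizes whitespace
    return (" ".join(l.split()) for l in lines)

def filter_too_long(lines: Iterable[str], max_len: int = 80) -> Iterator[str]:
    # drop sentences longer than max_len tokens (useful after tokenization)
    return (l for l in lines if len(l.split()) <= max_len)

def preprocess(in_path: str, out_path: str) -> None:
    # stream the input through all steps; only the final result touches disk
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        lines: Iterator[str] = (l.rstrip("\n") for l in fin)
        for step in (filter_empty, lowercase, tokenize, filter_too_long):
            lines = step(lines)
        for line in lines:
            fout.write(line + "\n")

# usage with placeholder file names:
#   for split in ("train", "dev", "test"):
#       preprocess(f"{split}.raw.src", f"{split}.preproc.src")
```

Each call would read one raw file and write one preprocessed file, instead of one extra file per step and per split.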
I think this is a good idea, but the one thing that makes it complicated is that preprocessing can currently avoid re-generating files when they already exist, which saves time across multiple runs. If the files are deleted, they'll be re-generated. I wonder if there's a good design for handling this without adding too much complexity?
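One possible design (purely a sketch, not anything implemented in xnmt; all names are hypothetical) would be to cache a small fingerprint of the preprocessing spec and its inputs next to the final output, so the intermediate files can be deleted while repeated runs with an unchanged spec are still skipped:

```python
import hashlib
import json
import os

def fingerprint(spec: dict, input_paths: list) -> str:
    # hash the preprocessing spec plus the raw inputs' modification times
    h = hashlib.sha256(json.dumps(spec, sort_keys=True).encode("utf-8"))
    for path in sorted(input_paths):
        h.update(path.encode("utf-8"))
        h.update(str(os.path.getmtime(path)).encode("utf-8"))
    return h.hexdigest()

def up_to_date(spec: dict, input_paths: list, out_path: str) -> bool:
    # skip preprocessing if the final output and its stamp file still match
    stamp = out_path + ".stamp"
    if not (os.path.exists(out_path) and os.path.exists(stamp)):
        return False
    with open(stamp) as f:
        return f.read().strip() == fingerprint(spec, input_paths)

def write_stamp(spec: dict, input_paths: list, out_path: str) -> None:
    # record the fingerprint next to the final output after a successful run
    with open(out_path + ".stamp", "w") as f:
        f.write(fingerprint(spec, input_paths))
```

That would only re-run the pipeline when the spec or the raw inputs change, at the cost of one small stamp file per final output.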