transfer-learning-conv-ai
Upgrade get_dataset.tokenize() to multiprocessing
get_dataset.tokenize() on a single CPU is very slow. This pull request therefore upgrades it to multiprocessing by implementing the target function worker_tokenize(args_list). Additionally, a multiprocessing debug logger mp_logger was added, together with logger.debug() and mp_logger.debug() messages, to track progress in the Python console.
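A minimal sketch of the idea, assuming a trivial whitespace tokenizer as a stand-in for the real model tokenizer, and a hypothetical tokenize_parallel() driver (names other than worker_tokenize and mp_logger are illustrative, not from the actual PR):

```python
import multiprocessing as mp

def simple_tokenize(text):
    # stand-in for the real model tokenizer (hypothetical)
    return text.lower().split()

def worker_tokenize(args):
    # multiprocessing target: tokenize one chunk and log progress
    worker_id, texts = args
    mp_logger = mp.get_logger()
    tokenized = [simple_tokenize(t) for t in texts]
    mp_logger.debug("worker %d tokenized %d texts", worker_id, len(tokenized))
    return tokenized

def tokenize_parallel(texts, n_procs=4):
    # split the dataset round-robin into one chunk per process
    chunks = [(i, texts[i::n_procs]) for i in range(n_procs)]
    with mp.Pool(n_procs) as pool:
        parts = pool.map(worker_tokenize, chunks)
    # re-interleave the chunks to restore the original order
    out = [None] * len(texts)
    for i, part in enumerate(parts):
        for j, toks in enumerate(part):
            out[i + j * n_procs] = toks
    return out
```

The round-robin split keeps the per-worker chunks roughly equal in size; the final loop undoes the interleaving so results line up with the input.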
Looks nice, thanks!
The question would be whether the multiprocessing module should be added to requirements.txt?
@thomwolf , please could we get this merged? Thank you.
@thomwolf, before merging: I did some work on parallelizing the complete preprocessing chain, which affects quite some code in train.py and utils.py. I could clean up the code and create a new pull request with e.g. two new files, utils_multiprocessing.py and train_multiprocessing.py. This way merging would become very easy, and backward compatibility is guaranteed for everybody. Just let me know if you are interested in merging such a speedup :fast_forward: :dash: