transfer-learning-conv-ai

Upgrade get_dataset.tokenize() to multiprocessing

Open · DrStoop opened this issue on Aug 20 '19 · 4 comments

get_dataset.tokenize() on a single CPU is very slow, so this pull request upgrades it to multiprocessing by implementing the multiprocessing target function worker_tokenize(args_list). Additionally, a multiprocessing debug logger mp_logger was added, together with logger.debug() and mp_logger.debug() messages, to track progress in the Python console.
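
For reference, a minimal sketch of what such a worker-based tokenization could look like. Only the names worker_tokenize(args_list) and mp_logger come from the PR description; the chunking helper tokenize_parallel, its parameters, and the per-string tokenizer calls (assuming a tokenizer exposing tokenize() and convert_tokens_to_ids(), as pytorch-pretrained-bert's tokenizers do) are illustrative assumptions, not the exact code from the PR:

```python
import logging
from multiprocessing import Pool, cpu_count, log_to_stderr

# Multiprocessing-aware debug logger, as described in the PR: log records
# emitted inside worker processes go to stderr with the process name attached.
mp_logger = log_to_stderr()
mp_logger.setLevel(logging.DEBUG)

def worker_tokenize(args_list):
    """Target function run in each worker: tokenize one chunk of strings."""
    tokenizer, texts = args_list
    mp_logger.debug("tokenizing a chunk of %d strings", len(texts))
    # Hypothetical per-string tokenization; swap in whatever call the
    # single-process get_dataset.tokenize() actually performs.
    return [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(t)) for t in texts]

def tokenize_parallel(tokenizer, texts, num_workers=None):
    """Split `texts` into chunks and tokenize them across worker processes."""
    num_workers = num_workers or cpu_count()
    chunk_size = max(1, len(texts) // num_workers)
    # Each work item carries the tokenizer, so it must be picklable.
    chunks = [(tokenizer, texts[i:i + chunk_size])
              for i in range(0, len(texts), chunk_size)]
    with Pool(num_workers) as pool:
        results = pool.map(worker_tokenize, chunks)
    # Flatten the per-chunk results back into one flat list of id sequences.
    return [ids for chunk in results for ids in chunk]
```

Note that on platforms using the spawn start method (e.g. Windows, macOS with recent Python), the Pool must be created under an `if __name__ == "__main__":` guard.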

DrStoop avatar Aug 20 '19 03:08 DrStoop

Looks nice, thanks!

thomwolf avatar Aug 20 '19 10:08 thomwolf

One remaining question: should the multiprocessing module be added to requirements.txt?

DrStoop avatar Aug 20 '19 14:08 DrStoop

@thomwolf , please could we get this merged? Thank you.

martinritchie avatar Sep 18 '19 14:09 martinritchie

@thomwolf, before merging: I did some work on parallelizing the complete preprocessing chain, which affects quite a lot of code in `train.py` and `utils.py`. I could clean up the code and create a new pull request with, e.g., two new files, `utils_multiprocessing.py` and `train_multiprocessing.py`. That way merging would be very easy and backward compatibility would be guaranteed for everybody. Just let me know if you are interested in merging such a speedup :fast_forward: :dash:

DrStoop avatar Sep 18 '19 22:09 DrStoop