fairseq
Optimize MuST-C preprocessing script
Before submitting
- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [X] Did you read the contributor guideline?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?
What does this PR do?
I found that the MuST-C preprocessing script is really slow when run with --use-audio-input. Many of its steps are massive for-loops without any kind of parallelization, so I improved the script at several points. Some of these changes also indirectly speed up other speech-to-text preprocessing scripts.
Here, I list the improvements:
YAML Loader
I changed the loader from BaseLoader to CBaseLoader, which reduces the YAML loading time from about 1 minute 30 seconds to less than 10 seconds.
This improvement also applies to the MuST-C preprocessing without --use-audio-input.
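As a minimal sketch of the loader swap (not the exact code in this PR), the snippet below picks CBaseLoader when PyYAML was built against libyaml and falls back to BaseLoader otherwise. The helper name load_segments and the sample segment fields are illustrative assumptions.

```python
import yaml

# CBaseLoader only exists when PyYAML was built with libyaml bindings,
# so fall back to the pure-Python BaseLoader if it is missing.
Loader = getattr(yaml, "CBaseLoader", yaml.BaseLoader)


def load_segments(text):
    """Hypothetical helper: parse a YAML segment list with the fast loader."""
    return yaml.load(text, Loader=Loader)


# Illustrative input; both loaders keep every scalar as a string.
segments = load_segments("- {wav: ted_1.wav, offset: 0.5, duration: 2.0}\n")
```

Because both loaders leave scalars as strings, numeric fields still need an explicit float() conversion downstream, exactly as with BaseLoader.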
Audio file conversion + saving to FLAC
This is by far the longest step in the script. After 1 hour, it had converted only 5% of the files in the en-de train split. After parallelizing it across 16 CPUs, the whole en-de split was converted in 2 hours.
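The general pattern can be sketched as follows, assuming each file can be converted independently of the others. The names convert_one and convert_all are hypothetical, and the worker only copies bytes as a placeholder for the real decode and FLAC encode step.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def convert_one(job):
    """Hypothetical worker: stands in for converting one audio file."""
    src, dst_dir = job
    dst = Path(dst_dir) / (Path(src).stem + ".flac")
    # Placeholder only: a real worker would decode the audio and encode FLAC.
    dst.write_bytes(Path(src).read_bytes())
    return str(dst)


def convert_all(sources, dst_dir, num_workers=16,
                executor_cls=ProcessPoolExecutor):
    """Fan the per-file jobs out over a pool instead of a single for-loop."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    jobs = [(str(src), str(dst_dir)) for src in sources]
    with executor_cls(max_workers=num_workers) as pool:
        # map() returns results in input order.
        return list(pool.map(convert_one, jobs, chunksize=32))
```

The worker is a module-level function so that it can be pickled and sent to worker processes. When calling convert_all from a script, do so under an `if __name__ == "__main__":` guard so that platforms that spawn workers do not re-run the script body.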
Zip file creation
Zipping the converted audio files also takes a long time with the current code (>20 minutes). After parallelizing it, using 16 CPUs, it can be done in around 3 minutes.
This improves the create_zip function from data_utils.py, and hence it also speeds up all the preprocessing scripts that use it.
Zip manifest
The current get_zip_manifest function takes around 8 minutes to execute. After parallelizing it, using 16 CPUs, it runs in less than 1 minute.
This improves the get_zip_manifest function from data_utils.py, and hence it also speeds up all the preprocessing scripts that use it.
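To illustrate why this step parallelizes well, here is a rough sketch that locates each zip entry independently and collects the results into a manifest. It is not the code from data_utils.py: the function names, the "path:offset:size" manifest format, and the use of a thread pool (rather than a process pool) are assumptions made for the example.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def _locate_entry(zip_path, info):
    """Return (utterance id, 'zip_path:offset:size') for one zip entry."""
    with open(zip_path, "rb") as f:
        # Local file header: 30 fixed bytes, with the name length at
        # byte 26 and the extra-field length at byte 28.
        f.seek(info.header_offset + 26)
        name_len = int.from_bytes(f.read(2), "little")
        extra_len = int.from_bytes(f.read(2), "little")
    offset = info.header_offset + 30 + name_len + extra_len
    return Path(info.filename).stem, f"{zip_path}:{offset}:{info.file_size}"


def build_zip_manifest(zip_path, num_workers=16):
    """Hypothetical parallel manifest builder for an uncompressed zip."""
    with zipfile.ZipFile(zip_path) as zf:
        infos = zf.infolist()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        pairs = pool.map(lambda info: _locate_entry(zip_path, info), infos)
        return dict(pairs)
```

Each entry is handled without shared state, so the same structure carries over to a process pool if per-entry validation (for example, decoding the audio) turns out to be CPU-bound. The offsets point directly at the audio bytes only when the archive is written without compression (ZIP_STORED).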
TSV manifest generation
Parallelizing this process across 16 CPUs reduces the execution time from around 7 minutes to 1-2 minutes.
This improvement also applies to the MuST-C preprocessing without --use-audio-input.