Optimize MuST-C preprocessing script

Open gegallego opened this issue 2 years ago • 0 comments

Before submitting

[ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
[X] Did you read the contributor guideline?
[ ] Did you make sure to update the docs?
[ ] Did you write any new necessary tests?

What does this PR do?

I found that the MuST-C preprocessing script is really slow when run with --use-audio-input. Several steps run as a single large for-loop with no parallelization, so I optimized the script at several points. Some of these changes also indirectly speed up other speech-to-text preprocessing scripts.

Here, I list the improvements:

YAML Loader

I changed the loader from BaseLoader to CBaseLoader, which reduces the YAML loading time from 1:30 minutes to less than 10 seconds.

This improvement also applies to MuST-C preprocessing without --use-audio-input.
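A minimal sketch of the loader swap, assuming PyYAML is installed. CBaseLoader is only available when PyYAML was built against libyaml, so the sketch falls back to BaseLoader otherwise. The helper name load_segments and the sample entry are illustrative, not the script's actual code.

```python
import yaml

# Prefer the libyaml-backed loader; fall back to the pure-Python one
# when PyYAML was installed without the C extension.
try:
    from yaml import CBaseLoader as Loader
except ImportError:
    from yaml import BaseLoader as Loader


def load_segments(text):
    """Parse a YAML segment list (hypothetical helper for illustration)."""
    return yaml.load(text, Loader=Loader)


# Illustrative entry; both loaders keep every scalar as a string.
sample = "- {duration: 3.5, offset: 16.08, speaker_id: spk.1, wav: ted_1.wav}\n"
segments = load_segments(sample)
```

Both loaders return the same data, so the swap should not require changes to the code that consumes the segments.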

Audio file conversion + saving to FLAC

This is by far the longest step in the script. After 1 hour, it had converted only 5% of the files in the en-de train split. After parallelizing it across 16 CPUs, I converted the whole en-de split in 2 hours.
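The general pattern can be sketched as follows. This is not the code in this PR: convert_one is a hypothetical worker that only writes a placeholder file, where the real step would load the waveform segment and encode it as FLAC. The names convert_all and num_workers and the chunk size are likewise assumptions.

```python
import multiprocessing as mp
from pathlib import Path


def convert_one(task):
    """Hypothetical worker: handle one utterance, return the output name."""
    utt_id, out_dir = task
    out_path = Path(out_dir) / f"{utt_id}.flac"
    # Placeholder for the real conversion and FLAC encoding.
    out_path.write_bytes(b"")
    return out_path.name


def convert_all(utt_ids, out_dir, num_workers=1):
    """Convert every utterance, optionally spread over worker processes."""
    tasks = [(utt_id, str(out_dir)) for utt_id in utt_ids]
    if num_workers <= 1:
        # Serial fallback, equivalent to the original for-loop.
        return [convert_one(task) for task in tasks]
    # Utterances are independent, so completion order does not matter.
    with mp.Pool(num_workers) as pool:
        return list(pool.imap_unordered(convert_one, tasks, chunksize=64))
```

On platforms that start workers with spawn, the call to convert_all with num_workers > 1 has to sit under an `if __name__ == "__main__":` guard.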

Zip file creation

Zipping the converted audio files also takes a long time with the current code (more than 20 minutes). After parallelizing it across 16 CPUs, it takes around 3 minutes.

This change is made in create_zip in data_utils.py, so it also speeds up every preprocessing script that uses it.
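One possible way to parallelize this step is sketched below; it is not necessarily how this PR does it. A ZipFile object should not be written from several workers at once, so the sketch reads the files in a thread pool and appends them to the archive from a single thread. The function name create_zip_parallel and its parameters are assumptions.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_file(path):
    """Read one file and return its archive name and contents."""
    return path.name, path.read_bytes()


def create_zip_parallel(data_root, zip_path, num_workers=4):
    """Store every *.flac file under data_root in an uncompressed zip."""
    paths = sorted(Path(data_root).glob("*.flac"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            # map() yields results in input order, so the archive layout
            # is deterministic even though the reads overlap.
            for name, data in pool.map(read_file, paths):
                zf.writestr(name, data)
    return len(paths)
```

Because map() submits all reads up front, file contents can accumulate in memory while the writer catches up; for a large dataset the paths would need to be processed in batches.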

Zip manifest

The current get_zip_manifest function takes around 8 minutes to execute. After parallelizing it across 16 CPUs, it runs in less than 1 minute.

This change is made in get_zip_manifest in data_utils.py, so it also speeds up every preprocessing script that uses it.
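A rough sketch of a parallel manifest pass, under several assumptions: entries are stored uncompressed without an extra header field, each manifest value has the form path:offset:size, and the per-entry work is replaced by a stand-in that returns the byte count (the real step would decode the audio to get the number of frames). The sketch uses threads for simplicity; for CPU-heavy decoding a process pool would be the natural substitute. The names zip_manifest and entry_length are illustrative.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

LOCAL_HEADER_SIZE = 30  # fixed part of a ZIP local file header, in bytes


def entry_length(job):
    """Read one stored entry directly from the zip file by byte offset."""
    zip_path, offset, size = job
    with open(zip_path, "rb") as f:
        f.seek(offset)
        data = f.read(size)
    # Stand-in: the real step would return the audio frame count.
    return len(data)


def zip_manifest(zip_path, num_workers=4):
    """Return ({utt_id: "path:offset:size"}, {utt_id: length})."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        infos = zf.infolist()
    utt_ids, jobs, paths = [], [], {}
    for info in infos:
        utt_id = Path(info.filename).stem
        # Data starts right after the local header and the file name.
        offset = (info.header_offset + LOCAL_HEADER_SIZE
                  + len(info.filename.encode("utf-8")))
        paths[utt_id] = f"{zip_path.as_posix()}:{offset}:{info.file_size}"
        utt_ids.append(utt_id)
        jobs.append((zip_path, offset, info.file_size))
    # Each entry is read independently, so the reads can overlap.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        lengths = dict(zip(utt_ids, pool.map(entry_length, jobs)))
    return paths, lengths
```

Each worker opens its own file handle, so no seek position is shared between concurrent reads.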

TSV manifest generation

Parallelizing this step across 16 CPUs reduces the execution time from around 7 minutes to 1-2 minutes.

This improvement also applies to MuST-C preprocessing without --use-audio-input.

Oct 11 '22 22:10 gegallego
