
prepare.py slow speed

Open danpovey opened this issue 4 years ago • 12 comments

I have been running some experiments to debug why prepare.py is so slow. I'll post the results here in a bit, but let me just give you a preview of why I believe it's so slow. By default, torch seems to use a lot of threads, but it does so in a very inefficient way which actually just wastes CPU. We attempt to get around this with:

  torch.set_num_threads(1)
  torch.set_num_interop_threads(1)

but now that we are using "spawn" as the parallelization method and the call to main() is guarded by:

if __name__ == '__main__':
    main()

that code is never reached in the worker processes: with "spawn", each worker re-imports the module but does not execute the guarded main(), so the thread-limiting calls only take effect in the parent.
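
A minimal, self-contained sketch of the difference (not the actual prepare.py; the worker function and job count are just placeholders): with "spawn", module-level code runs in every worker when the module is re-imported, while anything under the __main__ guard only runs in the parent.

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import torch

# Module level: executed in the parent AND in every spawned worker on import,
# so each worker is limited to a single intra-op thread.
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

def work(i):
    # Each worker reports how many intra-op threads torch will use.
    return torch.get_num_threads()

def main():
    # Under "spawn", nothing in here runs in the workers, so thread settings
    # placed here would never reach them.
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(4, mp_context=ctx) as ex:
        print(list(ex.map(work, range(4))))  # expected: [1, 1, 1, 1]

if __name__ == '__main__':
    main()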

danpovey avatar Nov 16 '20 07:11 danpovey

Here are my notes from investigating the time to prepare dev-clean, including writing the .llc feature files. The script prepare.py was modified to skip the prepare_librispeech step and just load the relevant manifest from disk, so the reported time is only the time spent in cut_set.compute_and_store_features().

Below I show various setups. TL;DR: the following statements:

torch.set_num_threads(1)
torch.set_num_interop_threads(1)

need to be moved to OUTSIDE OF MAIN() for a 100-fold speed improvement. This should be done regardless of the num-jobs or execution setup.

with executor = None; torch.set_num_threads(1); torch.set_num_interop_threads(1):
       real 1m17, user 0m51, sys 0m22

with executor = None and no torch.setXXX, which means it uses about 25 CPUs:
        real 1m48, user 45m45, sys 1m6

with executor = ProcessPoolExecutor(num_jobs, mp_context=multiprocessing.get_context("spawn")),
  [inside main(): torch.set_num_threads(1); torch.set_num_interop_threads(1)]; num_jobs = 1:
      The single job uses about 25 CPUs:
      real 1m35, user 39m30, sys 1m16.

with executor = ProcessPoolExecutor(num_jobs, mp_context=multiprocessing.get_context("spawn")),
  [inside main(): torch.set_num_threads(10); torch.set_num_interop_threads(1)]; num_jobs = 10:
    Each worker process uses about 500% CPU.
    real    10m16.768s, user   299m8.986s, sys   2m26.761s

with executor = ProcessPoolExecutor(num_jobs, mp_context=multiprocessing.get_context("spawn")),
  [outside main(): torch.set_num_threads(10); torch.set_num_interop_threads(1)]; num_jobs = 10:
    Each worker process uses about 500% CPU.
    real    0m13.860s, user   1m8.146s, sys   0m44.639s
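
For reference, a hedged sketch of the kind of harness behind these timings; the manifest path, storage path, and the exact compute_and_store_features() keyword arguments are assumptions and may differ between lhotse versions:

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import torch
from lhotse import CutSet, Fbank  # imports assumed for this sketch

# Per the TL;DR above: keep these at module level so spawned workers inherit them.
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

def main():
    num_jobs = 10
    # Load a previously written manifest instead of re-running prepare_librispeech.
    cut_set = CutSet.from_json('exp/data/cuts_dev-clean.json.gz')  # hypothetical path
    executor = ProcessPoolExecutor(
        num_jobs, mp_context=multiprocessing.get_context("spawn"))
    cut_set.compute_and_store_features(
        extractor=Fbank(),
        storage_path='exp/data/feats_dev-clean',  # argument name may differ by version
        executor=executor,
    )

if __name__ == '__main__':
    main()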

danpovey avatar Nov 16 '20 07:11 danpovey

Also notice there is about a factor of 300 in speed between the last 2 runs, if you look at the user time taken (i.e. total across CPUs).

danpovey avatar Nov 16 '20 07:11 danpovey

Let me try it following your suggestion and make a PR to fix it.

jimbozhang avatar Nov 16 '20 08:11 jimbozhang

Yes, it is much faster! 👍

Using 48 cores on ('dev-clean', 'test-clean'):

real    0m47.141s
user    3m11.562s
sys     1m39.085s

Using 72 cores on ('dev-clean', 'test-clean', 'train-clean-100'):

real    6m20.349s
user    32m31.692s
sys     7m57.317s

It is slower than Dan's result; I think that is caused by NFS latency on my home directory.

jimbozhang avatar Nov 16 '20 08:11 jimbozhang

🎉

pzelasko avatar Nov 16 '20 14:11 pzelasko

I have disabled OpenMP support for libsox on the torchaudio side: https://github.com/pytorch/audio/issues/1026. It will be available through the nightly binary distribution starting tomorrow. (Please check pytorch.org for instructions on installing the nightly build.)

This should remove the suspected cause of the crashes (the GNU and Intel OpenMP runtimes co-existing and interfering), and will let you go back to the fork method without worrying about the OMP_NUM_THREADS configuration (or torch.set_num_threads).

We have a minor release scheduled for early December, so we would appreciate it if you could test this and confirm whether or not it resolves the issue. https://github.com/pytorch/audio/issues/1021
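
(For context, a hedged illustration of the workaround that this change should make unnecessary: capping OpenMP threading via an environment variable before torch, and libsox via torchaudio, are imported. This is an illustration only, not part of prepare.py.)

import os

os.environ["OMP_NUM_THREADS"] = "1"  # must be set before torch is imported

import torch

# Belt and braces: also cap torch's own thread pools.
torch.set_num_threads(1)
torch.set_num_interop_threads(1)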

mthrok avatar Nov 20 '20 17:11 mthrok

@jimbozhang would you mind testing this? (I think what is required is verifying that the speed is still OK after you change "spawn" back to "fork" and remove the statements setting torch's num_threads.)

danpovey avatar Nov 21 '20 03:11 danpovey

> @jimbozhang would you mind testing this? (I think what is required is verifying that the speed is still OK after you change "spawn" back to "fork" and remove the statements setting torch's num_threads.)

OK, I'll work on this (and the k2 NVTX stuff) next Monday. @danpovey

jimbozhang avatar Nov 21 '20 04:11 jimbozhang

With the latest versions of k2 + lhotse + snowfall, under the path snowfall/egs/librispeech/asr/simple_v1:

$ time python prepare.py
dev-clean
Extracting and storing features in parallel: 100%|_________| 2703/2703 [00:07<00:00, 364.28it/s]

test-clean
Extracting and storing features in parallel: 100%|_________| 2620/2620 [00:07<00:00, 342.19it/s]

train-clean-100
Extracting and storing features in parallel: 100%|_________| 28539/28539 [01:26<00:00, 331.35it/s]

real    3m41.975s
user    33m55.149s
sys     5m25.221s

I didn't modify the job-number setting, so num_jobs = min(15, os.cpu_count()) should evaluate to 15.

jimbozhang avatar Nov 24 '20 06:11 jimbozhang

I think to test what he's asking about, you need to install the latest version of TorchAudio (available through the nightly builds)... at least I think that is what he was saying. You might need to install PyTorch 1.7.0 in order to do that. Perhaps you should talk to Haowen about this tomorrow. On our test machine we have CUDA 10.1 and you are probably using PyTorch 1.6.0, so you'll have to see whether there is a nightly build compatible with that setup.

Also you'd need to change the snowfall prepare.py code to use "fork" instead of "spawn", and probably remove the two statements setting torch's num-threads to 1, to verify that it still works the same after those changes.
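
Something like the following sketch of the requested change (not a verified patch against prepare.py; the num_jobs line is copied from the comment above):

import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor

num_jobs = min(15, os.cpu_count())

# Previously:
#   torch.set_num_threads(1)
#   torch.set_num_interop_threads(1)
#   executor = ProcessPoolExecutor(
#       num_jobs, mp_context=multiprocessing.get_context("spawn"))

# After the change: no torch thread settings, and "fork" instead of "spawn".
executor = ProcessPoolExecutor(
    num_jobs, mp_context=multiprocessing.get_context("fork"))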


danpovey avatar Nov 24 '20 15:11 danpovey

Ah, apologies, I misunderstood. I didn't notice mthrok's comment at all :sweat: and simply tested the latest snowfall. Sorry about that ...

I'll try torchaudio with the fork method tomorrow.

jimbozhang avatar Nov 24 '20 17:11 jimbozhang

I tested it. The snowfall/egs/librispeech/asr/simple_v1/prepare.py worked fine when I just replaced "spawn" with "fork", with torchaudio (0.8.0.dev20201125) + torch (1.8.0.dev20201126+cu101).

But after I removed the torch.set_num_threads settings, it became extremely slow again, the same as before. @mthrok @danpovey

jimbozhang avatar Nov 26 '20 13:11 jimbozhang