Allow user-managed Pool in Wav2Vec2ProcessorWithLM.batch_decode
What does this PR do?
There are two issues being addressed:
- Always creating a fresh pool within `Wav2Vec2ProcessorWithLM.batch_decode` generates a big overhead if it's called multiple times (this PR fixes #17879)
- `pyctcdecode` can't use `spawn` Pools (this PR supersedes #17070)
Changes:
- adds a `pool` argument to `Wav2Vec2ProcessorWithLM.batch_decode`. This allows a user-managed `multiprocessing.Pool` to be (re)used across multiple calls to `batch_decode`.
- updates the `pyctcdecode` version requirement. The new version contains code to handle invalid pools.
- adds some tests for TF's, torch's and flax's `Wav2Vec2ProcessorWithLM`.
- adds a usage example to `Wav2Vec2ProcessorWithLM.batch_decode`'s docs.
An important implementation reference is multiprocessing's Contexts and start methods. Basically, `batch_decode`'s multiprocessing capabilities are only useful on Unix, which uses `fork` contexts. This PR introduces some checks in this regard. They can be removed once https://github.com/kensho-technologies/pyctcdecode/issues/65 is resolved.
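As a rough illustration of the intended usage (a sketch, not the PR's exact docs example; the checkpoint and dataset names are just commonly used test resources), a user-managed `fork`-context pool is created once and reused across `batch_decode` calls:

```python
import multiprocessing

import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

# Checkpoint name is illustrative; any Wav2Vec2 checkpoint that ships with an LM works the same way.
checkpoint = "patrickvonplaten/wav2vec2-base-100h-with-lm"
processor = Wav2Vec2ProcessorWithLM.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# pyctcdecode currently only works with fork-based pools, so request that context explicitly
# and create the pool once, outside the decoding loop.
with multiprocessing.get_context("fork").Pool(processes=2) as pool:
    for sample in dataset:
        inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        # Reusing the same pool avoids paying the pool-creation overhead on every call.
        transcription = processor.batch_decode(logits.numpy(), pool=pool).text
```

Creating the pool outside the loop is what removes the per-call overhead; the `fork` context matches the platform restriction mentioned above.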
Breaking change
The new `pool` argument can break currently valid code like `processor.batch_decode(logits, 5)`. Previously, the second positional argument meant `num_processes`. If that's an issue, some considerations are:
- `pool` and `num_processes` are mutually exclusive, but a unique arg like `num_processes_or_pool` seemed weird
- we could add `pool` as the last argument
- we could force kwargs-only
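For reference, a call that stays valid under any of these options passes the argument by keyword (sketch only; `processor`, `logits`, and `my_pool` are assumed from the surrounding context):

```python
# Keyword arguments sidestep the positional ambiguity under both the old and the new signature.
out = processor.batch_decode(logits, num_processes=5)  # let batch_decode create a temporary pool
out = processor.batch_decode(logits, pool=my_pool)     # reuse a user-managed pool (new in this PR)
```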
Checklist
I couldn't install all deps, so I didn't run some tests and didn't build the documentation changes. Let's see what CI says about:
- [x] ~Test `test_decoder_batch_1` from `test_processor_wav2vec2_with_lm.py` (it uses `fork`, but it fails on my Mac, probably because of the OS platform behavior)~ Fixed in 17efdddbeeac75eeacddff3bcc51ced70ec19217
- [ ] All other tests
- [ ] `Wav2Vec2ProcessorWithLM.batch_decode`'s new Tips
- [ ] `Wav2Vec2ProcessorWithLM.batch_decode`'s new usage example
Also, the new tests are a copy-paste of previous tests. I couldn't figure out how to factor out the duplicated code (I'm more used to pytest's fixtures). Ideas to cut down the duplication are welcome.
After merge
Once merged, it would be nice to adapt and re-run some scripts, such as the evaluation of patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram. In that case, for example, there are two improvements:
- Current evaluation creates a fresh Pool for each inferred audio, yielding a huge overhead compared to sequential decoding or to a user-managed pool as introduced in this PR.
- Current evaluation uses `Wav2Vec2ProcessorWithLM.batch_decode` without `batched=True`, which probably means that there's no advantage in using parallel batch decoding. The usage example from this PR allows proper parallel batch decoding, as sketched below.
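A hedged sketch of how such an evaluation could be adapted (the `model`, `processor`, and `dataset` objects and the column names are assumed to come from the evaluation script, not defined here):

```python
import multiprocessing

import torch

def map_to_pred(batch, pool):
    inputs = processor(
        [audio["array"] for audio in batch["audio"]],
        sampling_rate=16_000,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Decoding the whole batch in a single call is what lets the pool parallelize across samples.
    batch["transcription"] = processor.batch_decode(logits.cpu().numpy(), pool=pool).text
    return batch

# One fork-context pool, created once and shared across all mapped batches.
with multiprocessing.get_context("fork").Pool(processes=4) as pool:
    result = dataset.map(map_to_pred, batched=True, batch_size=8, fn_kwargs={"pool": pool})
```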
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? See:
- #17879
- #17070.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
@patrickvonplaten
@anton-l are you looking into this? I'm not super familiar with the pyctcdecode backend, but can get up to speed and take a look for @falcaopetri if need be :)
Hi @falcaopetri! Looks like test_decoder_batch_1 indeed fails on Linux too, maybe you could take a look and find a workaround?
Hi everyone. The test setup was wrong, but the new commit's diff should be self-explanatory. Let me know if there's still something missing.
You can also check this colab: a 34s -> 5s speed-up just from removing the overhead, even without using more than one process; with multiple batches to decode, the improvement should be roughly linear.
Hi @anton-l, @sanchit-gandhi. I've merged main trying to fix the Add model like runner CI step, but with no success. Could you take a look at that, and at the overall PR, please?
This PR looks very nice to me! Let's try to get it merged
@falcaopetri let's see what the tests say and then IMO we can merge if they are all green
Thanks for re-opening it @patrickvonplaten! I've just fixed a code quality issue that made check_code_quality fail. But now CircleCI seems to be having some (internal?) problem.
And should I rebase everything to get a nice and clean history, or is that not necessary?
@patrickvonplaten, I've rebased everything so we (i) get a cleaner merge and (ii) re-trigger CircleCI. Unfortunately, CircleCI setup is still failing.
It seems there is an issue with your CircleCI permissions; the tests won't run. Could you try refreshing your permissions as shown here?
Oh I see, I didn't realize CircleCI was talking about my own user's credentials. Thanks for the heads up. I've refreshed my credentials and force-pushed again to trigger the CI steps.
For future reference, I did make a mistake in the process: after refreshing credentials, CircleCI prompted me to set up a project within my own organization, which then made the run_test* steps fail with "Resource class docker for xlarge is not available for your project, or is not a valid resource class. This message will often appear if the pricing plan for this project does not support docker use." Deleting my own project and force pushing made the huggingface org run the steps.
I just realized that my usage Example in `Wav2Vec2ProcessorWithLM.batch_decode` triggers:
WARNING:datasets.fingerprint:Parameter 'fn_kwargs'={'pool': <multiprocessing.pool.Pool object at 0x7f64ecc54990>} of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
@patrickvonplaten, is there a way to indicate that a map arg shouldn't be taken into account when hashing the transformation? Or maybe pool could be a global variable?
Uff yeah good question. Gently pinging @lhoestq and/or @mariosasko here as this seems to be datasets related
On the other hand, I don't think it's a must that results are cached with datasets here - I think we can merge this PR without it.
> @patrickvonplaten, is there a way to indicate that a map arg shouldn't be taken into account when hashing the transformation? Or maybe pool could be a global variable?
You can pass `new_fingerprint=` in `map` with a unique fingerprint (string) that depends on the parameters of your transform.
Alternatively, you can just disable caching in datasets to remove the warning.
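Roughly, the two options could look like this (the fingerprint string and surrounding names are illustrative, reusing `map_to_pred` and `pool` from the earlier sketch):

```python
import datasets

# Alternative 1: supply an explicit fingerprint so the unpicklable pool never needs to be hashed.
result = dataset.map(
    map_to_pred,
    batched=True,
    batch_size=8,
    fn_kwargs={"pool": pool},
    new_fingerprint="wav2vec2-with-lm-pool-decode-v1",  # any string uniquely identifying this transform
)

# Alternative 2: disable caching entirely before calling map(); this also removes the warning.
# datasets.disable_caching()
```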
Thanks again for all your work on this!
Hi @falcaopetri! Thank you for adding this! After merging to the main branch, we have 3 test failures (when running on GPU):
tests/models/wav2vec2/test_modeling_tf_wav2vec2.py::TFWav2Vec2ModelIntegrationTest::test_wav2vec2_with_lm_invalid_pool
(line 243) RuntimeError: context has already been set
tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelIntegrationTest::test_wav2vec2_with_lm_invalid_pool
(line 243) RuntimeError: context has already been set
tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelIntegrationTest::test_wav2vec2_with_lm_pool
(line 1639) TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
More detailed information can be found here and in its raw logs.
Would you like to take a look here 🙏? Thank you 🤗
Hi @ydshieh. I'm really sorry about that. Both errors are my fault while setting up the tests.
`RuntimeError: context has already been set` emerged only during GPU tests, probably because tests are executed differently: a PR's pytest is invoked with -n 8, while your tests here aren't. I.e., the RuntimeError was indeed an issue, but it was hidden during the PR because tests were being executed in different processes.
`TypeError: can't convert cuda:0 device type tensor to numpy` was caused by a strange change I made after copying and pasting other tests.
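For context, a minimal standalone sketch (not the actual test code) of the stdlib behavior behind the first error; the second error is just the usual need to move a CUDA tensor to the CPU (`logits.cpu().numpy()`) before converting it to numpy:

```python
import multiprocessing

def demo():
    # set_start_method mutates process-global state and raises
    # "RuntimeError: context has already been set" the second time it is called in the
    # same process, which is what happens when several tests share one pytest worker
    # (-n 1) instead of each running in its own worker (-n 8).
    try:
        multiprocessing.set_start_method("spawn")
        multiprocessing.set_start_method("spawn")  # second call raises RuntimeError
    except RuntimeError as err:
        print(err)  # context has already been set

    # A local context avoids the global state and can be used repeatedly:
    with multiprocessing.get_context("spawn").Pool(1) as pool:
        print(pool.map(abs, [-1, -2]))

if __name__ == "__main__":
    demo()
```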
This should fix both issues: https://github.com/huggingface/transformers/compare/main...falcaopetri:transformers:fix-w2v2-lm-pool. What is your workflow in this case? Should I open a new PR?
Thank you so much @falcaopetri. Yes, it would be nice if you can open a PR. Otherwise, we can do it ourselves by looking at your branch above ❤️.
[Context of our CI] Our CircleCI tests run on CPU, so we use -n 8. After merging into main, we run (more) tests on GPU machines, so we set -n 1 to avoid multiple processes accessing the GPU at the same time.