Allow user-managed Pool in Wav2Vec2ProcessorWithLM.batch_decode
What does this PR do?
There are two issues being addressed:
- Always creating a fresh pool within `Wav2Vec2ProcessorWithLM.batch_decode` generates a big overhead if it's called multiple times (this PR fixes #17879)
- `pyctcdecode` can't use `spawn` Pools (this PR supersedes #17070)
Changes:
- adds a `pool` argument to `Wav2Vec2ProcessorWithLM.batch_decode`. This allows a user-managed `multiprocessing.Pool` to be (re)used across multiple calls to `batch_decode`.
- updates the `pyctcdecode` version requirement. The new version contains code to handle invalid pools.
- adds some tests for TF's, torch's and flax's `Wav2Vec2ProcessorWithLM`.
- adds a usage example to `Wav2Vec2ProcessorWithLM.batch_decode`'s docs.
An important implementation reference is multiprocessing's Contexts and start methods. Basically, `batch_decode`'s multiprocessing capabilities are only useful on Unix, which uses `fork` contexts. This PR introduces some checks in this regard. They can be removed once https://github.com/kensho-technologies/pyctcdecode/issues/65 is resolved.
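As a rough illustration of the intended usage (a sketch, not the PR's exact docs example; the checkpoint and dataset names are just commonly used test resources), a user-managed `fork`-context pool is created once and reused across `batch_decode` calls:

```python
import multiprocessing

import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

# Checkpoint name is illustrative; any Wav2Vec2 checkpoint that ships with an LM works the same way.
checkpoint = "patrickvonplaten/wav2vec2-base-100h-with-lm"
processor = Wav2Vec2ProcessorWithLM.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# pyctcdecode currently only works with fork-based pools, so request that context explicitly
# and create the pool once, outside the decoding loop.
with multiprocessing.get_context("fork").Pool(processes=2) as pool:
    for sample in dataset:
        inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        # Reusing the same pool avoids paying the pool-creation overhead on every call.
        transcription = processor.batch_decode(logits.numpy(), pool=pool).text
```

Creating the pool outside the loop is what removes the per-call overhead; the `fork` context matches the platform restriction mentioned above.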
Breaking change
The new `pool` argument can break currently valid code like `processor.batch_decode(logits, 5)`. Previously, the second positional argument meant `num_processes`. If that's an issue, some considerations are:
- `pool` and `num_processes` are mutually exclusive, but a unique arg like `num_processes_or_pool` seemed weird
- we could add `pool` as the last argument
- we could force kwargs-only
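For reference, a call that stays valid under any of these options passes the argument by keyword (sketch only; `processor`, `logits`, and `my_pool` are assumed from the surrounding context):

```python
# Keyword arguments sidestep the positional ambiguity under both the old and the new signature.
out = processor.batch_decode(logits, num_processes=5)  # let batch_decode create a temporary pool
out = processor.batch_decode(logits, pool=my_pool)     # reuse a user-managed pool (new in this PR)
```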
Checklist
I couldn't install all deps, so I didn't run some tests and didn't build the documentation changes. Let's see what CI says about:
- [x] ~Test `test_decoder_batch_1` from `test_processor_wav2vec2_with_lm.py` (it uses `fork`, but it fails on my Mac, probably because of the OS platform behavior)~ Fixed in 17efdddbeeac75eeacddff3bcc51ced70ec19217
- [ ] All other tests
- [ ] `Wav2Vec2ProcessorWithLM.batch_decode`'s new Tips
- [ ] `Wav2Vec2ProcessorWithLM.batch_decode`'s new usage example
Also, the new tests are a copy-paste of previous tests. I couldn't figure out how to factor out the duplicated code (I'm more used to pytest's fixtures). Ideas to cut down the duplication are welcome.
After merge
Once merged, it would be nice to adapt and re-run some scripts, such as the evaluation of patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram. In that case, for example, there are two improvements:
- Current evaluation creates a fresh Pool for each inferred audio, yielding a huge overhead compared to sequential decoding or to a user-managed pool as introduced in this PR.
- Current evaluation uses `Wav2Vec2ProcessorWithLM.batch_decode` without `batched=True`, which probably means that there's no advantage in using parallel batch decoding. The usage example from this PR allows proper parallel batch decoding, as sketched below.
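A hedged sketch of how such an evaluation could be adapted (the `model`, `processor`, and `dataset` objects and the column names are assumed to come from the evaluation script, not defined here):

```python
import multiprocessing

import torch

def map_to_pred(batch, pool):
    inputs = processor(
        [audio["array"] for audio in batch["audio"]],
        sampling_rate=16_000,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Decoding the whole batch in a single call is what lets the pool parallelize across samples.
    batch["transcription"] = processor.batch_decode(logits.cpu().numpy(), pool=pool).text
    return batch

# One fork-context pool, created once and shared across all mapped batches.
with multiprocessing.get_context("fork").Pool(processes=4) as pool:
    result = dataset.map(map_to_pred, batched=True, batch_size=8, fn_kwargs={"pool": pool})
```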
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? See:
- #17879
- #17070.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
@patrickvonplaten
@anton-l are you looking into this? I'm not super familiar with the pyctcdecode backend, but can get up to speed and take a look for @falcaopetri if need be :)
Hi @falcaopetri! Looks like test_decoder_batch_1 indeed fails on Linux too, maybe you could take a look and find a workaround?
Hi everyone. The test setup was wrong, but the new commit's diff should be self-explanatory. Let me know if there's still something missing.
You can also check this colab: a 34s -> 5s speed-up just from removing the overhead, even without using more than one process; with multiple batches to decode, the improvement should be roughly linear.
Hi @anton-l, @sanchit-gandhi. I've merged main trying to fix the Add model like runner CI step, but with no success. Could you take a look at that, and at the overall PR, please?
This PR looks very nice to me! Let's try to get it merged
@falcaopetri let's see what the tests say and then IMO we can merge if they are all green
Thanks for re-opening it @patrickvonplaten! I've just fixed a code quality issue that made check_code_quality fail. But now CircleCI seems to be having some (internal?) problem.
And should I rebase everything to get a nice and clean history, or is that not necessary?
@patrickvonplaten, I've rebased everything so we (i) get a cleaner merge and (ii) re-trigger CircleCI. Unfortunately, CircleCI setup is still failing.
It seems there is an issue with your CircleCI permissions; the tests won't run. Could you try refreshing your permissions as shown here?
Oh I see, I didn't realize CircleCI was talking about my own user's credentials. Thanks for the heads up. I've refreshed my credentials and force-pushed again to trigger the CI steps.
For future reference, I did make a mistake in the process: after refreshing credentials, CircleCI prompted me to set up a project within my own organization, which then made the run_test* steps fail with "Resource class docker for xlarge is not available for your project, or is not a valid resource class. This message will often appear if the pricing plan for this project does not support docker use." Deleting my own project and force pushing made the huggingface org run the steps.
I just realized that my usage Example in `Wav2Vec2ProcessorWithLM.batch_decode` triggers:
WARNING:datasets.fingerprint:Parameter 'fn_kwargs'={'pool': <multiprocessing.pool.Pool object at 0x7f64ecc54990>} of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
@patrickvonplaten, is there a way to indicate that a map arg shouldn't be taken into account when hashing the transformation? Or maybe pool could be a global variable?
Uff yeah good question. Gently pinging @lhoestq and/or @mariosasko here as this seems to be datasets related
On the other hand, I don't think it's a must that results are cached with datasets here - I think we can merge this PR without it.
> @patrickvonplaten, is there a way to indicate that a map arg shouldn't be taken into account when hashing the transformation? Or maybe pool could be a global variable?
You can pass `new_fingerprint=` in `map` with a unique fingerprint (string) that depends on the parameters of your transform.
Alternatively, you can just disable caching in datasets to remove the warning.
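Roughly, the two options could look like this (the fingerprint string and surrounding names are illustrative, reusing `map_to_pred` and `pool` from the earlier sketch):

```python
import datasets

# Alternative 1: supply an explicit fingerprint so the unpicklable pool never needs to be hashed.
result = dataset.map(
    map_to_pred,
    batched=True,
    batch_size=8,
    fn_kwargs={"pool": pool},
    new_fingerprint="wav2vec2-with-lm-pool-decode-v1",  # any string uniquely identifying this transform
)

# Alternative 2: disable caching entirely before calling map(); this also removes the warning.
# datasets.disable_caching()
```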
Thanks again for all your work on this!
Hi @falcaopetri! Thank you for adding this! After merging to the main branch, we have 3 test failures (when running on GPU):
tests/models/wav2vec2/test_modeling_tf_wav2vec2.py::TFWav2Vec2ModelIntegrationTest::test_wav2vec2_with_lm_invalid_pool
(line 243) RuntimeError: context has already been set
tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelIntegrationTest::test_wav2vec2_with_lm_invalid_pool
(line 243) RuntimeError: context has already been set
tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelIntegrationTest::test_wav2vec2_with_lm_pool
(line 1639) TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
More detailed information can be found here and in its raw logs.
Would you like to take a look here 🙏? Thank you 🤗
Hi @ydshieh. I'm really sorry about that. Both errors are my fault while setting up the tests.
`RuntimeError: context has already been set` emerged only during GPU tests, probably because tests are executed differently: a PR's pytest is invoked with -n 8, while your tests here aren't. I.e., the RuntimeError was indeed an issue, but it was hidden during the PR because tests were being executed in different processes.
`TypeError: can't convert cuda:0 device type tensor to numpy` was caused by a strange change I made after copying and pasting other tests.
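For context, a minimal standalone sketch (not the actual test code) of the stdlib behavior behind the first error; the second error is just the usual need to move a CUDA tensor to the CPU (`logits.cpu().numpy()`) before converting it to numpy:

```python
import multiprocessing

def demo():
    # set_start_method mutates process-global state and raises
    # "RuntimeError: context has already been set" the second time it is called in the
    # same process, which is what happens when several tests share one pytest worker
    # (-n 1) instead of each running in its own worker (-n 8).
    try:
        multiprocessing.set_start_method("spawn")
        multiprocessing.set_start_method("spawn")  # second call raises RuntimeError
    except RuntimeError as err:
        print(err)  # context has already been set

    # A local context avoids the global state and can be used repeatedly:
    with multiprocessing.get_context("spawn").Pool(1) as pool:
        print(pool.map(abs, [-1, -2]))

if __name__ == "__main__":
    demo()
```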
This should fix both issues: https://github.com/huggingface/transformers/compare/main...falcaopetri:transformers:fix-w2v2-lm-pool. What is your workflow in this case? Should I open a new PR?
Thank you so much @falcaopetri. Yes, it would be nice if you can open a PR. Otherwise, we can do it ourselves by looking at your branch above ❤️.
[Context of our CI] Our CircleCI tests run on CPU, so we use -n 8. After merging into main, we run (more) tests on GPU machines, so we set -n 1 to avoid multiple processes accessing the GPU at the same time.