Timestamps to transcribe
What does this PR do?
Adds support for extracting timestamps to the .transcribe() method.
Collection: ASR
Changelog
- Adds timestamps=True/False to the .transcribe() method in the transcription mixin
- Adds corresponding support in:
  - ctc_models.py
  - rnnt_models.py
  - hybrid_rnnt_ctc_models.py
- Raises a NotImplementedError for AED-based models (Canary)
- Adds support to transcribe_speech.py
- Merges two variables into one (compute_timestamps, preserve_alignments -> timestamps), since both are mutually dependent (a rough before/after sketch follows this list)
- Cleans up much of the code
- Adds an optional verbose=True argument to the change_decoding_strategy() method
- TODO:
  - unit tests
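As a rough before/after sketch of the merged flag (hedged: the "before" snippet follows the previously documented pattern of enabling compute_timestamps/preserve_alignments in the decoding config; exact field names may vary slightly per model):

```python
from omegaconf import open_dict
from nemo.collections.asr.models import ASRModel

model = ASRModel.from_pretrained('nvidia/parakeet-ctc-1.1b')

# Before this PR (illustrative): enable timestamps via the decoding config,
# then request hypotheses explicitly.
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.compute_timestamps = True
    decoding_cfg.preserve_alignments = True
model.change_decoding_strategy(decoding_cfg)
old_output = model.transcribe(['<file_path>'], return_hypotheses=True)

# After this PR: a single flag on transcribe().
new_output = model.transcribe(['<file_path>'], timestamps=True)
```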
Usage
From the command line
Using the transcribe_speech.py script:
```bash
python transcribe_speech.py pretrained_name="nvidia/parakeet-ctc-1.1b" \
  dataset_manifest=<manifest_path> \
  output_filename=<output_filename> \
  timestamps=True
```
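If you have a local checkpoint or a directory of audio files rather than a manifest, the same script can be pointed at them (a sketch using the script's usual model_path and audio_dir arguments):

```bash
# Variant: local .nemo checkpoint and a directory of audio files instead of a manifest
python transcribe_speech.py model_path=<path/to/model.nemo> \
  audio_dir=<path/to/audio_dir> \
  output_filename=<output_filename> \
  timestamps=True
```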
From a Python environment
For CTC-based models
```python
from nemo.collections.asr.models import ASRModel

ctc_model = ASRModel.from_pretrained('nvidia/parakeet-ctc-1.1b')
output = ctc_model.transcribe(['<file_path>'], timestamps=True)  # or pass a manifest instead of individual filepaths

# By default you get timestamps at the char, word, and segment level.
# Segment-level timestamps differ depending on whether the model natively
# supports punctuation and capitalization.

# Word-level timestamps: prints the first 10 entries.
# The *_offset fields are frame numbers; 'start' and 'end' are in seconds.
print(output[0].timestep['word'][:10])

# Segment-level timestamps
print(output[0].timestep['segment'][:10])
```
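As a quick illustration of consuming the word-level output (a minimal sketch, assuming each entry exposes the 'word', 'start', and 'end' keys described in the comments above):

```python
# Minimal sketch: print word timings in seconds (assumes 'word'/'start'/'end' keys)
for entry in output[0].timestep['word']:
    print(f"{entry['start']:6.2f}s - {entry['end']:6.2f}s  {entry['word']}")
```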
For RNNT/TDT-based models
(Currently the only difference is the output type; this will be made consistent in an upcoming PR.)
```python
from nemo.collections.asr.models import ASRModel

transducer_model = ASRModel.from_pretrained('nvidia/parakeet-rnnt-1.1b')
output = transducer_model.transcribe(['<file_path>'], timestamps=True)

# Word-level timestamps
print(output[0][0].timestep['word'][:10])

# Segment-level timestamps
print(output[0][0].timestep['segment'][:10])
```
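Until the output types are unified, a small helper can normalize both return shapes (a sketch, assuming the transducer path returns a (best_hypotheses, all_hypotheses) pair as in the example above, while CTC models return a flat list):

```python
# Sketch: pick the best hypotheses regardless of model family (hedged heuristic).
def best_hypotheses(transcribe_output):
    # Transducer models currently return (best_hypotheses, all_hypotheses);
    # CTC models return a flat list of hypotheses.
    if isinstance(transcribe_output, tuple):
        return transcribe_output[0]
    return transcribe_output

hyps = best_hypotheses(output)
print(hyps[0].timestep['word'][:10])
```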
For Hybrid RNNT/TDT-CTC models
Same as above. By default, decoding uses the transducer (RNNT/TDT) decoder; to use the CTC decoder instead, change the decoding strategy before running transcribe():
```python
from nemo.collections.asr.models import ASRModel
from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig

hybrid_model = ASRModel.from_pretrained('nvidia/parakeet-tdt_ctc-110m')

ctc_cfg = CTCDecodingConfig()
ctc_cfg.strategy = "greedy_batch"
hybrid_model.change_decoding_strategy(decoding_cfg=ctc_cfg, decoder_type="ctc")

output = hybrid_model.transcribe(['<file_path>'], timestamps=True)

# Word-level timestamps
print(output[0].timestep['word'][:10])

# Segment-level timestamps
print(output[0].timestep['segment'][:10])
```
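To switch back to the transducer decoder afterwards, the same method can be called again (a sketch: passing decoding_cfg=None is assumed to reuse the model's existing RNNT decoding config, and verbose is the new optional argument added in this PR):

```python
# Sketch: restore the transducer (RNNT/TDT) decoder.
# verbose=False silences the decoding-strategy change message (new option from this PR).
hybrid_model.change_decoding_strategy(decoding_cfg=None, decoder_type="rnnt", verbose=False)
output = hybrid_model.transcribe(['<file_path>'], timestamps=True)
```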
For AED models
For AED models like Canary, support will be added soon; for now, requesting timestamps raises a NotImplementedError (see the changelog).
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI remove and add the label again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [x] New Feature
- [ ] Bugfix
- [x] Documentation
If you haven't finished some of the above items you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
- Related to # (issue)