Timestamps to transcribe
What does this PR do?
Adds support for extracting timestamps to the .transcribe() method.
Collection: ASR
Changelog
- Adds timestamps=True/False to the .transcribe() method in the transcription mixin
- Adds corresponding support in:
  - ctc_models.py
  - rnnt_models.py
  - hybrid_rnnt_ctc_models.py
- Raises a NotImplementedError for AED-based models (Canary)
- Adds support to transcribe_speech.py
- Merges two variables into one (compute_timestamps, preserve_alignments -> timestamps), since both are mutually dependent (a rough before/after sketch follows this list)
- Cleans up much of the code
- Adds an optional verbose=True argument to the change_decoding_strategy() method
- TODO:
  - unit tests
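As a rough before/after sketch of the merged flag (hedged: the "before" snippet follows the previously documented pattern of enabling compute_timestamps/preserve_alignments in the decoding config; exact field names may vary slightly per model):

```python
from omegaconf import open_dict
from nemo.collections.asr.models import ASRModel

model = ASRModel.from_pretrained('nvidia/parakeet-ctc-1.1b')

# Before this PR (illustrative): enable timestamps via the decoding config,
# then request hypotheses explicitly.
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.compute_timestamps = True
    decoding_cfg.preserve_alignments = True
model.change_decoding_strategy(decoding_cfg)
old_output = model.transcribe(['<file_path>'], return_hypotheses=True)

# After this PR: a single flag on transcribe().
new_output = model.transcribe(['<file_path>'], timestamps=True)
```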
Usage
From the command line
Using the transcribe_speech.py script:
```bash
python transcribe_speech.py pretrained_name="nvidia/parakeet-ctc-1.1b" \
  dataset_manifest=<manifest_path> \
  output_filename=<output_filename> \
  timestamps=True
```
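If you have a local checkpoint or a directory of audio files rather than a manifest, the same script can be pointed at them (a sketch using the script's usual model_path and audio_dir arguments):

```bash
# Variant: local .nemo checkpoint and a directory of audio files instead of a manifest
python transcribe_speech.py model_path=<path/to/model.nemo> \
  audio_dir=<path/to/audio_dir> \
  output_filename=<output_filename> \
  timestamps=True
```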
From a Python environment
For CTC-based models
```python
from nemo.collections.asr.models import ASRModel

ctc_model = ASRModel.from_pretrained('nvidia/parakeet-ctc-1.1b')
output = ctc_model.transcribe(['<file_path>'], timestamps=True)  # or pass a manifest instead of individual filepaths

# By default you get timestamps at the char, word, and segment level.
# Segment-level timestamps differ depending on whether the model natively
# supports punctuation and capitalization.

# Word-level timestamps: prints the first 10 entries.
# The *_offset fields are frame numbers; 'start' and 'end' are in seconds.
print(output[0].timestep['word'][:10])

# Segment-level timestamps
print(output[0].timestep['segment'][:10])
```
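As a quick illustration of consuming the word-level output (a minimal sketch, assuming each entry exposes the 'word', 'start', and 'end' keys described in the comments above):

```python
# Minimal sketch: print word timings in seconds (assumes 'word'/'start'/'end' keys)
for entry in output[0].timestep['word']:
    print(f"{entry['start']:6.2f}s - {entry['end']:6.2f}s  {entry['word']}")
```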
For RNNT/TDT-based models
(Currently the only difference is the output type; this will be made consistent in an upcoming PR.)
```python
from nemo.collections.asr.models import ASRModel

transducer_model = ASRModel.from_pretrained('nvidia/parakeet-rnnt-1.1b')
output = transducer_model.transcribe(['<file_path>'], timestamps=True)

# Word-level timestamps
print(output[0][0].timestep['word'][:10])

# Segment-level timestamps
print(output[0][0].timestep['segment'][:10])
```
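Until the output types are unified, a small helper can normalize both return shapes (a sketch, assuming the transducer path returns a (best_hypotheses, all_hypotheses) pair as in the example above, while CTC models return a flat list):

```python
# Sketch: pick the best hypotheses regardless of model family (hedged heuristic).
def best_hypotheses(transcribe_output):
    # Transducer models currently return (best_hypotheses, all_hypotheses);
    # CTC models return a flat list of hypotheses.
    if isinstance(transcribe_output, tuple):
        return transcribe_output[0]
    return transcribe_output

hyps = best_hypotheses(output)
print(hyps[0].timestep['word'][:10])
```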
For Hybrid RNNT/TDT-CTC models
Same as above. By default, decoding uses the transducer (RNNT/TDT) decoder; to use the CTC decoder instead, change the decoding strategy before running transcribe():
```python
from nemo.collections.asr.models import ASRModel
from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig

hybrid_model = ASRModel.from_pretrained('nvidia/parakeet-tdt_ctc-110m')

ctc_cfg = CTCDecodingConfig()
ctc_cfg.strategy = "greedy_batch"
hybrid_model.change_decoding_strategy(decoding_cfg=ctc_cfg, decoder_type="ctc")

output = hybrid_model.transcribe(['<file_path>'], timestamps=True)

# Word-level timestamps
print(output[0].timestep['word'][:10])

# Segment-level timestamps
print(output[0].timestep['segment'][:10])
```
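To switch back to the transducer decoder afterwards, the same method can be called again (a sketch: passing decoding_cfg=None is assumed to reuse the model's existing RNNT decoding config, and verbose is the new optional argument added in this PR):

```python
# Sketch: restore the transducer (RNNT/TDT) decoder.
# verbose=False silences the decoding-strategy change message (new option from this PR).
hybrid_model.change_decoding_strategy(decoding_cfg=None, decoder_type="rnnt", verbose=False)
output = hybrid_model.transcribe(['<file_path>'], timestamps=True)
```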
For AED models
For AED models like Canary, support will be added soon; for now, requesting timestamps raises a NotImplementedError (see the changelog).
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI remove and add the label again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [x] New Feature
- [ ] Bugfix
- [x] Documentation
If you haven't finished some of the above items you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
- Related to # (issue)