Feature request: Allow for easy transcript processing at training time

Open JRMeyer opened this issue 2 years ago • 3 comments

Scenario: I have a large dataset where all transcripts are in ALL CAPS, but the alphabet I want to use (i.e. the one for fine-tuning v1.2.0) is lower case.

Current solution: I can either (1) make a copy of the dataset, (2) drop the last layer of the source model for transfer learning, or (3) make changes to the source code of coqui_stt_training.
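
For reference, workaround (1) can be scripted. The following is a minimal sketch, assuming the dataset is a CSV with a `transcript` column (as in the usual `wav_filename,wav_filesize,transcript` layout); the function name `lowercase_transcripts` is hypothetical and not part of coqui_stt_training.

```python
import csv
import io


def lowercase_transcripts(src, dst, column="transcript"):
    """Copy CSV rows from src to dst, lower-casing a single column.

    `src` and `dst` are open text file objects. The column name is an
    assumption about the dataset layout.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = row[column].lower()
        writer.writerow(row)


# Small in-memory demonstration; real use would pass open file handles.
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1234,HELLO WORLD\n"
)
out = io.StringIO()
lowercase_transcripts(sample, out)
lowercased_csv = out.getvalue()
```

This still leaves a second copy of the CSV on disk, which is the cost the request is trying to avoid.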

Desired solution: pass a command directly to the training module, for example

passing some Unix text command:

$ python -m coqui_stt_training.train --clean_text "tr '[:upper:]' '[:lower:]'"

passing some python:

$ python -m coqui_stt_training.train --clean_text ".lower()"

JRMeyer, Feb 10 '22 00:02

Command line flags are not a suitable UI for plugging arbitrary code, or arbitrary shell pipes. This would have to be an API.

reuben, Feb 10 '22 10:02

You're thinking something like --lowercase true?

JRMeyer, Feb 10 '22 18:02

Or an API where you provide a Python function that gets run inside the feeding pipeline.

reuben, Feb 10 '22 21:02
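
To make the last suggestion concrete, here is a rough sketch of what a function-based hook could look like. It is not the coqui_stt_training API: `feed`, `encode`, and the `transform` parameter are all hypothetical names, and the feeding pipeline is reduced to a plain Python generator for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical hook type: takes a raw transcript, returns the cleaned one.
TranscriptTransform = Callable[[str], str]


def encode(transcript: str, char_to_id: Dict[str, int]) -> List[int]:
    """Map characters to label ids; raises KeyError for characters
    that are not in the alphabet."""
    return [char_to_id[ch] for ch in transcript]


def feed(
    samples: Iterable[Tuple[str, str]],
    char_to_id: Dict[str, int],
    transform: Optional[TranscriptTransform] = None,
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (wav_filename, label_ids), applying the user-supplied
    transform to each transcript before it is encoded."""
    for wav_filename, transcript in samples:
        if transform is not None:
            transcript = transform(transcript)
        yield wav_filename, encode(transcript, char_to_id)


# Lower-case alphabet, as in the scenario above (illustrative only).
alphabet = " abcdefghijklmnopqrstuvwxyz'"
char_to_id = {ch: i for i, ch in enumerate(alphabet)}

samples = [("a.wav", "HELLO WORLD")]
batch = list(feed(samples, char_to_id, transform=str.lower))
```

With a hook like this, the ALL CAPS dataset stays untouched on disk, and any callable (for example `str.lower`, or a custom normalizer) can be passed in from user code rather than through a command line flag.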