Can you add a function to transcribe multiple files, or even an entire folder?

Open kiron111 opened this issue 1 year ago • 17 comments

Because I have to transcribe multiple audio/video files each time, it's not convenient to click and wait for each file to finish.

Thank you for developing such great software!!!

kiron111 avatar Mar 18 '23 19:03 kiron111

Right! I was just about to post that, and found you had already opened this issue!

If a multi-file function in the GUI is not easy, command-line usage is also a good alternative, so that users can write scripts to use it (handling multiple files in order).

HaujetZhao avatar Mar 18 '23 20:03 HaujetZhao

@kiron111 I think you can do that with the new PowerShell wrapper. Here’s how.

  1. Create the folder %USERPROFILE%\Documents\WindowsPowerShell\Modules
  2. Download WhisperPS.zip from the releases page and extract the ZIP into that folder
  3. Launch PowerShell and run the following commands to transcribe all mp3 and m4a files into text files without timestamps:
Import-Module WhisperPS -DisableNameChecking
$Model = Import-WhisperModel D:\Data\Whisper\ggml-medium.bin
cd <directory where you have input files>
dir * -include *.mp3, *.m4a | Transcribe-File $Model | foreach { $_ | Export-Text ($_.SourceName + ".txt") }

Replace D:\Data\Whisper\ggml-medium.bin with the path to the model file, and <directory where you have input files> with the input directory.

I think scripting is generally a better way to accomplish such tasks, because it allows you to do more than just transcribe folders. For instance, you can transcribe the same files with multiple models and compare the results, export multiple formats for each source file, or generate output file names with custom algorithms…

More information about the PowerShell module.
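As an illustration of that last point, here is a minimal Python sketch (hypothetical helper names, not part of WhisperPS) that plans one output file per model for every audio file in a folder, so the results of different models can be compared side by side:

```python
# Minimal sketch (hypothetical helpers): pair every mp3/m4a in a folder
# with every model, deriving a distinct output name per (file, model).
from pathlib import Path
from typing import List, Tuple

def output_name(source: Path, model: str, ext: str = ".txt") -> Path:
    # e.g. talk.mp3 + "medium" -> talk.medium.txt, next to the source file
    return source.with_suffix(f".{model}{ext}")

def plan_jobs(folder: Path, models: List[str]) -> List[Tuple[Path, str, Path]]:
    # Collect the audio files, then cross them with the model list.
    audio = sorted(p for p in folder.iterdir()
                   if p.suffix.lower() in (".mp3", ".m4a"))
    return [(src, m, output_name(src, m)) for src in audio for m in models]
```

Each planned job can then be fed to whatever transcription call you prefer; the naming scheme is the part a GUI cannot easily customize.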

Const-me avatar Mar 18 '23 20:03 Const-me

I just wrote something for that: https://github.com/tigros/Whisperer. Hopefully you have an NVidia card, though. I ran it overnight and it seems OK.

tigros avatar Mar 19 '23 01:03 tigros

Thanks to everyone who commented above!!!

I only have an AMD GPU, but I will try out each method mentioned above.

Hope I can work it out.

kiron111 avatar Mar 20 '23 08:03 kiron111

I am using Win11, and the PowerShell module directory doesn't seem to be the same as you wrote. Even the alternative directories I found by googling didn't work.

Then I installed PowerShell 7 from Microsoft and could easily locate the module directory (i.e. "C:\Program Files\PowerShell\7\Modules").

Finally it works, thanks!!!!! 👍 💯

kiron111 avatar Mar 20 '23 09:03 kiron111

I just wrote something for that: https://github.com/tigros/Whisperer. Hopefully you have an NVidia card, though. I ran it overnight and it seems OK.

Errors occur for AMD GPUs :-(

kiron111 avatar Mar 20 '23 09:03 kiron111

@kiron111 I think you can do that with the new PowerShell wrapper. Here's how. (Full instructions quoted in the earlier comment.)

One problem remains: all my output files are translated to English.

Is there any command to stop the automatic translation to English? I mainly transcribe Chinese audio.

Thanks!!!

kiron111 avatar Mar 20 '23 09:03 kiron111

@kiron111 You’re getting English because it’s the default. For Chinese, you need to pass another parameter to the Transcribe-File cmdlet: -language zh. The complete line then becomes:

dir * -include *.mp3, *.m4a | Transcribe-File $Model -language zh | foreach { $_ | Export-Text ($_.SourceName + ".txt") }

Const-me avatar Mar 20 '23 10:03 Const-me

@kiron111 The PowerShell module version 1.10.1 supports human-readable names for the languages, such as Chinese, in addition to the codes defined by the 1988 version of ISO 639-1, which are used by OpenAI.
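Purely as an illustration, the name-to-code normalization described above could look like this minimal Python sketch (a hypothetical, abbreviated table, not the module's actual implementation):

```python
# Hypothetical lookup (illustrative; a real table covers all languages):
# accept either an ISO 639-1 code or a human-readable language name
# and normalize it to the code Whisper expects.
LANGUAGE_CODES = {
    "chinese": "zh",
    "english": "en",
    "french": "fr",
    "russian": "ru",
}

def normalize_language(value: str) -> str:
    v = value.strip().lower()
    if v in LANGUAGE_CODES.values():
        return v                   # already a code such as "zh"
    if v in LANGUAGE_CODES:
        return LANGUAGE_CODES[v]   # a name such as "Chinese"
    raise ValueError(f"unknown language: {value!r}")
```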

Const-me avatar Mar 20 '23 12:03 Const-me

@kiron111 You’re getting English because it’s the default. For Chinese, you need to pass another parameter to the Transcribe-File cmdlet: -language zh. The complete line then becomes:

dir * -include *.mp3, *.m4a | Transcribe-File $Model -language zh | foreach { $_ | Export-Text ($_.SourceName + ".txt") }

OK, thanks, let me try it.

kiron111 avatar Mar 20 '23 13:03 kiron111

Errors occur for AMD GPUs :-(

Hi, I made a fix for AMD, please let me know if it works.

Thanks.

tigros avatar Mar 20 '23 23:03 tigros

It works for AMD!!!

Thanks for your hard work on the software.

kiron111 avatar Mar 21 '23 12:03 kiron111

Thank you. But I am not sure how to use "Export-SubRip". I changed this statement:

dir * -include *.mp3, *.m4a | Transcribe-File $Model | foreach { $_ | Export-Text ($_.SourceName + ".txt") }

into:

dir * -include *.mp3, *.m4a | Transcribe-File $Model | foreach { $_ | Export-SubRip ($_.SourceName + ".srt") }

Yes, I got the srt file, but with no timestamps… What should I do?
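For context, SRT cue times have the form HH:MM:SS,mmm. A minimal Python sketch of that formatting (illustrative only, unrelated to Export-SubRip's internals):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT cue time, e.g. 3723.5 -> '01:02:03,500'."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)   # whole hours
    m, ms = divmod(ms, 60_000)      # whole minutes
    s, ms = divmod(ms, 1_000)       # whole seconds + leftover milliseconds
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """One numbered SRT cue: index line, time range line, then the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```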

Chess888 avatar Mar 22 '23 11:03 Chess888

I wrote a Python script to do so, using the CLI release. You might want to change the comment strings in the script to suit your language:

from datetime import datetime
from pathlib import Path
import rich
from subprocess import PIPE, Popen
from typing import List
import typer
from enum import Enum

app = typer.Typer(add_completion=False,  # don't add shell-completion options to the help
                  rich_markup_mode='rich')  # use rich markup in the help text


class Model(str, Enum):
    tiny = 'tiny'
    base = 'base'
    small = 'small'
    medium = 'medium'
    large = 'large'


def get_task(path: List[Path]) -> List[Path]:
    """Build the task list: each task is the path of an audio/video file."""
    task_list = []
    for p in path:
        if p.is_file() and p.suffix.lower() in ['.mp3', '.mp4', '.wav', '.avi', '.mkv', '.mov']:
            task_list.append(p)
        elif p.is_dir():
            for f in p.rglob('*.*'):
                if f.suffix.lower() in ['.mp3', '.mp4', '.wav', '.avi', '.mkv', '.mov']:
                    task_list.append(f)
    return task_list


def get_srt(path: Path, model: Path, language: str) -> float:
    """Call whisper-main.exe to generate an SRT subtitle file."""
    t1 = datetime.now()
    args = ['whisper-main.exe', '-m',
            str(model), '-l', language, '-osrt', str(path)]
    proc = Popen(args, stdout=PIPE, stderr=PIPE)
    stdout, stderr = proc.communicate()
    t2 = datetime.now()
    elapsed = (t2 - t1).total_seconds()
    ram_status = stderr.decode('utf-8').splitlines()[-1]
    rich.print(f'[grey50]{ram_status}[/grey50]')  # report RAM / VRAM usage
    return elapsed


@app.command(epilog='Made with [red]:heart:[/red]  by Haujet')
def main(path: List[Path] = typer.Argument(None, help="files or folders to process"),
         model: Model = typer.Option(Model.large,
                                     '-m', '--model',
                                     help='model to use for recognition',
                                     show_default=False),
         language: str = typer.Option('zh',
                                      '-l', '--language',
                                      help='language to recognize, e.g. zh, en, fr, ru'),
         ):
    """
    Batch-convert audio/video files to subtitle files
    """

    if not path:
        # prompt if no path was given on the command line
        path = [Path(input('Enter a file to process: ').strip('"'))]

    # path is now a list; walk it to collect the Path of every
    # audio/video file into a task list
    task_list = get_task(path)

    # report the language and model in use, then the list of files
    typer.echo(f"\nUsing the {model.value} model for recognition, language: {language}")
    model = Path('./model/' + model.value + '.bin')

    if not model.is_file():
        raise Exception('Model file does not exist')

    typer.echo(f"There are {len(task_list)} files to process:")
    for i, task in enumerate(task_list):
        typer.echo(f"    {i+1}. {task}")

    # process the tasks one by one
    for i, task in enumerate(task_list, start=1):
        typer.secho(
            f"\nProcessing file {i}/{len(task_list)}: {task}", fg='bright_cyan')
        elapsed = get_srt(task, model, language)
        rich.print(f"[grey50]Took {elapsed:.2f} seconds\n[/grey50]")

    rich.print(f'[green]All tasks completed![/green]')


if __name__ == '__main__':
    app()

HaujetZhao avatar Mar 23 '23 03:03 HaujetZhao

Yeah, I encountered the same problems. Until a fix comes, try another Whisper GitHub project; it works: https://github.com/tigros/Whisperer

kiron111 avatar Mar 23 '23 13:03 kiron111

@kiron111 I think you can do that with the new PowerShell wrapper. Here's how. (Full instructions quoted in the earlier comment.)

How should I modify the parameters if I want the output to include timestamps?

wolfydw avatar Apr 07 '23 07:04 wolfydw

I wrote a Python script to do so, using the CLI release. (Full script quoted in the earlier comment.)

The CLI runs too slowly compared to the desktop version.

actforjason avatar May 07 '23 12:05 actforjason