FunASR

Timestamps of words seem incorrect in file transcription service

Open electroniccc opened this issue 1 year ago • 1 comments

When I convert some audio files, the timestamps in the returned result look incorrect. For example, the total duration of the audio file is about 6 minutes (roughly 360 s), but the timestamp of the last word is about 600 s.

The file info: (image)

The result: (image)

My environment:

OS: Ubuntu 22.04 (WSL)

Docker image:

a186f040b0a1   registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.3.0   "/bin/bash"   3 weeks ago   Up 30 seconds             0.0.0.0:10095->10095/tcp   funasr

Start command:

nohup bash run_server.sh \
  --certfile "" \
  --model_dir damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-onnx \
  > log.out 2>&1 &

The sample file: https://drive.google.com/file/d/1YS3gWovNJIPDjN9gs-vBxoNx7dUvvO2l/view?usp=drive_link

electroniccc avatar Dec 11 '23 12:12 electroniccc
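The mismatch described in the report above (a last-word timestamp of about 600 s for a file of about 6 minutes) can be checked mechanically by comparing the returned timestamps against the audio duration. The following is a minimal sketch, not part of FunASR: the helper names are hypothetical, the input is assumed to be a PCM WAV file, and the timestamps are assumed to be `[start_ms, end_ms]` pairs in milliseconds. Adjust the units if your result format differs.

```python
import wave


def wav_duration_ms(path):
    """Return the duration of a PCM WAV file in milliseconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() * 1000.0 / wf.getframerate()


def timestamps_out_of_range(timestamps, duration_ms, tolerance_ms=500.0):
    """Return the [start_ms, end_ms] pairs that end after the audio does.

    `timestamps` is assumed to be a list of [start_ms, end_ms] pairs
    (an assumption about the result format, not a documented guarantee).
    `tolerance_ms` allows for small rounding at the end of the file.
    """
    limit = duration_ms + tolerance_ms
    return [pair for pair in timestamps if pair[1] > limit]
```

Any non-empty result from `timestamps_out_of_range` means at least one word is placed past the end of the recording, which is useful evidence to attach to a report like this one.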

Currently, after passing through the ITN, timestamp misalignment occurs. The issue has been fixed and will be released in the next version.

lyblsgo avatar Dec 12 '23 03:12 lyblsgo

Try funasr:funasr-runtime-sdk-cpu-0.4.0

lyblsgo avatar Jan 03 '24 08:01 lyblsgo

Tried funasr-runtime-sdk-cpu-0.4.2, the issue still exists.

electroniccc avatar Jan 28 '24 10:01 electroniccc

If the issue persists, please provide detailed steps to reproduce, as well as server and client logs.

lyblsgo avatar Feb 05 '24 07:02 lyblsgo

> Currently, after passing through the ITN, timestamp misalignment occurs. The issue has been fixed and will be released in the next version.

The issue mentioned by the poster doesn't happen when running ASR via Python inference, though:

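from funasr import AutoModel

# Load the SeACo-Paraformer model at a pinned revision.
model = AutoModel(model="iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
                  model_revision="v2.0.4",
                  )
# wavf is the path to the input audio file.
res = model.generate(input=wavf)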

However, with the above Python code I did find cases where some sentences within a long audio file are misaligned by around 0.5 s. Is it possible that the ITN issue you mentioned is responsible for this?

wincing2 avatar Feb 14 '24 13:02 wincing2
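To make a sentence-level misalignment like the roughly 0.5 s one described above reproducible, it can help to compare the returned timestamps against a reference, for example manual annotations or the output of another decoding path. The sketch below is hypothetical helper code, not a FunASR API. It assumes both inputs are lists of `[start_ms, end_ms]` pairs that line up one-to-one, and the 300 ms threshold is an arbitrary choice.

```python
def start_offsets_ms(reference, candidate):
    """Per-item start-time offset (candidate minus reference), in milliseconds."""
    if len(reference) != len(candidate):
        raise ValueError("timestamp lists must have the same length")
    return [cand[0] - ref[0] for ref, cand in zip(reference, candidate)]


def misaligned_indices(reference, candidate, threshold_ms=300):
    """Indices whose start time differs from the reference by more than the threshold."""
    offsets = start_offsets_ms(reference, candidate)
    return [i for i, delta in enumerate(offsets) if abs(delta) > threshold_ms]
```

Reporting the affected indices together with their offsets gives maintainers a concrete list of sentences to inspect.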