FunASR
Word timestamps seem incorrect in the file transcription service
When I transcribe some audio files, the timestamps in the returned result don't look correct. For example, the total duration of one audio file is about 6 minutes (roughly 360 s), but the timestamp of the last word is about 600 s.
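A quick way to confirm this kind of mismatch is to compare the audio duration with the end time of the last word. The sketch below is only an illustration: it assumes the input is a PCM WAV file and that the word timestamps come back as `[start_ms, end_ms]` pairs in milliseconds. The helper names are hypothetical and not part of FunASR.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def last_timestamp_seconds(timestamps_ms):
    """Return the latest end time, in seconds, among [start_ms, end_ms] pairs.

    Assumption: timestamps are expressed in milliseconds.
    """
    if not timestamps_ms:
        return 0.0
    return max(end for _, end in timestamps_ms) / 1000.0


def timestamps_exceed_audio(timestamps_ms, duration_s, tolerance_s=1.0):
    """Return True if the last word ends later than the audio itself (plus a tolerance)."""
    return last_timestamp_seconds(timestamps_ms) > duration_s + tolerance_s
```

With a 360 s file and a last word ending at 600 000 ms, `timestamps_exceed_audio` returns `True`, which matches the symptom described above.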
The file info:
The result:
My environment:
OS: Ubuntu 22.04 (WSL)
Docker image:
a186f040b0a1 registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.3.0 "/bin/bash" 3 weeks ago Up 30 seconds 0.0.0.0:10095->10095/tcp funasr
Start command:
nohup bash run_server.sh \
--certfile "" \
--model_dir damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-onnx \
> log.out 2>&1 &
The sample file: https://drive.google.com/file/d/1YS3gWovNJIPDjN9gs-vBxoNx7dUvvO2l/view?usp=drive_link
Currently, timestamps become misaligned after the text passes through ITN (inverse text normalization). The issue has been fixed, and the fix will be released in the next version.
Try funasr:funasr-runtime-sdk-cpu-0.4.0.
I tried funasr-runtime-sdk-cpu-0.4.2, and the issue still exists.
If the issue persists, please provide detailed steps to reproduce, as well as server and client logs.
The issue mentioned by the poster doesn't happen when running ASR through Python inference, though:
from funasr import AutoModel

# wavf is the path to the input audio file
model = AutoModel(
    model="iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    model_revision="v2.0.4",
)
res = model.generate(input=wavf)
However, with the above Python code I did find cases where some sentences within a long audio file are misaligned by around 0.5 s. Is it possible that the ITN issue you mentioned is responsible for this?
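To make such reports easier to reproduce, the offset can be measured per sentence. This is a minimal sketch, assuming you have hand-checked reference timings for the same sentences and that both lists hold `[start_ms, end_ms]` pairs in the same order. The function name and the 300 ms threshold are hypothetical choices, not anything defined by FunASR.

```python
def find_drifting_sentences(predicted_ms, reference_ms, threshold_ms=300):
    """Return (index, offset_ms) for sentences whose start time drifts from the reference.

    offset_ms is predicted start minus reference start, so a positive value
    means the predicted sentence starts late.
    """
    drifts = []
    pairs = zip(predicted_ms, reference_ms)
    for i, ((pred_start, _), (ref_start, _)) in enumerate(pairs):
        offset = pred_start - ref_start
        if abs(offset) > threshold_ms:
            drifts.append((i, offset))
    return drifts


# Illustrative values only, not taken from the sample file.
predicted = [[0, 1000], [1500, 2500], [3000, 4000]]
reference = [[0, 1000], [1000, 2000], [3100, 4000]]
print(find_drifting_sentences(predicted, reference))
```

Listing the indices and offsets of the drifting sentences alongside the logs should help narrow down whether the drift starts at a particular segment.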