
Alignment "doubling"

Open aolney opened this issue 6 years ago • 3 comments

align.sh is repeating ("doubling") the words of each utterance in its output, and the times are way off. Here is the STM, which was generated from the SRT subtitles (closed captions) extracted with FFmpeg:

11	A	FakeSpeaker	3.103	5.606	and now a fireside chat
11	A	FakeSpeaker	5.606	6.607	with the creators of comedy central's south park,
11	A	FakeSpeaker	6.607	8.609	matt stone and trey parker.
11	A	FakeSpeaker	13.614	15.116	hi. i'm trey parker.
11	A	FakeSpeaker	15.115	16.617	and i'm matt stone.

Here is the resulting .ali file:

1-1-S0---0025.380-0032.830 1 25.38 0.06 now
1-1-S0---0025.380-0032.830 1 25.44 0.00 a
1-1-S0---0025.380-0032.830 1 25.44 0.00 fireside
1-1-S0---0025.380-0032.830 1 25.44 1.77 chat
1-1-S0---0025.380-0032.830 1 27.21 0.09 now
1-1-S0---0025.380-0032.830 1 27.30 0.36 a
1-1-S0---0025.380-0032.830 1 27.66 0.87 fireside
1-1-S0---0025.380-0032.830 1 28.53 0.24 chat
1-1-S0---0025.380-0032.830 1 28.77 0.06 now
1-1-S0---0025.380-0032.830 1 28.83 0.12 a
1-1-S0---0025.380-0032.830 1 28.95 1.08 fireside
1-1-S0---0025.380-0032.830 1 30.03 0.63 chat
1-1-S0---0025.380-0032.830 1 30.66 0.12 now
1-1-S0---0025.380-0032.830 1 30.78 0.45 a
1-1-S0---0025.380-0032.830 1 31.23 0.39 fireside
1-1-S0---0025.380-0032.830 1 31.62 1.20 chat
1-1-S0---0032.830-0051.960 1 32.83 0.03 the
1-1-S0---0032.830-0051.960 1 32.86 0.99 creators
1-1-S0---0032.830-0051.960 1 33.85 0.12 of
1-1-S0---0032.830-0051.960 1 33.97 1.29 comedy
1-1-S0---0032.830-0051.960 1 35.26 4.26 central's
1-1-S0---0032.830-0051.960 1 39.52 0.00 south
1-1-S0---0032.830-0051.960 1 39.52 3.51 <unk>
1-1-S0---0032.830-0051.960 1 43.03 0.24 the
1-1-S0---0032.830-0051.960 1 43.27 0.87 creators
1-1-S0---0032.830-0051.960 1 44.14 0.06 of
1-1-S0---0032.830-0051.960 1 44.20 1.08 comedy
1-1-S0---0032.830-0051.960 1 45.28 1.47 central's
1-1-S0---0032.830-0051.960 1 46.75 0.00 south
1-1-S0---0032.830-0051.960 1 46.75 3.09 <unk>
1-1-S0---0032.830-0051.960 1 49.84 0.18 the
1-1-S0---0032.830-0051.960 1 50.02 0.54 creators
1-1-S0---0032.830-0051.960 1 50.56 0.03 of
1-1-S0---0032.830-0051.960 1 50.59 0.00 comedy
1-1-S0---0032.830-0051.960 1 50.59 1.11 central's
1-1-S0---0051.960-0064.490 1 51.96 0.42 stone
1-1-S0---0051.960-0064.490 1 52.38 0.09 and
1-1-S0---0051.960-0064.490 1 52.47 0.33 trey
1-1-S0---0051.960-0064.490 1 52.80 0.00 <unk>
1-1-S0---0051.960-0064.490 1 52.80 0.00 stone
1-1-S0---0051.960-0064.490 1 52.80 0.27 and
1-1-S0---0051.960-0064.490 1 53.07 0.00 trey
1-1-S0---0051.960-0064.490 1 53.07 0.00 <unk>
1-1-S0---0051.960-0064.490 1 53.07 0.00 stone
1-1-S0---0051.960-0064.490 1 53.07 0.00 and
1-1-S0---0051.960-0064.490 1 53.07 0.00 trey
1-1-S0---0051.960-0064.490 1 53.07 0.00 <unk>
1-1-S0---0051.960-0064.490 1 53.07 0.00 stone
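
One way to see the mismatch at a glance is to pull the segment boundaries out of the utterance IDs and compare them with the STM times. The sketch below is only an illustration: `ali_segments` is a hypothetical helper, and it assumes every ID ends in `---<start>-<end>` as in the lines above.

```shell
#!/bin/bash
# Hypothetical helper (not part of eesen-transcriber): read .ali lines on
# stdin and print one line per distinct utterance ID:
#   <segment start> <segment end> <number of aligned words>
# Assumes IDs of the form "1-1-S0---0025.380-0032.830", as in the sample.
ali_segments() {
  awk '{
    id = $1
    n = split(id, part, "---")     # last part is "0025.380-0032.830"
    split(part[n], t, "-")         # t[1] = start, t[2] = end
    if (!(id in count)) {          # remember first-seen order of IDs
      order[++k] = id
      start[id] = t[1] + 0
      stop[id] = t[2] + 0
    }
    count[id]++
  }
  END {
    for (i = 1; i <= k; i++) {
      id = order[i]
      printf "%.3f %.3f %d\n", start[id], stop[id], count[id]
    }
  }'
}

# Demo on two lines taken from the output above.
printf '%s\n' \
  '1-1-S0---0025.380-0032.830 1 25.38 0.06 now' \
  '1-1-S0---0025.380-0032.830 1 25.44 0.00 a' \
  | ali_segments
# prints: 25.380 32.830 2
```

On the full file the first segment starts at 25.38 s, while the first STM utterance starts at 3.103 s, which suggests the segments did not come from the STM.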

Any suggestions would be appreciated. Regular ASR functionality (with Kaldi) is working fine. FWIW, my steps and utils directories are linked to Kaldi, not to Eesen.

aolney avatar May 30 '19 01:05 aolney

The problem seemed to be in the Makefile: instead of using the segments from the STM, it was running LIUM. Below is the align.sh that seems to have fixed the problem:

#!/bin/bash

# Copyright 2016  er1k
# Apache 2.0

# Prepare data for, and run, the align_ctc_utts.sh script that generates word-level alignments
# in an "Eesen Transcriber-centric" way; output is found in build/output/<basename>.ali

# Required inputs:
#
# * a 'hypothesis' text file for which to compute alignments, extension .txt
#   one utterance per line. If no hypothesis text is found, text
#   is obtained from the STM file below
# * an STM file with utterance/segment timings - 'perfect' transcription
# * an audio file, extension can vary (.mp3, .wav, .mp4 etc)

BASEDIR=$(dirname $0)
EESEN_ROOT=~/eesen

# Change these if you're using different models 
#GRAPH_DIR=$EESEN_ROOT/asr_egs/tedlium/v2-30ms/data/lang_phn_test_test_newlm
GRAPH_DIR=$EESEN_ROOT/asr_egs/tedlium/v2-30ms/data/lang_phn_test
MODEL_DIR=$EESEN_ROOT/asr_egs/tedlium/v2-30ms/exp/train_phn_l5_c320_v1s

# Defaults
frame_shift=0.03  # 30 ms frames
lm_weight=0.8     # same as best setting for 30ms eesen tedlium transcriber

. path.sh
. $BASEDIR/utils/parse_options.sh

filename=$(basename "$1")
basename="${filename%.*}"
dirname=$(dirname "$1")
extension="${filename##*.}"

cd $BASEDIR
echo "In $BASEDIR"

if [ $# -ne 1 ]; then
  echo "Usage: align.sh <basename>.{wav,mp3,mp4,sph}"
  echo " in same folder is test text named <basename>.txt"
  echo " and STM file named <basename>.stm (for segments)"
  echo " ./align.sh /vagrant/GaryFlake_2010.wav"
  echo " output is build/output/<basename>.ali"
  exit 1;
fi

mkdir -p $BASEDIR/build/audio/base $BASEDIR/build/output

# un-shorten-ify SPH files
#if [ $extension == "sph" ]; then
#    sph2pipe $1 > build/audio/base/$basename.unshorten
#    sox build/audio/base/$basename.unshorten -c 1 build/audio/base/$basename.wav rate -v 16k
#fi

mkdir -p $BASEDIR/src-audio
cp $1 $BASEDIR/src-audio
#prefixing with BASEDIR throws off make rule?
#make $BASEDIR/build/audio/base/$basename.wav
make build/audio/base/$basename.wav

# 8k
# sox $1 -c 1 -e signed-integer build/audio/base/$basename.wav rate -v 8k

mkdir -p $BASEDIR/build/diarization/$basename
# make STM from cha
if [ -f $dirname/$basename.cha -a ! -f $dirname/$basename.stm ]; then
  local/cha2stm.sh $dirname/$basename.cha | sed 's/xxx/\<unk\>/g' > build/output/$basename.stm
elif [ -f $dirname/$basename.stm ]; then
  cp $dirname/$basename.stm build/output/
elif [ ! -f $dirname/$basename.stm ]; then
  echo "Needs either a .cha or .stm file to get utterances"
  exit 1
fi

#if [ ! -f $dirname/$basename.txt ]; then
#  echo "Needs .txt file with utterance per line as reference text to align"
#  exit 1
#fi

# make segments from $1.stm
cat build/output/$basename.stm | grep -v ';;' | grep -v "inter_segment_gap" | grep -v "ignore_time_segment_in_scoring" | awk '{OFMT = "%.0f"; print $1,$2,$4*100,($5-$4)*100,"M S U",$2}' > build/diarization/$basename/show.seg


# Generate features
cd $BASEDIR
rm -rf build/trans/$basename

make SEGMENTS=show.seg build/trans/$basename/fbank

# Expect test text in format with utterance IDs per line
uttdata=build/trans/$basename
#if [ -f $dirname/$basename.txt ];
#  then
#    echo "Aligning text found at $dirname/$basename.txt"
#    cat $dirname/$basename.txt | awk '{print NR" "$0}' > $uttdata/text
#  else
    echo "Aligning text found in build/output/$basename.stm"
    cat build/output/$basename.stm | awk '{$1="";$2="";$3="";$4="";$5="";$6=""; print NR$0}' \
	| sed 's/ \+/ /' > $uttdata/text
#fi
cp build/diarization/$basename/show.seg $uttdata

#local/align_ctc_multi_utts.sh --acoustic_scale 0.8 $GRAPH_DIR $GRAPH_DIR $uttdata  $MODEL_DIR $uttdata/align
#                                                   <langdir>  <data>     <uttdata> <mdldir>   <dir>
local/align_ctc_multi_utts.sh --acoustic_scale $lm_weight $GRAPH_DIR $GRAPH_DIR $uttdata  $MODEL_DIR $uttdata/align

# Copy results to someplace useful
cp $uttdata/align/ali build/output/$basename.ali
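
To check what the "make segments" step in the script above produces, the same grep/awk pipeline can be run on a single STM line. This is a sketch for inspection only: `stm_to_seg` is a hypothetical wrapper name, and the reading of the two numeric fields as start and duration in hundredths of a second is inferred from the awk expression, not from any documentation.

```shell
#!/bin/bash
# Hypothetical wrapper around the pipeline used in align.sh: read STM lines
# on stdin, drop comment and gap/ignore lines, and print show.seg lines.
stm_to_seg() {
  grep -v ';;' \
    | grep -v -e 'inter_segment_gap' -e 'ignore_time_segment_in_scoring' \
    | awk '{OFMT = "%.0f"; print $1, $2, $4*100, ($5-$4)*100, "M S U", $2}'
}

# Demo on the first STM line from the original report
# (start 3.103 s, end 5.606 s).
printf '11\tA\tFakeSpeaker\t3.103\t5.606\tand now a fireside chat\n' | stm_to_seg
# prints: 11 A 310 250 M S U A
```

If the segments in the final .ali file do not line up with these start and duration values, the STM is probably not the source of the segmentation.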

aolney avatar May 31 '19 17:05 aolney

I will need to look into this some other time. Please let me know if you have any other information or updates.

fmetze avatar Jun 04 '19 18:06 fmetze

Only that once the STM was properly used, the doubling issue went away. However, the alignments still seemed off.

aolney avatar Jun 04 '19 20:06 aolney