lingvo
About preprocessing data in ASR
Hi, I'm trying to download and preprocess the dataset for the ASR task with the scripts lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh and lingvo/tasks/asr/tools/librispeech.04.parameterize_devtest.sh. I want to put the data in a Docker volume, so that I don't lose it when the container stops.
Here's what I did:
Setting up environment
$ LINGVO_DIR="/tmp/lingvo"
$ rm -rf "$LINGVO_DIR"
$ git clone https://github.com/tensorflow/lingvo.git "$LINGVO_DIR"
$ cd "$LINGVO_DIR"
Create docker image
$ docker build --tag tensorflow:lingvo - < docker/dev.dockerfile --no-cache
Create docker volume for the dataset
$ docker volume create librispeech
Run container with the empty volume for the dataset and the volume with the repository folder
$ docker run -it -v librispeech:/tmp/librispeech -v ${LINGVO_DIR}:/tmp/lingvo --name lingvo tensorflow:lingvo bash
This test passes, so the setup looks good
$ bazel test -c opt //lingvo/core/ops:beam_search_step_op_test
Download dataset
$ lingvo/tasks/asr/tools/librispeech.01.download_train.sh
$ lingvo/tasks/asr/tools/librispeech.02.download_devtest.sh
Parametrize dataset
$ bazel build -c opt //lingvo/tools:create_asr_features
Here's when I get the error:
$ lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh
error:
I0819 21:27:02.294093 140003015599872 create_asr_features.py:84] First pass: loading text files...
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 204, in main
_DumpTranscripts()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 117, in _DumpTranscripts
trans = _ReadTranscriptions()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 108, in _ReadTranscriptions
uttid, txt = l.strip('\n').split(' ', 1)
TypeError: a bytes-like object is required, not 'str'
Can someone tell me why? Thanks
Are you running on py3 by any chance?
If you mean the Python version, I'm using Python 2.7.12. It's not the first time I've run this script, and it has always worked...
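A note on checking this: the traceback in the original post lists paths under /usr/local/lib/python3.5/dist-packages, which suggests the bazel-built tool may be running under Python 3 even if the shell's default python is 2.7. A minimal sketch for confirming which interpreter actually executes a script is below; the helper name describe_interpreter is hypothetical and not part of lingvo. It would need to run from inside the script in question (for example, temporarily added near the top of create_asr_features.py), not from a separate shell.

```python
import sys


def describe_interpreter():
  """Return a short string identifying the running Python interpreter."""
  v = sys.version_info
  return 'Python %d.%d.%d at %s' % (v.major, v.minor, v.micro, sys.executable)


# Printing this from inside the script shows the interpreter that really
# runs it, which may differ from what `python --version` reports in a shell.
print(describe_interpreter())
```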
Nothing springs to mind. Can you print type(l)?
I inserted a print before line 108 of lingvo/tools/create_asr_features.py, like this:
for l in f.readlines():
  print('here is the type of l:')
  print(type(l))
  uttid, txt = l.strip('\n').split(' ', 1)
and here's the output
here is the type of l:
<class 'bytes'>
After that, I get the same error as before. (I don't know if this is what you meant by "Can you print type(l)?". Let me know if you need any other information.)
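The <class 'bytes'> output is consistent with how tarfile behaves under Python 3: files returned by TarFile.extractfile() are opened in binary mode, so readlines() yields bytes, and calling strip('\n') with a str argument on bytes raises the TypeError seen above. A small self-contained sketch that reproduces this with an in-memory tarball is below. The helper names and the fake transcript line are made up for illustration and are not part of lingvo.

```python
import io
import tarfile


def make_tarball():
  """Build an in-memory .tar.gz holding one fake transcript file."""
  data = b'374-180298-0038 HELLO WORLD\n'
  buf = io.BytesIO()
  with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    info = tarfile.TarInfo(name='374/180298/374-180298.trans.txt')
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
  buf.seek(0)
  return buf


def first_line_and_error():
  """Read the first line the way the tool does and capture any TypeError."""
  with tarfile.open(fileobj=make_tarball(), mode='r:gz') as tar:
    for tarinfo in tar:
      f = tar.extractfile(tarinfo)
      line = f.readlines()[0]
      try:
        line.strip('\n')  # str argument on a bytes object
        err = None
      except TypeError as e:
        err = e
      return line, err


line, err = first_line_and_error()
print(type(line))
print(repr(err))
```

Under Python 2, bytes and str are the same type, so the same call succeeds there. That would explain why the script worked before if it previously ran under Python 2.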
Also, after running git clone https://github.com/tensorflow/lingvo.git "$LINGVO_DIR", I get these errors along the way:
ERROR: proto-google-cloud-datastore-v1 0.90.4 has requirement oauth2client<4.0dev,>=2.0.0, but you'll have oauth2client 4.1.3 which is incompatible.
ERROR: apache-beam 2.14.0 has requirement httplib2<=0.12.0,>=0.8, but you'll have httplib2 0.13.1 which is incompatible.
ERROR: apache-beam 2.14.0 has requirement oauth2client<4,>=2.0.1, but you'll have oauth2client 4.1.3 which is incompatible.
ERROR: apache-beam 2.14.0 has requirement typing<3.7.0,>=3.6.0; python_version < "3.5.0", but you'll have typing 3.7.4 which is incompatible.
ERROR: googledatastore 7.0.2 has requirement httplib2<=0.12.0,>=0.9.1, but you'll have httplib2 0.13.1 which is incompatible.
ERROR: googledatastore 7.0.2 has requirement oauth2client<4.0.0,>=2.0.1, but you'll have oauth2client 4.1.3 which is incompatible.
ERROR: apache-beam 2.14.0 has requirement httplib2<=0.12.0,>=0.8, but you'll have httplib2 0.13.1 which is incompatible.
ERROR: apache-beam 2.14.0 has requirement oauth2client<4,>=2.0.1, but you'll have oauth2client 4.1.3 which is incompatible.
but then the output is
Successfully built 69e33c7cb4ab
Successfully tagged tensorflow:lingvo
So I thought it was not a problem
The only thing I changed since the last time this script ran correctly is that I made /tmp/librispeech a volume... could that be the problem?
I don't really understand why readlines would return bytes instead of a string. Just do str(l).strip(..).split(..) instead. You can send a pull request.
I changed the line:
uttid, txt = l.strip('\n').split(' ', 1)
into
uttid, txt = str(l).strip('\n').split(' ', 1)
This way, the first pass ("=== First pass, collecting transcripts: ${subset}") went well for every subset. Then, for each subset, the second pass ("=== Second pass, parameterization: ${subset}") ends like this:
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
_CreateAsrFeatures()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
assert uttid in trans, uttid
AssertionError: 374-180298-0038
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
_CreateAsrFeatures()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
assert uttid in trans, uttid
AssertionError: 374-180298-0060
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
_CreateAsrFeatures()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
assert uttid in trans, uttid
AssertionError: 374-180298-0006
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
_CreateAsrFeatures()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
assert uttid in trans, uttid
AssertionError: 374-180298-0021
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
_CreateAsrFeatures()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
assert uttid in trans, uttid
AssertionError: 374-180298-0028
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
_CreateAsrFeatures()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
assert uttid in trans, uttid
AssertionError: 374-180298-0022
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
_CreateAsrFeatures()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
assert uttid in trans, uttid
AssertionError: 374-180298-0044
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
_CreateAsrFeatures()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
assert uttid in trans, uttid
AssertionError: 374-180298-0036
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
_CreateAsrFeatures()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
assert uttid in trans, uttid
AssertionError: 374-180298-0045
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
_CreateAsrFeatures()
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
assert uttid in trans, uttid
AssertionError: 374-180298-0051
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
SOLUTION
I finally understand where the problem came from. Because I changed the line:
uttid, txt = l.strip('\n').split(' ', 1)
into
uttid, txt = str(l).strip('\n').split(' ', 1)
uttid was stringified, and so the line
trans[uttid] = txt
created a dictionary like:
"b'6877-77361-0040": "IT ENTERED INTO MY THOUGHTS THAT I MIGHT END THE MATTER NOW AND LET THESE OTHERS GO TO WADE OUT INTO THE SEA INTO THIS WARM LAPPING THAT MINGLED THE NATURES OF WATER AND LIGHT TO STAND THERE BREAST HIGH\\n'"
and so the lookup of uttid in trans didn't work. I solved it by changing the _ReadTranscriptions() function a little:
def _ReadTranscriptions():
  """Read all transcription files from the tarball.

  Returns:
    A map of utterance id to upper case transcription.
  """
  tar = tarfile.open(FLAGS.input_tarball, mode='r:gz')
  n = 0
  tf.logging.info('First pass: loading text files...')
  trans = {}
  for tarinfo in tar:
    if not tarinfo.isreg():
      continue
    n += 1
    if 0 == n % 10000:
      tf.logging.info('Scanned %d entries...', n)
    if not tarinfo.name.endswith('.trans.txt'):
      continue
    key = tarinfo.name.strip('.trans.txt')
    f = tar.extractfile(tarinfo)
    u = 0
    # Beginning of changed part.
    for l in f.readlines():
      uttid, txt = str(l).strip('\n').split(' ', 1)
      # The if/else is needed because some uttids come out as "b'<numbers>"
      # and others as 'b"<numbers>', so a single split isn't enough.
      if len(uttid.split("b'")) > 1:
        trans[uttid.split("b'")[1]] = txt
      else:
        trans[uttid.split('b"')[1]] = txt
      # End of changed part.
      u += 1
    tf.logging.info('[%s] = %d utterances', key, u)
    f.close()
  return trans
This seems to work.
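One caveat with the str(l) workaround: as the dictionary example above shows, the transcript text still ends with the literal characters \\n', because str() on a bytes object returns its printable representation rather than decoding it. A possibly simpler variant is to decode the bytes before splitting. The sketch below assumes the transcript files are UTF-8 (or plain ASCII) text; parse_transcript_line is a hypothetical helper, not a function in lingvo.

```python
def parse_transcript_line(line):
  """Split one raw transcript line into (uttid, text).

  Accepts bytes (as returned by tarfile under Python 3) or str.
  Assumes UTF-8 encoded text, which is an assumption of this sketch.
  """
  if isinstance(line, bytes):
    line = line.decode('utf-8')
  uttid, txt = line.strip('\n').split(' ', 1)
  return uttid, txt


raw = b'6877-77361-0040 IT ENTERED INTO MY THOUGHTS\n'

# str(raw) keeps the b'...' wrapper, so the key gets a stray prefix.
print(repr(str(raw).split(' ', 1)[0]))

# Decoding first yields a clean key and a clean transcript.
print(parse_transcript_line(raw))
```

With this approach the loop body would reduce to a single assignment of trans[uttid] = txt, with no need for the if/else on the b' and b" prefixes.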
@alessiaatunimi I faced the same issue. It is fixed by using the solution you suggested. Thanks