About preprocessing data in ASR

Open alessiaatunimi opened this issue 5 years ago • 9 comments

Hi, I'm trying to download and preprocess the dataset for the ASR task with the scripts lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh and lingvo/tasks/asr/tools/librispeech.04.parameterize_devtest.sh. I want to put the data in a volume folder so that I don't lose it when the container stops.

Here's what I did:

Setting up environment

$ LINGVO_DIR="/tmp/lingvo"
$ rm -rf "$LINGVO_DIR"
$ git clone https://github.com/tensorflow/lingvo.git "$LINGVO_DIR"
$ cd "$LINGVO_DIR"

Create docker image

$ docker build --tag tensorflow:lingvo - < docker/dev.dockerfile --no-cache

Create docker volume for the dataset

$ docker volume create librispeech

Run container with the empty volume for the dataset and the volume with the repository folder

$ docker run -it -v librispeech:/tmp/librispeech -v ${LINGVO_DIR}:/tmp/lingvo --name lingvo tensorflow:lingvo bash

This passes, so I'm good:

$ bazel test -c opt //lingvo/core/ops:beam_search_step_op_test

Download dataset

$ lingvo/tasks/asr/tools/librispeech.01.download_train.sh
$ lingvo/tasks/asr/tools/librispeech.02.download_devtest.sh

Parametrize dataset

$ bazel build -c opt //lingvo/tools:create_asr_features

Here's when I get the error:

$ lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh

error:

I0819 21:27:02.294093 140003015599872 create_asr_features.py:84] First pass: loading text files...
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 204, in main
    _DumpTranscripts()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 117, in _DumpTranscripts
    trans = _ReadTranscriptions()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 108, in _ReadTranscriptions
    uttid, txt = l.strip('\n').split(' ', 1)
TypeError: a bytes-like object is required, not 'str'

Can someone tell me why? Thanks

alessiaatunimi avatar Aug 19 '19 21:08 alessiaatunimi

Are you running on py3 by any chance?

drpngx avatar Aug 19 '19 22:08 drpngx

If you mean the Python version, I'm using Python 2.7.12. It's not the first time I've run this file, and it has always worked...

alessiaatunimi avatar Aug 19 '19 23:08 alessiaatunimi

Nothing springs to mind. Can you print type(l)?

drpngx avatar Aug 20 '19 05:08 drpngx

I inserted print statements before line 108 of lingvo/tools/create_asr_features.py like this:

    for l in f.readlines():
      print('here is the type of l:')
      print(type(l))
      uttid, txt = l.strip('\n').split(' ', 1)

and here's the output:

here is the type of l:
<class 'bytes'>

After that, I get the same error as before. (I don't know if this is what you meant by "Can you print type(l)?" Let me know if you need any other information.)
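
A note on why type(l) is bytes here: the traceback paths point at /usr/local/lib/python3.5, which suggests the tool itself ran under Python 3 even if the shell's default python is 2.7. Under Python 3, tarfile's extractfile() returns a binary file object, so readlines() yields bytes. A minimal sketch of that behaviour, using a made-up in-memory tarball (the file name and contents are assumptions, not real LibriSpeech data):

```python
import io
import tarfile


def make_demo_tarball():
    """Build a tiny gzipped tarball in memory holding one transcript file."""
    payload = b"374-180298-0038 HELLO WORLD\n"  # made-up transcript line
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        # Hypothetical member name, chosen only to end in '.trans.txt'.
        info = tarfile.TarInfo(name="demo/374-180298.trans.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def first_line_from_tarball(buf):
    """Return the first raw line of the first regular file in the tarball."""
    with tarfile.open(fileobj=buf, mode="r:gz") as tar:
        for tarinfo in tar:
            if tarinfo.isreg():
                f = tar.extractfile(tarinfo)
                return f.readlines()[0]
    return None


line = first_line_from_tarball(make_demo_tarball())
print(type(line))
```

Calling line.strip('\n') on that value under Python 3 raises the same TypeError as in the traceback, because a bytes object cannot be stripped with a str argument.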

Also, doing git clone https://github.com/tensorflow/lingvo.git "$LINGVO_DIR" I get these errors along the way:

ERROR: proto-google-cloud-datastore-v1 0.90.4 has requirement oauth2client<4.0dev,>=2.0.0, but you'll have oauth2client 4.1.3 which is incompatible.
ERROR: apache-beam 2.14.0 has requirement httplib2<=0.12.0,>=0.8, but you'll have httplib2 0.13.1 which is incompatible.
ERROR: apache-beam 2.14.0 has requirement oauth2client<4,>=2.0.1, but you'll have oauth2client 4.1.3 which is incompatible.
ERROR: apache-beam 2.14.0 has requirement typing<3.7.0,>=3.6.0; python_version < "3.5.0", but you'll have typing 3.7.4 which is incompatible.
ERROR: googledatastore 7.0.2 has requirement httplib2<=0.12.0,>=0.9.1, but you'll have httplib2 0.13.1 which is incompatible.
ERROR: googledatastore 7.0.2 has requirement oauth2client<4.0.0,>=2.0.1, but you'll have oauth2client 4.1.3 which is incompatible.
ERROR: apache-beam 2.14.0 has requirement httplib2<=0.12.0,>=0.8, but you'll have httplib2 0.13.1 which is incompatible.
ERROR: apache-beam 2.14.0 has requirement oauth2client<4,>=2.0.1, but you'll have oauth2client 4.1.3 which is incompatible.

but then the output is

Successfully built 69e33c7cb4ab
Successfully tagged tensorflow:lingvo

So I thought it was not a problem.

alessiaatunimi avatar Aug 20 '19 15:08 alessiaatunimi

The only thing I changed since the last time this file ran correctly is that I made /tmp/librispeech a volume... could that be the problem?

alessiaatunimi avatar Aug 20 '19 15:08 alessiaatunimi

I don't really understand why readlines would return bytes instead of a string. Just do str(l).strip(..).split(..) instead. You can send a pull request.
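
One caveat with str(l), assuming the script runs under Python 3: str() on a bytes object returns its repr (the b'...' wrapper with an escaped newline) rather than the decoded text, whereas decode() gives a normal string. A small sketch with a made-up sample line (the utterance text and the UTF-8 encoding are assumptions):

```python
raw = b"374-180298-0038 SOME TRANSCRIPT TEXT\n"  # made-up sample line

# str() on bytes returns the repr, so the b' prefix ends up inside the string.
as_repr = str(raw)
uttid_bad, _ = as_repr.strip("\n").split(" ", 1)

# decode() converts the bytes to text, so the split yields a clean id.
as_text = raw.decode("utf-8")
uttid_good, txt = as_text.strip("\n").split(" ", 1)

print(repr(uttid_bad))
print(repr(uttid_good))
```

With the first form, any dictionary keyed by uttid carries the prefix, so a later lookup by the plain utterance id would not find it.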

drpngx avatar Aug 20 '19 18:08 drpngx

I changed the line:

uttid, txt = l.strip('\n').split(' ', 1)

into

uttid, txt = str(l).strip('\n').split(' ', 1)

This way, the first pass ("=== First pass, collecting transcripts: ${subset}") went well for every subset. Then, for each subset, the second pass ("=== Second pass, parameterization: ${subset}") ended like this:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
    _CreateAsrFeatures()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
    assert uttid in trans, uttid
AssertionError: 374-180298-0038
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
    _CreateAsrFeatures()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
    assert uttid in trans, uttid
AssertionError: 374-180298-0060
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
    _CreateAsrFeatures()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
    assert uttid in trans, uttid
AssertionError: 374-180298-0006
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
    _CreateAsrFeatures()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
    assert uttid in trans, uttid
AssertionError: 374-180298-0021
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
    _CreateAsrFeatures()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
    assert uttid in trans, uttid
AssertionError: 374-180298-0028
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
    _CreateAsrFeatures()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
    assert uttid in trans, uttid
AssertionError: 374-180298-0022
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
    _CreateAsrFeatures()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
    assert uttid in trans, uttid
AssertionError: 374-180298-0044
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
    _CreateAsrFeatures()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
    assert uttid in trans, uttid
AssertionError: 374-180298-0036
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
    _CreateAsrFeatures()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
    assert uttid in trans, uttid
AssertionError: 374-180298-0045
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 213, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 206, in main
    _CreateAsrFeatures()
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/tools/create_asr_features.runfiles/__main__/lingvo/tools/create_asr_features.py", line 190, in _CreateAsrFeatures
    assert uttid in trans, uttid
AssertionError: 374-180298-0051
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED
+ touch FAILED

alessiaatunimi avatar Aug 20 '19 19:08 alessiaatunimi

SOLUTION

I finally understood where the problem came from. Because I changed the line:

uttid, txt = l.strip('\n').split(' ', 1)

into

uttid, txt = str(l).strip('\n').split(' ', 1)

uttid was stringified, so the line

trans[uttid] = txt

created a dictionary like:

"b'6877-77361-0040": "IT ENTERED INTO MY THOUGHTS THAT I MIGHT END THE MATTER NOW AND LET THESE OTHERS GO TO WADE OUT INTO THE SEA INTO THIS WARM LAPPING THAT MINGLED THE NATURES OF WATER AND LIGHT TO STAND THERE BREAST HIGH\\n'"

and so looking up the plain uttid in trans didn't work. I solved it by slightly changing the _ReadTranscriptions() function:

def _ReadTranscriptions():
  """Read all transcription files from the tarball.
  Returns:
    A map of utterance id to upper case transcription.
  """
  tar = tarfile.open(FLAGS.input_tarball, mode='r:gz')
  n = 0
  tf.logging.info('First pass: loading text files...')
  trans = {}
  for tarinfo in tar:
    if not tarinfo.isreg():
      continue
    n += 1
    if 0 == n % 10000:
      tf.logging.info('Scanned %d entries...', n)
    if not tarinfo.name.endswith('.trans.txt'):
      continue
    key = tarinfo.name.strip('.trans.txt')
    f = tar.extractfile(tarinfo)
    u = 0
    # Beginning of the changed part.
    for l in f.readlines():
      uttid, txt = str(l).strip('\n').split(' ', 1)
      # The if-else is needed because some uttids came out in the form
      # "b'numbers" and others in the form 'b"numbers', so a single split
      # wasn't enough.
      if len(uttid.split("b'")) > 1:
        trans[uttid.split("b'")[1]] = txt
      else:
        trans[uttid.split('b"')[1]] = txt
      # End of the changed part.
      u += 1
    tf.logging.info('[%s] = %d utterances', key, u)
    f.close()
  return trans

This way it seems to work.
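
A possible alternative to splitting on b' and b", sketched under the assumption that the transcripts are UTF-8 (or plain ASCII) text: decode each line before splitting, so the keys never pick up the repr prefix in the first place. parse_transcript_lines is a hypothetical helper, not a lingvo function, and the sample lines are made up:

```python
def parse_transcript_lines(lines, encoding="utf-8"):
    """Map utterance id -> transcription from raw transcript lines.

    Hypothetical helper, not part of lingvo. Accepts bytes or str lines.
    """
    trans = {}
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode(encoding)
        uttid, txt = line.strip("\n").split(" ", 1)
        trans[uttid] = txt
    return trans


sample = [
    b"374-180298-0038 PLAIN TEXT\n",
    b"374-180298-0039 TEXT WITH AN APOSTROPHE IT'S HERE\n",
]
trans = parse_transcript_lines(sample)

# For comparison: the first two characters of str() applied to each raw line.
reprs = [str(line)[:2] for line in sample]
print(trans)
print(reprs)
```

The reprs list shows why two prefixes appeared: Python switches the repr to double quotes when the line contains an apostrophe. Decoding first also keeps the trailing newline and closing quote out of the stored transcription.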

alessiaatunimi avatar Aug 28 '19 22:08 alessiaatunimi

@alessiaatunimi I faced the same issue. It is fixed by using the solution you suggested. Thanks

manish-kumar-garg avatar Oct 09 '19 10:10 manish-kumar-garg