Prediction got error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: unexpected end of data
I am seeing a problem somewhat similar to https://github.com/google/seq2seq/issues/170 but slightly different. In my case:
- I was able to train a character-level NMT model without problems. Both the source and target files are in UTF-8 encoding, and I use Python 3 throughout. The model converges slowly and steadily; everything looks normal.
- However, when doing prediction I then see the UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: unexpected end of data message (where the 0xe8 part varies depending on what is in the test source).
My observations:
- If I change the test data to contain only English, the error does not occur, but the prediction result is gibberish, as expected (since the model was never trained on English).
- If I insert even a single multibyte Unicode character anywhere in the test data file, the error shows up. Note that the model was trained entirely on Unicode data at the character level.
- For prediction (see my script below), if I change the source_delimiter parameter to ' ' (i.e., the ASCII space character, as opposed to the empty string that enforces character-level prediction), the error does not occur, but the predicted output is gibberish, as expected (since the model was trained with the delimiter set to the empty string).
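As a side note, the varying byte in the error message always looks like a UTF-8 lead byte; 0xe8, for instance, begins the three-byte encoding of many CJK characters. A quick illustration in plain Python 3 (the string '这是' is just an arbitrary sample, not from my actual data):
s = '这是'
print(list(s))                  # ['这', '是']: two characters
print(list(s.encode('utf-8')))  # [232, 191, 153, 230, 152, 175]: six raw bytes, where 232 == 0xe8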
Does anybody have insight into how to deal with this problem?
Here is my training script:
#!/bin/bash
BASE=/home/kaihu/ml/tests/seq2seq
BASE2=.
TRAIN_SOURCES=${BASE2}/train/sources.txt
TRAIN_TARGETS=${BASE2}/train/targets.txt
VOCAB_SOURCE=${BASE2}/train/vocab.sources.txt
VOCAB_TARGET=${BASE2}/train/vocab.targets.txt
DEV_SOURCES=${BASE2}/dev/sources.txt
DEV_TARGETS=${BASE2}/dev/targets.txt
MODEL_DIR=${BASE2}/models
python3 ${BASE}/bin/train.py --config_paths ${BASE}/example_configs/nmt_conv_small.yml \
  --model_params "
    vocab_source: $VOCAB_SOURCE
    vocab_target: $VOCAB_TARGET" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_delimiter: ''
      source_files:
        - $TRAIN_SOURCES
      target_delimiter: ''
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
      source_delimiter: ''
      source_files:
        - $DEV_SOURCES
      target_delimiter: ''
      target_files:
        - $DEV_TARGETS" \
  --batch_size 64 \
  --gpu_allow_growth True \
  --train_steps 1000000 \
  --eval_every_n_steps 2000 \
  --output_dir $MODEL_DIR
Here is my prediction script:
#!/bin/bash
BASE=/home/kaihu/ml/tests/seq2seq
MODEL_DIR=./models
DEV_SOURCES=./dev/sources.txt
export PRED_DIR=.
python3 ${BASE}/bin/infer.py \
  --tasks "
    - class: DecodeText
      params:
        delimiter: ''
        unk_replace: False" \
  --model_dir $MODEL_DIR \
  --input_pipeline "
    class: ParallelTextInputPipeline
    params:
      source_delimiter: ''
      source_files:
        - $DEV_SOURCES" \
  > ${PRED_DIR}/predictions.txt
Here is the trace from running the prediction script:
$ ./predict
INFO:tensorflow:Creating ParallelTextInputPipeline in mode=infer
INFO:tensorflow:
ParallelTextInputPipeline:
num_epochs: 1
shuffle: false
source_delimiter: ''
source_files: [./dev/sources.txt]
target_delimiter: ' '
target_files: []
INFO:tensorflow:Creating AttentionSeq2Seq in mode=infer
INFO:tensorflow:
AttentionSeq2Seq:
attention.class: seq2seq.decoders.attention.AttentionLayerBahdanau
attention.params: {num_units: 128}
bridge.class: seq2seq.models.bridges.ZeroBridge
bridge.params: {}
decoder.class: seq2seq.decoders.AttentionDecoder
decoder.params:
  rnn_cell:
    cell_class: GRUCell
    cell_params: {num_units: 128}
    dropout_input_keep_prob: 0.8
    dropout_output_keep_prob: 1.0
    num_layers: 1
embedding.dim: 128
embedding.init_scale: 0.04
embedding.share: false
encoder.class: seq2seq.encoders.ConvEncoder
encoder.params: {attention_cnn.kernel_size: 3, attention_cnn.layers: 6, attention_cnn.units: 128,
output_cnn.kernel_size: 3, output_cnn.layers: 3, output_cnn.units: 128, position_embeddings.combiner_fn: tensorflow.multiply,
position_embeddings.enable: true, position_embeddings.num_positions: 52}
inference.beam_search.beam_width: 0
inference.beam_search.choose_successors_fn: choose_top_k
inference.beam_search.length_penalty_weight: 0.0
optimizer.clip_embed_gradients: 0.1
optimizer.clip_gradients: 5.0
optimizer.learning_rate: 0.0001
optimizer.lr_decay_rate: 0.99
optimizer.lr_decay_steps: 100
optimizer.lr_decay_type: ''
optimizer.lr_min_learning_rate: 1.0e-12
optimizer.lr_staircase: false
optimizer.lr_start_decay_at: 0
optimizer.lr_stop_decay_at: 2147483647
optimizer.name: Adam
optimizer.params: {}
optimizer.sync_replicas: 0
optimizer.sync_replicas_to_aggregate: 0
source.max_seq_len: 50
source.reverse: false
target.max_seq_len: 50
vocab_source: ./train/vocab.sources.txt
vocab_target: ./train/vocab.targets.txt
INFO:tensorflow:Creating DecodeText in mode=infer
INFO:tensorflow:
DecodeText: {delimiter: '', postproc_fn: '', unk_mapping: null, unk_replace: false}
INFO:tensorflow:Creating vocabulary lookup table of size 3711
INFO:tensorflow:Creating vocabulary lookup table of size 3711
INFO:tensorflow:Creating ConvEncoder in mode=infer
INFO:tensorflow:
ConvEncoder: {attention_cnn.kernel_size: 3, attention_cnn.layers: 6, attention_cnn.units: 128,
embedding_dropout_keep_prob: 0.8, output_cnn.kernel_size: 3, output_cnn.layers: 3,
output_cnn.units: 128, position_embeddings.combiner_fn: tensorflow.multiply, position_embeddings.enable: true,
position_embeddings.num_positions: 52}
INFO:tensorflow:Creating AttentionLayerBahdanau in mode=infer
INFO:tensorflow:
AttentionLayerBahdanau: {num_units: 128}
INFO:tensorflow:Creating AttentionDecoder in mode=infer
INFO:tensorflow:
AttentionDecoder:
init_scale: 0.04
max_decode_length: 100
rnn_cell:
  cell_class: GRUCell
  cell_params: {num_units: 128}
  dropout_input_keep_prob: 0.8
  dropout_output_keep_prob: 1.0
  num_layers: 1
  residual_combiner: add
  residual_connections: false
  residual_dense: false
INFO:tensorflow:Creating ZeroBridge in mode=infer
INFO:tensorflow:
ZeroBridge: {}
2017-09-20 10:30:25.813803: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-20 10:30:25.813824: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-20 10:30:25.813831: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-20 10:30:25.813836: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-20 10:30:25.813841: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-20 10:30:25.816217: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2017-09-20 10:30:25.816249: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: turing5
2017-09-20 10:30:25.816256: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: turing5
2017-09-20 10:30:25.816282: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.82.0
2017-09-20 10:30:25.816303: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.82 Wed Jul 19 21:16:49 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
"""
2017-09-20 10:30:25.816317: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 375.82.0
2017-09-20 10:30:25.816323: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 375.82.0
INFO:tensorflow:Restoring parameters from ./models/model.ckpt-210003
INFO:tensorflow:Restored model from ./models/model.ckpt-210003
Traceback (most recent call last):
  File "/home/kaihu/ml/tests/seq2seq/bin/infer.py", line 129, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/kaihu/ml/tests/seq2seq/bin/infer.py", line 125, in main
    sess.run([])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 518, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 862, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 818, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 980, in run
    run_metadata=run_metadata))
  File "/home/kaihu/ml/tests/seq2seq/seq2seq/tasks/decode_text.py", line 165, in after_run
    fetches["features.source_tokens"].astype("S"), "utf-8")
  File "/home/kaihu/.local/lib/python3.5/site-packages/numpy/core/defchararray.py", line 505, in decode
    _vec_string(a, object_, 'decode', _clean_args(encoding, errors)))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: unexpected end of data
Upon further investigation I found that the error came from line #164 in decode_text.py:
fetches["features.source_tokens"] = np.char.decode(
    fetches["features.source_tokens"].astype("S"), 'utf-8')
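The failure is easy to reproduce outside of seq2seq. Here is a minimal sketch (the token values are made up to mirror my data, using the bytes of '这' as an example): per-element decoding fails on any token that is a fragment of a multibyte UTF-8 sequence, while joining the bytes first succeeds:
import numpy as np

tokens = np.array([b'\xe8', b'\xbf', b'\x99'])  # the three UTF-8 bytes of '这', one byte per token

try:
    np.char.decode(tokens.astype('S'), 'utf-8')  # per-element decode, as in decode_text.py
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0xe8 in position 0: unexpected end of data

print(b''.join(tokens).decode('utf-8'))  # joining the byte tokens first recovers '这'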
If I change the statement to the following:
fetches["features.source_tokens"] = np.char.decode(
    b''.join(fetches["features.source_tokens"][:-1]), 'utf-8')
then the problem goes away, and the following statement:
print(fetches["features.source_tokens"])
also displays the correct Unicode string from the test dataset. However, the predicted output (i.e., fetches["predicted_tokens"]) contains nothing but b'UNK' tokens, even though training appeared to converge to a small loss of 0.01, and the test data is in fact a subset of the original training data (reused here just to investigate this problem).
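My suspicion about the b'UNK' output, as a toy sketch (the vocab entries and tokens here are hypothetical): if the vocab file holds whole characters but the input pipeline emits single bytes, then no token ever matches a vocab entry, so every lookup falls back to UNK:
# Hypothetical character-level vocab, keyed by byte strings as a lookup table would see them
vocab = {ch.encode('utf-8'): i for i, ch in enumerate('这是')}  # {b'\xe8\xbf\x99': 0, b'\xe6\x98\xaf': 1}
byte_tokens = [b'\xe8', b'\xbf', b'\x99']  # what a per-byte split emits for '这'
print([vocab.get(tok, 'UNK') for tok in byte_tokens])  # ['UNK', 'UNK', 'UNK']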
Has anybody successfully used seq2seq to train a character-level NMT model on a Unicode dataset? The evidence above suggests that the prediction phase could never have worked, and that the training part is also suspect. It is of course possible that I have made a silly mistake somewhere, in which case any advice would be much appreciated.
To answer my own question about why training on character-level multibyte UTF-8 data produces a useless model: I believe the training phase failed due to a bug in TensorFlow. seq2seq.data.split_tokens_decoder.py calls tf.string_split, and per the TensorFlow API docs: "If delimiter is an empty string, each element of the source is split into individual strings, each containing one byte. (This includes splitting multibyte sequences of UTF-8.)" That is exactly the wrong thing to do for character-level data. I also found a TensorFlow pull request from just a few days ago that seems to confirm my reasoning:
#153
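The documented behavior is easy to see directly. A minimal sketch against the TF 1.x API that this repo targets ('这是' is again an arbitrary sample string, not my data):
import tensorflow as tf

# With an empty delimiter, tf.string_split splits into single BYTES, not characters
split = tf.string_split(['这是'], delimiter='')
with tf.Session() as sess:
    print(sess.run(split.values))
    # [b'\xe8' b'\xbf' b'\x99' b'\xe6' b'\x98' b'\xaf']
    # six one-byte tokens for two characters; none is valid UTF-8 on its own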