Prediction got error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: unexpected end of data
I am seeing a problem somewhat similar to https://github.com/google/seq2seq/issues/170 but slightly different. In my case:
- I was able to train a character-level NMT model without problems. Both the source and target files are in UTF-8 encoding, and I use Python 3 throughout. The model converges slowly and steadily; everything looks normal.
- However, when doing prediction I then see the UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: unexpected end of data message (where the 0xe8 part varies depending on what is in the test source).
My observations:
- If I change the test data to contain only English, the error does not occur, but the prediction result is gibberish, as expected (since the model was never trained on English).
- If I insert even a single multibyte Unicode character anywhere in the test data file, the error shows up. Note that the model was trained entirely on Unicode data at the character level.
- For prediction (see my script below), if I change the source_delimiter parameter to ' ' (i.e., the ASCII space character, as opposed to the empty string that enforces character-level prediction), the error does not occur, but the predicted output is gibberish, as expected (since the model was trained with the delimiter set to the empty string).
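As a side note, the varying byte in the error message always looks like a UTF-8 lead byte; 0xe8, for instance, begins the three-byte encoding of many CJK characters. A quick illustration in plain Python 3 (the string '这是' is just an arbitrary sample, not from my actual data):
s = '这是'
print(list(s))                  # ['这', '是']: two characters
print(list(s.encode('utf-8')))  # [232, 191, 153, 230, 152, 175]: six raw bytes, where 232 == 0xe8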
Does anybody have insight into how to deal with this problem?
Here is my training script:
#!/bin/bash
BASE=/home/kaihu/ml/tests/seq2seq
BASE2=.
TRAIN_SOURCES=${BASE2}/train/sources.txt
TRAIN_TARGETS=${BASE2}/train/targets.txt
VOCAB_SOURCE=${BASE2}/train/vocab.sources.txt
VOCAB_TARGET=${BASE2}/train/vocab.targets.txt
DEV_SOURCES=${BASE2}/dev/sources.txt
DEV_TARGETS=${BASE2}/dev/targets.txt
MODEL_DIR=${BASE2}/models
python3 ${BASE}/bin/train.py --config_paths ${BASE}/example_configs/nmt_conv_small.yml \
  --model_params "
    vocab_source: $VOCAB_SOURCE
    vocab_target: $VOCAB_TARGET" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_delimiter: ''
      source_files:
        - $TRAIN_SOURCES
      target_delimiter: ''
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
      source_delimiter: ''
      source_files:
        - $DEV_SOURCES
      target_delimiter: ''
      target_files:
        - $DEV_TARGETS" \
  --batch_size 64 \
  --gpu_allow_growth True \
  --train_steps 1000000 \
  --eval_every_n_steps 2000 \
  --output_dir $MODEL_DIR
Here is my prediction script:
#!/bin/bash
BASE=/home/kaihu/ml/tests/seq2seq
MODEL_DIR=./models
DEV_SOURCES=./dev/sources.txt
export PRED_DIR=.
python3 ${BASE}/bin/infer.py \
  --tasks "
    - class: DecodeText
      params:
        delimiter: ''
        unk_replace: False" \
  --model_dir $MODEL_DIR \
  --input_pipeline "
    class: ParallelTextInputPipeline
    params:
      source_delimiter: ''
      source_files:
        - $DEV_SOURCES" \
  > ${PRED_DIR}/predictions.txt
Here is the trace from running the prediction script:
$ ./predict
INFO:tensorflow:Creating ParallelTextInputPipeline in mode=infer
INFO:tensorflow:
ParallelTextInputPipeline:
num_epochs: 1
shuffle: false
source_delimiter: ''
source_files: [./dev/sources.txt]
target_delimiter: ' '
target_files: []
INFO:tensorflow:Creating AttentionSeq2Seq in mode=infer
INFO:tensorflow:
AttentionSeq2Seq:
attention.class: seq2seq.decoders.attention.AttentionLayerBahdanau
attention.params: {num_units: 128}
bridge.class: seq2seq.models.bridges.ZeroBridge
bridge.params: {}
decoder.class: seq2seq.decoders.AttentionDecoder
decoder.params:
  rnn_cell:
    cell_class: GRUCell
    cell_params: {num_units: 128}
    dropout_input_keep_prob: 0.8
    dropout_output_keep_prob: 1.0
    num_layers: 1
embedding.dim: 128
embedding.init_scale: 0.04
embedding.share: false
encoder.class: seq2seq.encoders.ConvEncoder
encoder.params: {attention_cnn.kernel_size: 3, attention_cnn.layers: 6, attention_cnn.units: 128,
output_cnn.kernel_size: 3, output_cnn.layers: 3, output_cnn.units: 128, position_embeddings.combiner_fn: tensorflow.multiply,
position_embeddings.enable: true, position_embeddings.num_positions: 52}
inference.beam_search.beam_width: 0
inference.beam_search.choose_successors_fn: choose_top_k
inference.beam_search.length_penalty_weight: 0.0
optimizer.clip_embed_gradients: 0.1
optimizer.clip_gradients: 5.0
optimizer.learning_rate: 0.0001
optimizer.lr_decay_rate: 0.99
optimizer.lr_decay_steps: 100
optimizer.lr_decay_type: ''
optimizer.lr_min_learning_rate: 1.0e-12
optimizer.lr_staircase: false
optimizer.lr_start_decay_at: 0
optimizer.lr_stop_decay_at: 2147483647
optimizer.name: Adam
optimizer.params: {}
optimizer.sync_replicas: 0
optimizer.sync_replicas_to_aggregate: 0
source.max_seq_len: 50
source.reverse: false
target.max_seq_len: 50
vocab_source: ./train/vocab.sources.txt
vocab_target: ./train/vocab.targets.txt
INFO:tensorflow:Creating DecodeText in mode=infer
INFO:tensorflow:
DecodeText: {delimiter: '', postproc_fn: '', unk_mapping: null, unk_replace: false}
INFO:tensorflow:Creating vocabulary lookup table of size 3711
INFO:tensorflow:Creating vocabulary lookup table of size 3711
INFO:tensorflow:Creating ConvEncoder in mode=infer
INFO:tensorflow:
ConvEncoder: {attention_cnn.kernel_size: 3, attention_cnn.layers: 6, attention_cnn.units: 128,
embedding_dropout_keep_prob: 0.8, output_cnn.kernel_size: 3, output_cnn.layers: 3,
output_cnn.units: 128, position_embeddings.combiner_fn: tensorflow.multiply, position_embeddings.enable: true,
position_embeddings.num_positions: 52}
INFO:tensorflow:Creating AttentionLayerBahdanau in mode=infer
INFO:tensorflow:
AttentionLayerBahdanau: {num_units: 128}
INFO:tensorflow:Creating AttentionDecoder in mode=infer
INFO:tensorflow:
AttentionDecoder:
init_scale: 0.04
max_decode_length: 100
rnn_cell:
  cell_class: GRUCell
  cell_params: {num_units: 128}
  dropout_input_keep_prob: 0.8
  dropout_output_keep_prob: 1.0
  num_layers: 1
  residual_combiner: add
  residual_connections: false
  residual_dense: false
INFO:tensorflow:Creating ZeroBridge in mode=infer
INFO:tensorflow:
ZeroBridge: {}
2017-09-20 10:30:25.813803: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-20 10:30:25.813824: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-20 10:30:25.813831: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-20 10:30:25.813836: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-20 10:30:25.813841: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-20 10:30:25.816217: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2017-09-20 10:30:25.816249: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: turing5
2017-09-20 10:30:25.816256: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: turing5
2017-09-20 10:30:25.816282: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.82.0
2017-09-20 10:30:25.816303: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.82 Wed Jul 19 21:16:49 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
"""
2017-09-20 10:30:25.816317: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 375.82.0
2017-09-20 10:30:25.816323: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 375.82.0
INFO:tensorflow:Restoring parameters from ./models/model.ckpt-210003
INFO:tensorflow:Restored model from ./models/model.ckpt-210003
Traceback (most recent call last):
  File "/home/kaihu/ml/tests/seq2seq/bin/infer.py", line 129, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/kaihu/ml/tests/seq2seq/bin/infer.py", line 125, in main
    sess.run([])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 518, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 862, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 818, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 980, in run
    run_metadata=run_metadata))
  File "/home/kaihu/ml/tests/seq2seq/seq2seq/tasks/decode_text.py", line 165, in after_run
    fetches["features.source_tokens"].astype("S"), "utf-8")
  File "/home/kaihu/.local/lib/python3.5/site-packages/numpy/core/defchararray.py", line 505, in decode
    _vec_string(a, object_, 'decode', _clean_args(encoding, errors)))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: unexpected end of data
Upon further investigation I found that the error came from line #164 in decode_text.py:
fetches["features.source_tokens"] = np.char.decode(
    fetches["features.source_tokens"].astype("S"), 'utf-8')
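The failure is easy to reproduce outside of seq2seq. Here is a minimal sketch (the token values are made up to mirror my data, using the bytes of '这' as an example): per-element decoding fails on any token that is a fragment of a multibyte UTF-8 sequence, while joining the bytes first succeeds:
import numpy as np

tokens = np.array([b'\xe8', b'\xbf', b'\x99'])  # the three UTF-8 bytes of '这', one byte per token

try:
    np.char.decode(tokens.astype('S'), 'utf-8')  # per-element decode, as in decode_text.py
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0xe8 in position 0: unexpected end of data

print(b''.join(tokens).decode('utf-8'))  # joining the byte tokens first recovers '这'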
If I change the statement to the following:
fetches["features.source_tokens"] = np.char.decode(
    b''.join(fetches["features.source_tokens"][:-1]), 'utf-8')
then the problem goes away, and the following statement:
print(fetches["features.source_tokens"])
also displays the correct Unicode string from the test dataset. However, the predicted output (i.e., fetches["predicted_tokens"]) contains nothing but b'UNK' tokens, even though training appeared to converge to a small loss of 0.01, and the test data is in fact a subset of the original training data (reused here just to investigate this problem).
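My suspicion about the b'UNK' output, as a toy sketch (the vocab entries and tokens here are hypothetical): if the vocab file holds whole characters but the input pipeline emits single bytes, then no token ever matches a vocab entry, so every lookup falls back to UNK:
# Hypothetical character-level vocab, keyed by byte strings as a lookup table would see them
vocab = {ch.encode('utf-8'): i for i, ch in enumerate('这是')}  # {b'\xe8\xbf\x99': 0, b'\xe6\x98\xaf': 1}
byte_tokens = [b'\xe8', b'\xbf', b'\x99']  # what a per-byte split emits for '这'
print([vocab.get(tok, 'UNK') for tok in byte_tokens])  # ['UNK', 'UNK', 'UNK']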
Has anybody successfully used seq2seq to train a character-level NMT model on a Unicode dataset? The evidence above suggests that the prediction phase could never have worked, and that the training part is also suspect. It is of course possible that I have made a silly mistake somewhere, in which case any advice would be much appreciated.
To answer my own question about why training on character-level multibyte UTF-8 data produces a useless model: I believe the training phase failed due to a bug in TensorFlow. seq2seq.data.split_tokens_decoder.py calls tf.string_split, and per the TensorFlow API docs: "If delimiter is an empty string, each element of the source is split into individual strings, each containing one byte. (This includes splitting multibyte sequences of UTF-8.)" That is exactly the wrong thing to do for character-level data. I also found a TensorFlow pull request from just a few days ago that seems to confirm my reasoning:
#153
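The documented behavior is easy to see directly. A minimal sketch against the TF 1.x API that this repo targets ('这是' is again an arbitrary sample string, not my data):
import tensorflow as tf

# With an empty delimiter, tf.string_split splits into single BYTES, not characters
split = tf.string_split(['这是'], delimiter='')
with tf.Session() as sess:
    print(sess.run(split.values))
    # [b'\xe8' b'\xbf' b'\x99' b'\xe6' b'\x98' b'\xaf']
    # six one-byte tokens for two characters; none is valid UTF-8 on its own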