bert File "run_classifier.py", line 326, in _create_examples text_b = tokenization.convert_to_unicode(line[4]) IndexError: list index out of range

I am trying to use a custom dataset (similar to MRPC) to fine-tune the BERT model. I am running this command:

 python run_classifier.py \
   --task_name=mrpc \
   --do_train=true \
   --do_eval=true \
   --data_dir=$GLUE_DIR \
   --use_gpu=False \
   --vocab_file=$BERT_BASE_DIR/vocab.txt \
   --bert_config_file=$BERT_BASE_DIR/bert_config.json \
   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
   --max_seq_length=128 \
   --train_batch_size=32 \
   --learning_rate=2e-5 \
   --num_train_epochs=3.0 \
   --output_dir=/tmp/mrpc_output/

and I get the following error:

 Traceback (most recent call last):
   File "run_classifier.py", line 981, in <module>
     tf.app.run()
   File "/home/kddilabs/miniconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
     _sys.exit(main(argv))
   File "run_classifier.py", line 842, in main
     train_examples = processor.get_train_examples(FLAGS.data_dir)
   File "run_classifier.py", line 302, in get_train_examples
     self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
   File "run_classifier.py", line 326, in _create_examples
     text_b = tokenization.convert_to_unicode(line[4])
 IndexError: list index out of range

My other custom datasets ran without any issue; I get this error only after increasing the size of the dataset. What could be the reason, and how can I fix it?

Jun 26 '19 06:06 ishita-gupta98

Hi, I had the same error. It went away once I removed the line breaks inside the fields of the file. In Python I do:

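 import pandas as pd

 # Read the 5-column TSV; quotechar='"' makes pandas treat a quoted field as a single value.
 df_train = pd.read_csv("data/train.tsv", header=None, sep="\t", encoding="UTF-8", quotechar='"')
 # Keep columns 0-3 as they are and replace line breaks in column 4 with spaces.
 df_bert_train = pd.DataFrame({'0': df_train[0],
                               '1': df_train[1],
                               '2': df_train[2],
                               '3': df_train[3],
                               '4': df_train[4].replace(r'\n', ' ', regex=True)})
 # Overwrite the file without index or header so the column layout stays the same.
 df_bert_train.to_csv("data/train.tsv", sep="\t", index=False, header=False, encoding="UTF-8")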

Hope this helps. L.
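To check whether stray line breaks are really the cause before rewriting the file, a minimal sketch along these lines may help. It is not part of the original reply: the function name `find_short_rows` and the `min_cols` parameter are made up for illustration, and it assumes the training script reads the file as plain tab-separated text with quoting disabled, so that a line break inside a quoted field splits one record into two short rows.

```python
import csv


def find_short_rows(path, min_cols=5):
    """Return (line_number, row) pairs for rows with fewer than min_cols fields.

    min_cols=5 matches the traceback above, where line[4] is accessed.
    """
    bad = []
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: quotes get no special treatment, so a line break inside
        # a quoted field ends the row early (assumed to mirror the training script).
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < min_cols:
                # reader.line_num is the number of source lines read so far.
                bad.append((reader.line_num, row))
    return bad


if __name__ == "__main__":
    # "data/train.tsv" is the path used in the reply above.
    for line_num, row in find_short_rows("data/train.tsv"):
        print(line_num, row)
```

Any row it reports would raise the same IndexError when the script reaches it, which could also explain why the problem only shows up once a larger dataset brings in a few malformed rows.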

Aug 28 '19 13:08 luc-kalaora

Hey, your advice really worked. It solved my problem perfectly.

Mar 07 '20 14:03 ZYMirror

Your advice worked! Thanks!

Jun 10 '20 09:06 Xingyuzhao3

I have the same error, but the code above did not fix it. One more thing: my error involves split_line:

 text_a = tokenization.convert_to_unicode(split_line[1])
 IndexError: list index out of range
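An IndexError on split_line[1] means at least one row has fewer than two fields. Possible causes include blank lines or a delimiter other than a tab, though that is only a guess without seeing the data. As a rough sketch for narrowing it down (the function name `column_count_histogram` is made up for illustration, and it assumes a tab-separated file read without quote handling), one could count how many rows have each number of fields:

```python
import csv
from collections import Counter


def column_count_histogram(path, delimiter="\t"):
    """Map each field count to the number of rows that have that many fields."""
    with open(path, newline="", encoding="utf-8") as f:
        # Blank lines come back from csv.reader as empty lists, so they count as 0 fields.
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        return Counter(len(row) for row in reader)


if __name__ == "__main__":
    # "data/train.tsv" is a placeholder path; use the file your script reads.
    for n_fields, n_rows in sorted(column_count_histogram("data/train.tsv").items()):
        print(f"{n_fields} fields: {n_rows} rows")
```

If nearly every row lands in the 1-field bucket, the delimiter is probably wrong; a handful of rows with 0 or 1 fields points to blank or broken lines.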

Jun 12 '21 20:06 talhach65

Can you tell me where to put this code?

Jul 19 '21 08:07 franklee24

You need to replace the line breaks (r'\n') in your data with ' ', as the .replace(r'\n', ' ', regex=True) call in his example does.

Jul 22 '21 07:07 Xingyuzhao3
