
ValueError: could not broadcast input array from shape (86935,256) into shape (87008,256)

p-null opened this issue 6 years ago · 5 comments

Hi, when I try to run download.sh, I get the following error:

Prepare for IMDB
Prepare script is running...
Traceback (most recent call last):
  File "preprocess.py", line 79, in <module>
    prepare_imdb()
  File "preprocess.py", line 55, in prepare_imdb
    imdb_validation_pos_start_id)
  File "preprocess.py", line 24, in load_file
    words = read_text(filename.strip())
  File "preprocess.py", line 11, in read_text
    for line in f:
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 399: ordinal not in range(128)

Then I added encoding='utf-8' to every with open() call in preprocess.py.
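For context, a minimal sketch of that fix (the read_text name comes from the traceback; the body below is an assumption about what it does, not the repo's actual code). The default locale encoding here was ASCII, so any non-ASCII byte such as 0xc2 crashed the read; passing an explicit encoding avoids that:

```python
# Sketch: open the IMDB text files with an explicit UTF-8 encoding
# instead of relying on the locale default (ASCII in this environment).
# errors='ignore' additionally skips undecodable bytes rather than raising.
def read_text(filename):
    words = []
    with open(filename, encoding='utf-8', errors='ignore') as f:
        for line in f:
            words.extend(line.split())
    return words
```

Note that errors='ignore' silently drops bytes, which could change tokenization; encoding='utf-8' alone is the safer first step.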

After that, I get the following error:

Namespace(adaptive_softmax=1, add_labeld_to_unlabel=1, alpha=0.001, alpha_decay=0.9998, batchsize=32, batchsize_semi=96, clip=5.0, dataset='imdb', debug_mode=0, dropout=0.5, emb_dim=256, eval=0, freeze_word_emb=0, gpu=0, hidden_cls_dim=30, hidden_dim=1024, ignore_unk=1, load_trained_lstm='', lower=0, min_count=1, n_class=2, n_epoch=30, n_layers=1, nl_factor=1.0, norm_sentence_level=1, pretrained_model='imdb_pretrained_lm.model', random_seed=1234, save_name='imdb_model_vat', use_adv=0, use_exp_decay=1, use_rational=0, use_semi_data=1, use_unlabled_to_vocab=1, word_only=0, xi_var=5.0, xi_var_first=1.0)
train_set:71246
avg word number:242.8615501221121
vocab:87008
avg word number (train_x): 242.43914148545608
avg word number (dev_x):239.861747469366
avg word number (test_x):235.59372
lm_words_num:17297560
train_vocab_size: 66825
vocab_inv: 87008
Traceback (most recent call last):
  File "train.py", line 354, in <module>
    main()
  File "train.py", line 164, in main
    serializers.load_npz(args.pretrained_model, pretrain_model)
  File "/usr/local/lib/python3.6/dist-packages/chainer/serializers/npz.py", line 190, in load_npz
    d.load(obj)
  File "/usr/local/lib/python3.6/dist-packages/chainer/serializer.py", line 83, in load
    obj.serialize(self)
  File "/usr/local/lib/python3.6/dist-packages/chainer/link.py", line 997, in serialize
    d[name].serialize(serializer[name])
  File "/usr/local/lib/python3.6/dist-packages/chainer/link.py", line 651, in serialize
    data = serializer(name, param.data)
  File "/usr/local/lib/python3.6/dist-packages/chainer/serializers/npz.py", line 150, in __call__
    numpy.copyto(value, dataset)
ValueError: could not broadcast input array from shape (86935,256) into shape (87008,256)

I suspect my change to the decoding drops some lines from the files, which changes the vocabulary size? Could you suggest a workaround for this issue?

p-null avatar Nov 20 '18 21:11 p-null
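One way to narrow this down: a Chainer .npz snapshot is an ordinary NumPy archive, so the embedding shape stored in the pretrained model can be inspected directly and compared with the local vocab count (87008 above). This is a hypothetical diagnostic sketch, not part of the repo; it assumes the embedding is the 2-D array with emb_dim=256 columns:

```python
import numpy as np

# Sketch: find the 2-D parameter with 256 columns (emb_dim=256 per the
# Namespace dump above) inside the pretrained .npz and report its row
# count, i.e. the vocabulary size the model was trained with.
def embedding_rows(npz_path):
    with np.load(npz_path) as d:
        for name in d.files:
            if d[name].ndim == 2 and d[name].shape[1] == 256:
                return name, d[name].shape[0]
    return None, None
```

If this reports 86935 rows while preprocessing reports a vocab of 87008, the preprocessed vocabulary simply does not match the one used at pretraining time.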

Thank you for your report!

Could you try the following commands?

$ cd data/imdb/
$ wget http://sato-motoki.com/research/vat/imdb_list.zip
$ unzip imdb_list.zip

aonotas avatar Nov 22 '18 04:11 aonotas

Hi, I still get the same error. To help reproduce it, I uploaded a notebook here. Thanks!

p-null avatar Nov 22 '18 21:11 p-null

Thank you for your notebook.

$ cd data/imdb/
$ wget http://sato-motoki.com/research/vat/imdb_list.zip
$ unzip imdb_list.zip

Then add encoding='utf-8' to every with open() call in preprocess.py.

Please let me know the result!

aonotas avatar Nov 24 '18 01:11 aonotas

> Hi, I still get the same error. To help reproduce it, I uploaded a notebook here. Thanks!

Did you fix it? If so, may I know how?

longquan0104 avatar Jun 16 '19 16:06 longquan0104

No progress on this one? If I train my own LM, it seems to load the weights fine, but I couldn't make it work with the pretrained weights.

dcetin avatar Jun 22 '19 11:06 dcetin
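For anyone still blocked: a brute-force sketch (not from the repo, and only valid if the first 86935 word ids in the local vocabulary happen to match the ones used at pretraining time, which is not guaranteed) is to pad the stored embedding to the local vocab size so serializers.load_npz stops failing on the shape mismatch. The real fix is making preprocessing reproduce the exact vocabulary the model was pretrained with:

```python
import numpy as np

# CAUTION: this only makes the shapes match. Rows copied from the
# pretrained file are semantically wrong for any word id whose mapping
# differs from pretraining. All other arrays are passed through unchanged.
def pad_embeddings(src_npz, dst_npz, target_rows):
    arrays = dict(np.load(src_npz))
    for name, arr in arrays.items():
        if arr.ndim == 2 and arr.shape[0] < target_rows:
            pad = np.zeros((target_rows - arr.shape[0], arr.shape[1]),
                           dtype=arr.dtype)
            arrays[name] = np.vstack([arr, pad])
    np.savez(dst_npz, **arrays)
```

Newly added rows start as zeros, so out-of-vocabulary words carry no pretrained signal until fine-tuning updates them.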