Mismatch between pretrained weights and imdb data?
First, I ran `./download.sh` and `wget http://sato-motoki.com/research/vat/imdb_pretrained_lm_ijcai.model`, followed by the iVat train command from README.md. I've attached the output below. It seems that vocab_inv (87318) is larger than the vocabulary the pretrained model was built with (86935 rows, according to the error).
What is the best way to fix this?
Thanks!
train_set:71246
avg word number:244.2789911012548
vocab:87318
avg word number (train_x): 243.84721829991528
avg word number (dev_x):241.3660095897709
avg word number (test_x):236.99672
lm_words_num:17397769
train_vocab_size: 67054
vocab_inv: 87318
Traceback (most recent call last):
File "train.py", line 427, in <module>
main()
File "train.py", line 181, in main
serializers.load_npz(args.pretrained_model, pretrain_model)
File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/serializers/npz.py", line 190, in load_npz
d.load(obj)
File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/serializer.py", line 83, in load
obj.serialize(self)
File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/link.py", line 1001, in serialize
d[name].serialize(serializer[name])
File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/link.py", line 651, in serialize
data = serializer(name, param.data)
File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/serializers/npz.py", line 150, in __call__
numpy.copyto(value, dataset)
ValueError: could not broadcast input array from shape (86935,256) into shape (87318,256)
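One way to confirm the mismatch independently of train.py is to look at the array shapes stored in the pretrained file. Chainer's `save_npz` writes an ordinary NumPy archive, so `numpy.load` should be able to open it. This is only a minimal sketch: `npz_shapes` is a made-up helper name, and the parameter names inside the file are not known here, so it simply lists whatever it finds.

```python
import numpy as np

def npz_shapes(path):
    """Return {array_name: shape} for every array stored in an .npz archive."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}

# Example use (path taken from the wget command in this thread):
# for name, shape in sorted(npz_shapes("imdb_pretrained_lm_ijcai.model").items()):
#     print(name, shape)
```

If the embedding array in the file has 86935 rows while the preprocessing reports `vocab_inv: 87318`, the local preprocessing produced a different vocabulary than the one the model was trained with, which is what the `ValueError` above suggests.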
Could you fix the problem somehow? I seem to run into the same problem with the pretrained weights.

At first I was getting an encoding error during preprocessing. After I switched to utf-8 encoding, the preprocessing worked fine, but then I get the same error as you while loading the pretrained weights. The first traceback below is the encoding error; the output after it is from training after the switch.

It isn't specified anywhere in the code, but was the data used for the pretrained weights preprocessed with an encoding other than utf-8? Thanks for the interest.
Prepare for IMDB
Prepare script is running...
Traceback (most recent call last):
File "preprocess.py", line 79, in <module>
prepare_imdb()
File "preprocess.py", line 55, in prepare_imdb
imdb_validation_pos_start_id)
File "preprocess.py", line 24, in load_file
words = read_text(filename.strip())
File "preprocess.py", line 11, in read_text
for line in f:
File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 399: ordinal not in range(128)
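The traceback above shows `open()` decoding with the ASCII codec, which is what Python 3 uses when the locale's preferred encoding is ASCII and no `encoding` argument is given. In UTF-8, byte 0xc2 is the lead byte of a two-byte sequence, which is consistent with the file being UTF-8. Below is a minimal sketch of reading with an explicit encoding. `read_lines_utf8` is a hypothetical helper, not the repository's `read_text`, and whether the pretrained vocabulary was built this way is not known.

```python
import io

def read_lines_utf8(filename, errors="strict"):
    """Read a text file as UTF-8 regardless of the locale's default encoding."""
    # errors="strict" raises on invalid bytes; "ignore" or "replace" would
    # silently change the text and therefore the resulting vocabulary.
    with io.open(filename, encoding="utf-8", errors=errors) as f:
        return [line.rstrip("\n") for line in f]
```

Passing the encoding explicitly makes the result independent of environment variables such as `LANG`, so two machines should at least decode the same bytes the same way.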
train_set:71246
avg word number:242.8615501221121
vocab:87008
avg word number (train_x): 242.43914148545608
avg word number (dev_x):239.861747469366
avg word number (test_x):235.59372
lm_words_num:17297560
train_vocab_size: 66825
vocab_inv: 87008
Traceback (most recent call last):
File "train.py", line 427, in <module>
main()
File "train.py", line 181, in main
serializers.load_npz(args.pretrained_model, pretrain_model)
File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/serializers/npz.py", line 242, in load_npz
d.load(obj)
File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/serializer.py", line 83, in load
obj.serialize(self)
File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/link.py", line 1036, in serialize
d[name].serialize(serializer[name])
File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/link.py", line 1033, in serialize
super(Chain, self).serialize(serializer)
File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/link.py", line 655, in serialize
data = serializer(name, param.data) # type: types.NdArray
File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/serializers/npz.py", line 184, in __call__
numpy.copyto(value, dataset)
ValueError: could not broadcast input array from shape (76935,64) into shape (77008,64)
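The two runs in this thread report different vocabulary sizes (87318 and 87008), and neither matches the pretrained file, so the vocabulary apparently depends on how the raw text is decoded. Here is a toy sketch of that effect, assuming plain whitespace tokenization (the repository's actual tokenization is not shown in this thread). `vocab_size` is a hypothetical helper.

```python
def vocab_size(raw_docs, encoding, errors="strict"):
    """Count distinct whitespace-separated tokens after decoding raw bytes."""
    vocab = set()
    for raw in raw_docs:
        vocab.update(raw.decode(encoding, errors=errors).split())
    return len(vocab)

# Two toy "reviews"; the first contains a UTF-8 encoded e-acute (0xc3 0xa9).
docs = [b"caf\xc3\xa9 bar", b"caf bar"]

print(vocab_size(docs, "utf-8"))                    # 3: "café", "caf", "bar"
print(vocab_size(docs, "ascii", errors="ignore"))   # 2: "caf", "bar"
```

Dropping undecodable bytes merges words that strict UTF-8 decoding keeps apart, so the same corpus can yield vocabularies of different sizes. That would be enough to make an embedding matrix built from one vocabulary incompatible with weights trained on another.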