
Error when running yelp/train.py

Open jiwoongim opened this issue 6 years ago • 9 comments

I followed README.md and ran python train.py --data_path ./data

But then I got the following error:

{'dropout': 0.0, 'lr_ae': 1, 'load_vocab': '', 'nlayers': 1, 'batch_size': 64, 'beta1': 0.5, 'gan_gp_lambda': 0.1, 'nhidden': 128, 'vocab_size': 30000, 'niters_gan_schedule': '', 'niters_gan_d': 5, 'lr_gan_d': 0.0001, 'grad_lambda': 0.01, 'sample': False, 'arch_classify': '128-128', 'clip': 1, 'hidden_init': False, 'cuda': True, 'log_interval': 200, 'device_id': '0', 'temp': 1, 'seed': 1111, 'maxlen': 25, 'lowercase': True, 'data_path': './data', 'lambda_class': 1, 'lr_classify': 0.0001, 'outf': 'yelp_example', 'noise_r': 0.1, 'noise_anneal': 0.9995, 'lr_gan_g': 0.0001, 'niters_gan_g': 1, 'arch_g': '128-128', 'z_size': 32, 'epochs': 25, 'niters_ae': 1, 'arch_d': '128-128', 'emsize': 128, 'niters_gan_ae': 1}
Original vocab 9599; Pruned to 9603
Number of sentences dropped from ./data/valid1.txt: 0 out of 38205 total
Number of sentences dropped from ./data/valid2.txt: 0 out of 25278 total
Number of sentences dropped from ./data/train1.txt: 0 out of 267314 total
Number of sentences dropped from ./data/train2.txt: 0 out of 176787 total
Vocabulary Size: 9603
382 batches
252 batches
4176 batches
2762 batches
Loaded data!
Seq2Seq2Decoder(
  (embedding): Embedding(9603, 128)
  (embedding_decoder1): Embedding(9603, 128)
  (embedding_decoder2): Embedding(9603, 128)
  (encoder): LSTM(128, 128, batch_first=True)
  (decoder1): LSTM(256, 128, batch_first=True)
  (decoder2): LSTM(256, 128, batch_first=True)
  (linear): Linear(in_features=128, out_features=9603, bias=True)
)
MLP_G(
  (layer1): Linear(in_features=32, out_features=128, bias=True)
  (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (activation1): ReLU()
  (layer2): Linear(in_features=128, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (activation2): ReLU()
  (layer7): Linear(in_features=128, out_features=128, bias=True)
)
MLP_D(
  (layer1): Linear(in_features=128, out_features=128, bias=True)
  (activation1): LeakyReLU(negative_slope=0.2)
  (layer2): Linear(in_features=128, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (activation2): LeakyReLU(negative_slope=0.2)
  (layer6): Linear(in_features=128, out_features=1, bias=True)
)
MLP_Classify(
  (layer1): Linear(in_features=128, out_features=128, bias=True)
  (activation1): ReLU()
  (layer2): Linear(in_features=128, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (activation2): ReLU()
  (layer6): Linear(in_features=128, out_features=1, bias=True)
)
Training...
Traceback (most recent call last):
  File "train.py", line 574, in <module>
    train_ae(1, train1_data[niter], total_loss_ae1, start_time, niter)
  File "train.py", line 400, in train_ae
    output = autoencoder(whichdecoder, source, lengths, noise=True)
  File "/localhome/imd/anaconda2/envs/Pytorch/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/groups/branson/home/imd/Documents/project/ARAE/yelp/models.py", line 143, in forward
    hidden = self.encode(indices, lengths, noise)
  File "/groups/branson/home/imd/Documents/project/ARAE/yelp/models.py", line 160, in encode
    batch_first=True)
  File "/localhome/imd/anaconda2/envs/Pytorch/lib/python3.5/site-packages/torch/onnx/__init__.py", line 56, in wrapper
    if not might_trace(args):
  File "/localhome/imd/anaconda2/envs/Pytorch/lib/python3.5/site-packages/torch/onnx/__init__.py", line 130, in might_trace
    first_arg = args[0]
IndexError: tuple index out of range

jiwoongim avatar Jul 24 '18 15:07 jiwoongim

Hmm, could you maybe try running it with Python 3?

jakezhaojb avatar Jul 30 '18 11:07 jakezhaojb

I've run into the same issue, with Python 3.5.2 and torch==0.4.1:

Training...     
Traceback (most recent call last):
  File "train.py", line 574, in <module>
    train_ae(1, train1_data[niter], total_loss_ae1, start_time, niter)                                                       
  File "train.py", line 400, in train_ae
    output = autoencoder(whichdecoder, source, lengths, noise=True)
  File "/home/v2john/.pyenv/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__                   
    result = self.forward(*input, **kwargs)
  File "/home/v2john/ARAE/yelp/models.py", line 143, in forward
    hidden = self.encode(indices, lengths, noise)                                                                            
  File "/home/v2john/ARAE/yelp/models.py", line 160, in encode
    batch_first=True)
  File "/home/v2john/.pyenv/lib/python3.5/site-packages/torch/onnx/__init__.py", line 67, in wrapper                         
    if not might_trace(args):
  File "/home/v2john/.pyenv/lib/python3.5/site-packages/torch/onnx/__init__.py", line 141, in might_trace
    first_arg = args[0]                                                                                                      
IndexError: tuple index out of range

Python 3 clearly isn't the fix. It seems like something about the PyTorch + ONNX interop is broken. Is there a specific version of PyTorch that's needed to run this?

vineetjohn avatar Aug 17 '18 17:08 vineetjohn

@jiwoongim

You can try using my forked version of the repository to see if it fixes the issue for you. I've verified it works with Python 3.5.2 and PyTorch 0.4.1: https://github.com/vineetjohn/arae

I haven't identified the root cause yet, but I've added a workaround that avoids going through ONNX altogether. The pack_padded_sequence function in torch.nn.utils.rnn seems to be buggy in this version.
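Roughly, the failure and the workaround look like this (a standalone sketch with dummy tensors, not the project's actual code):

# Standalone sketch of the torch 0.4.1 crash around pack_padded_sequence.
import torch
from torch.nn.utils.rnn import pack_padded_sequence

embeddings = torch.zeros(2, 5, 128)   # (batch, seq_len, emsize), dummy data
lengths = [5, 3]                      # sequence lengths, sorted descending

# A keyword-only call like the one in models.py, e.g.
#   pack_padded_sequence(input=embeddings, lengths=lengths, batch_first=True)
# goes through torch.onnx's tracing wrapper, which reads args[0] and raises
# "IndexError: tuple index out of range" when no positional args are given.

# Workaround: pass the tensor (and lengths) positionally so args[0] exists.
packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
print(packed.data.shape)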

vineetjohn avatar Aug 19 '18 01:08 vineetjohn

Guys, can you try Python 3.6? @jiwoongim @vineetjohn

jakezhaojb avatar Aug 24 '18 15:08 jakezhaojb

@jiwoongim You can try using my forked version of the repository; I resolved the issue by making several changes to the original code. I've verified it works with Python 3.6.5 and PyTorch 0.4.1: https://github.com/rainyrainyguo/ARAE

rainyrainyguo avatar Aug 24 '18 23:08 rainyrainyguo

@jakezhaojb

This doesn't look like a Python version issue. The named arguments used in this project are inconsistent with those accepted by PyTorch 0.4.1.

You should consider adding the PyTorch version used for your experiments to the project README.

vineetjohn avatar Aug 28 '18 14:08 vineetjohn

@vineetjohn Good point! I used PyTorch 0.3.1. I'm adding this to the README.
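One lightweight way to make the requirement explicit in the meantime, a hypothetical guard that is not currently in train.py:

# Hypothetical version guard for the top of train.py; the scripts here were
# developed against PyTorch 0.3.1, and 0.4+ API changes break several calls.
import torch

if not torch.__version__.startswith('0.3.'):
    raise RuntimeError(
        'This code was written for PyTorch 0.3.1; found %s. '
        'See the README for the tested setup.' % torch.__version__)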

jakezhaojb avatar Aug 30 '18 18:08 jakezhaojb

@rainyrainyguo I ran your forked version with Python 3.6.5 and PyTorch 0.4.1 (cuDNN 7.1.3, CUDA toolkit 8.0) and got the following error:

Training...
run_oneb.py:256: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
run_oneb.py:259: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
run_oneb.py:263: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
| epoch 1 | 0/ 765 batches | ms/batch 0.61 | loss 0.05 | ppl 1.05 | acc 0.00
Traceback (most recent call last):
  File "run_oneb.py", line 102, in <module>
    exec(open("train.py").read())
  File "<string>", line 434, in <module>
  File "<string>", line 395, in train
  File "<string>", line 324, in train_gan_d
  File "/home/thindv/anaconda3/envs/ARAE/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/thindv/anaconda3/envs/ARAE/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [] but got [1]
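If it helps, my guess is that train_gan_d passes a one-element tensor to backward() where PyTorch 0.4 expects a 0-dim scalar gradient. A standalone illustration of that mismatch (made-up names, not the project's actual code):

# Standalone illustration of the backward() gradient shape mismatch.
import torch

loss = torch.randn(4, requires_grad=True).mean()   # 0-dim scalar, like a WGAN critic loss

grad = torch.FloatTensor([1])   # shape [1]: on 0.4.x this raises
                                # "invalid gradient at index 0 - expected shape [] but got [1]"
# loss.backward(grad)

grad = torch.tensor(1.0)        # 0-dim scalar gradient matches the 0-dim loss
loss.backward(grad)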

Can you give me some advice?

dangvanthin avatar Dec 22 '18 04:12 dangvanthin

@dangvanthin Hi, I ran into the same problem. Have you found a solution yet? Thank you.

V-Enzo avatar Mar 12 '20 10:03 V-Enzo