HRL-RE

Pretraining works fine, but RL training stays at 0 accuracy

Open BenjaminWinter opened this issue 6 years ago • 10 comments

Running:

  • Pytorch 0.3.1
  • Python 3.5.2

The RL training doesn't work for me on the NYT10 dataset (haven't checked the others yet). I first ran pretraining for 10 epochs with:

python main.py --epochPre 10 --numprocess 8 --datapath ../data/NYT10/ --pretrain True

which got roughly 58 F1 on the test set, and then afterwards tried the RL training with:

python main.py --epochRL 10 --numprocess 8 --start checkpoints/model_HRL_10 --datapath ../data/NYT10/

I stopped RL training after 3 epochs because not only were dev and test set F1 at 0, even the training accuracy was 0. The loss started at around 30, dropped to about -20 after only 60 batches, then slowly increased again and ended up hovering around -0.00005. Checking the optimize() method, all reward arrays contain either straight 0's or negative numbers.
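For reference, this is roughly how I inspected the rewards inside optimize(). Just a quick sketch with illustrative names (summarize_rewards and reward_arrays are mine, not the repo's):

import numpy as np

# Rough diagnostic sketch: print summary statistics of the rewards collected
# for one batch, to see whether they are all zero or negative.
def summarize_rewards(reward_arrays, step):
    flat = np.concatenate([np.asarray(r, dtype=np.float64).ravel()
                           for r in reward_arrays])
    nonzero = 100.0 * np.count_nonzero(flat) / flat.size
    print("step %d | reward min %.4f max %.4f mean %.4f | nonzero %.1f%%"
          % (step, flat.min(), flat.max(), flat.mean(), nonzero))

# Example call with dummy data:
summarize_rewards([[0.0, -0.5], [0.0, 0.0, -1.0]], step=60)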

BenjaminWinter avatar Nov 30 '18 16:11 BenjaminWinter

Please try reducing the learning rate. I'm not sure what's going wrong, but it's worth a try.
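For example, something like this (assuming the --lr flag is what sets the RL learning rate):

python main.py --epochRL 10 --numprocess 8 --lr 2e-5 --start checkpoints/model_HRL_10 --datapath ../data/NYT10/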

keavil avatar Dec 01 '18 09:12 keavil

Thank you for your quick reply. I tried that over the weekend, and a lower learning rate (0.00002) indeed helped a little bit. The accuracy is no longer pinned at 0, but it still stays in the single digits. Shouldn't it already start higher because of the pretraining?

Would it be possible for you to share a pretraining model and a set of hyperparameters that work for you?

BenjaminWinter avatar Dec 03 '18 09:12 BenjaminWinter

I have just rerun the pretraining for the NYT10 dataset with

python3 main.py --datapath ../data/NYT10/ --pretrain True

and got about 62 F1 on the test set. Here's the log output:

epoch 0: dev F1: 0.5301069217782779, test F1: 0.46435134141859613
epoch 1: dev F1: 0.6318181818181818, test F1: 0.5483187471211423
epoch 2: dev F1: 0.6612162616194424, test F1: 0.5576974804985205
epoch 3: dev F1: 0.715510522213562, test F1: 0.6136810144668691
epoch 4: dev F1: 0.7223641817575825, test F1: 0.6080997979443029
epoch 5: dev F1: 0.7285908473040326, test F1: 0.6140748120641246
epoch 6: dev F1: 0.7290673172895641, test F1: 0.6133403731080604
epoch 7: dev F1: 0.7419283010465375, test F1: 0.62323850039883
epoch 8: dev F1: 0.7316315205327415, test F1: 0.6132978723404255
epoch 9: dev F1: 0.7432131731197152, test F1: 0.6178214317317052
epoch 10: dev F1: 0.7410535674594355, test F1: 0.6209842484648929
epoch 11: dev F1: 0.7460694491573352, test F1: 0.6244008320520936
epoch 12: dev F1: 0.7455326849129156, test F1: 0.6203670385030586
epoch 13: dev F1: 0.7360536612632756, test F1: 0.6146158650843222
epoch 14: dev F1: 0.75188138829608, test F1: 0.6181299072091364

Then I train the model using RL with

python3 main.py --lr 2e-5 --datapath ../data/NYT10/ --start checkpoints/model_HRL_10

and the F1 score continues to rise:

epoch 0: dev F1: 0.7637886897835698, test F1: 0.6370854740775339
epoch 1: dev F1: 0.7609631266720949, test F1: 0.6375350140056023
epoch 2: dev F1: 0.7648014859530996, test F1: 0.6340052258305339

The model is still training; only the logs of the first 3 epochs are quoted here.

Environment:

  • Python 3.5.2
  • Pytorch 0.3.1

truthless11 avatar Dec 06 '18 02:12 truthless11

Similar question. @BenjaminWinter

Train the model with python main.py --lr 2e-5 --datapath ../data/NYT10/

Dev and test set F1 are 0, and training accuracy is 0 in every training epoch, while the loss keeps declining. After training 2 epochs, all sentences' top_actions are [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0].
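A quick way to see the collapse is to count how often each top-level action id gets predicted. Rough sketch only, with illustrative names (all_top_actions here stands for the list of per-sentence action sequences, not the repo's exact variable):

from collections import Counter

# Rough diagnostic sketch (illustrative, not the repo's exact code): histogram
# of predicted top-level action ids across all sentences.
def action_histogram(all_top_actions):
    counts = Counter(a for actions in all_top_actions for a in actions)
    total = sum(counts.values())
    for action_id, n in sorted(counts.items()):
        print("action %d: %d (%.1f%%)" % (action_id, n, 100.0 * n / total))

# Example: a collapsed policy prints almost everything as action 0.
action_histogram([[4, 0, 0, 0], [4, 0, 0, 0, 0]])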

Environment: Python 3.5.2, PyTorch 1.0.1 (I changed the code to PyTorch 1.0 and no errors have come up so far). @truthless11
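Roughly, the edits are the standard PyTorch 0.3 → 1.0 API updates. A small sketch of the pattern (not the exact diff of this code):

import torch

# Typical PyTorch 0.3 -> 1.0 changes (sketch, not the exact edits to this repo):
# the Variable wrapper is gone, .data[0] becomes .item(), and volatile=True
# is replaced by torch.no_grad() at evaluation time.
x = torch.randn(3, 3, requires_grad=True)   # was: Variable(torch.randn(3, 3), requires_grad=True)
loss = (x ** 2).sum()
loss_value = loss.item()                    # was: loss.data[0]
with torch.no_grad():                       # was: Variable(..., volatile=True)
    eval_out = (x * 2).sum()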

misaki-sysu avatar Mar 14 '19 02:03 misaki-sysu

@misaki-sysu I have the same problem: test set F1 is 0. Have you solved it?

WJYw avatar Jun 04 '19 13:06 WJYw

I pre-trained first, with a learning rate of 0.00002, but the accuracy was 0 when I evaluated on the test set with

python main.py test --test True checkpoints/model_HRL_10

The result is:

0 148107 5628
0 150425 5713
0 152828 5803
0 154250 5859
test P: 0.0 test R: 0.0 test F1: 0

What could be causing this?

WJYw avatar Jun 05 '19 11:06 WJYw

I have the same problem: F1 during training is good, but test set F1 is 0. Have you solved this problem?

yin-hong avatar Dec 06 '19 02:12 yin-hong

Similar question. @BenjaminWinter

Train the model with python main.py --lr 2e-5 --datapath ../data/NYT10/

Dev and test set F1 are 0, and training accuracy is 0 in every training epoch, while the loss keeps declining. After training 2 epochs, all sentences' top_actions are [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0].

Environment: Python 3.5.2, PyTorch 1.0.1 (I changed the code to PyTorch 1.0 and no errors have come up so far). @truthless11

What changes did you make to the code to go from 0.3 to 1.0.1? Could you share your rewritten code? Thank you very much!

Yangzhenping520 avatar Jun 27 '20 13:06 Yangzhenping520

@truthless11 I have encountered this problem. What is the reason, and how can I solve it?

THCudaCheck FAIL file=C:\w\1\s\tmp_conda_3.7_055306\conda\conda-bld\pytorch_1556690124416\work\torch/csrc/generic/StorageSharing.cpp line=245 error=63 : OS call failed or operation not supported on this OS
Traceback (most recent call last):
  File "E://HRL-RE-master/code/main.py", line 103, in <module>
    p.start()
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\site-packages\torch\multiprocessing\reductions.py", line 231, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (63) : OS call failed or operation not supported on this OS at C:\w\1\s\tmp_conda_3.7_055306\conda\conda-bld\pytorch_1556690124416\work\torch/csrc/generic/StorageSharing.cpp:245
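From the traceback, my understanding is that the failure happens when a CUDA tensor is handed to a worker process: CUDA IPC is not supported on Windows, so tensors shared through torch.multiprocessing have to stay on the CPU, or the run has to use a single process (e.g. --numprocess 1). A minimal illustration of the safe pattern, not this repo's code:

import torch
import torch.multiprocessing as mp

def worker(shared_weights):
    # Move to the GPU only inside the child process; the shared tensor itself
    # stays on the CPU, so no CUDA IPC is needed.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    w = shared_weights.to(device)
    print("worker sum:", w.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)   # Windows always spawns workers
    weights = torch.randn(4, 4)                # CPU tensor is safe to share
    p = mp.Process(target=worker, args=(weights,))
    p.start()
    p.join()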

YiYingsheng avatar Jul 07 '20 08:07 YiYingsheng

@YiYingsheng Excuse me, have you solved it?

xxxxxi-gg avatar Jul 20 '20 08:07 xxxxxi-gg