HRL-RE
Pretraining works fine, but RL training stays at 0 accuracy
Running:
- PyTorch 0.3.1
- Python 3.5.2
RL training doesn't work for me on the NYT10 dataset (I haven't checked the others yet). I first ran pretraining for 10 epochs with:
python main.py --epochPre 10 --numprocess 8 --datapath ../data/NYT10/ --pretrain True
which gets roughly 58 F1 on the test set, and then started RL training with:
python main.py --epochRL 10 --numprocess 8 --start checkpoints/model_HRL_10 --datapath ../data/NYT10/
I stopped RL training after 3 epochs because not only were dev and test set F1 at 0, even the training accuracy was 0. The loss started at around 30, dropped to about -20 after only 60 batches, then slowly increased again and ended up hovering around -0.00005. Checking the optimize() method, all reward arrays contain either straight 0's or negative numbers.
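For reference, this is roughly how I inspected the rewards. The helper below is my own sketch, not code from the repo, and it assumes the reward arrays passed around in optimize() are plain lists or numpy arrays:

import numpy as np

def dump_reward_stats(rewards, tag=""):
    # print min/max/mean and how many entries are positive
    r = np.asarray(rewards, dtype=np.float64)
    print("%s min=%.4f max=%.4f mean=%.4f positive=%d/%d"
          % (tag, r.min(), r.max(), r.mean(), int((r > 0).sum()), r.size))

Every array I dumped this way had zero positive entries.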
Try reducing the learning rate, please. I'm not sure what's going wrong, but it's worth a try.
Thank you for your quick reply. I tried that over the weekend, and a lower learning rate (0.00002) indeed helped a little: the accuracy is not pinned at 0 anymore, but it still stays in the single digits. Shouldn't it already start higher because of the pretraining?
Would it be possible for you to share a pretraining model and a set of hyperparameters that work for you?
I have just rerun the pretraining for the NYT10 dataset with
python3 main.py --datapath ../data/NYT10/ --pretrain True
and get about 62 F1 on the test set. Here's the log output:
epoch 0: dev F1: 0.5301069217782779, test F1: 0.46435134141859613
epoch 1: dev F1: 0.6318181818181818, test F1: 0.5483187471211423
epoch 2: dev F1: 0.6612162616194424, test F1: 0.5576974804985205
epoch 3: dev F1: 0.715510522213562, test F1: 0.6136810144668691
epoch 4: dev F1: 0.7223641817575825, test F1: 0.6080997979443029
epoch 5: dev F1: 0.7285908473040326, test F1: 0.6140748120641246
epoch 6: dev F1: 0.7290673172895641, test F1: 0.6133403731080604
epoch 7: dev F1: 0.7419283010465375, test F1: 0.62323850039883
epoch 8: dev F1: 0.7316315205327415, test F1: 0.6132978723404255
epoch 9: dev F1: 0.7432131731197152, test F1: 0.6178214317317052
epoch 10: dev F1: 0.7410535674594355, test F1: 0.6209842484648929
epoch 11: dev F1: 0.7460694491573352, test F1: 0.6244008320520936
epoch 12: dev F1: 0.7455326849129156, test F1: 0.6203670385030586
epoch 13: dev F1: 0.7360536612632756, test F1: 0.6146158650843222
epoch 14: dev F1: 0.75188138829608, test F1: 0.6181299072091364
Then I train the model using RL with
python3 main.py --lr 2e-5 --datapath ../data/NYT10/ --start checkpoints/model_HRL_10
and the F1 score continues to rise:
epoch 0: dev F1: 0.7637886897835698, test F1: 0.6370854740775339
epoch 1: dev F1: 0.7609631266720949, test F1: 0.6375350140056023
epoch 2: dev F1: 0.7648014859530996, test F1: 0.6340052258305339
The model is still training; only the logs of the first 3 epochs are quoted here.
Environment:
- Python 3.5.2
- PyTorch 0.3.1
I have a similar question. @BenjaminWinter
I train the model with
python main.py --lr 2e-5 --datapath ../data/NYT10/
Dev and test set F1 are 0, and training accuracy is 0 in every training epoch, while the loss keeps declining. After training for 2 epochs, all sentences' top_actions are [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0]
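This is the quick check I used to see that collapse; it is my own snippet (not from the repo) and assumes each sentence's top_actions is a plain list of ints:

def fraction_collapsed(top_actions_per_sentence):
    # fraction of sentences whose top_actions contain no non-zero action
    # after the first step, i.e. the policy only ever predicts the default label 0
    collapsed = sum(1 for acts in top_actions_per_sentence
                    if all(a == 0 for a in acts[1:]))
    return collapsed / max(1, len(top_actions_per_sentence))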
Environment: Python 3.5.2, PyTorch 1.0.1 (I have ported the code to PyTorch 1.0 and no errors have come up yet). @truthless11
@misaki-sysu I have the same issue: test set F1 is 0. Have you solved this problem?
I pre-trained first with a learning rate of 0.00002, but the accuracy was 0 when I evaluated on the test set.
python main.py test --test True checkpoints/model_HRL_10
The result is:
0 148107 5628
0 150425 5713
0 152828 5803
0 154250 5859
test P: 0.0 test R: 0.0 test F1: 0
What could be causing this?
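My only guess so far is that the checkpoint may not actually be loaded: the training commands earlier in this thread pass the model path through --start, so maybe the test run needs something like the following (just a guess based on those commands, not verified against the code):

python main.py --test True --start checkpoints/model_HRL_10 --datapath ../data/NYT10/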
I met the same problem: F1 during training is good, but test set F1 is 0. Have you solved this problem?
What did you change to port the code from 0.3 to 1.0.1? Could you share your modified code? Thank you very much!
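From the PyTorch migration notes I expect the changes are roughly along these lines (made-up function names, not your actual diff), but I would still like to see your version:

import torch

def evaluate(model, inputs):
    # 0.3: inputs were wrapped in Variable(..., volatile=True) for inference.
    # 1.0: Variable is merged into Tensor, so use torch.no_grad() instead.
    with torch.no_grad():
        return model(inputs)

def loss_to_float(loss):
    # 0.3: loss.data[0]
    # 1.0: 0-dim tensors can no longer be indexed, so use loss.item()
    return loss.item()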
@truthless11 I have encountered this problem. What is the reason, and how can I solve it?
THCudaCheck FAIL file=C:\w\1\s\tmp_conda_3.7_055306\conda\conda-bld\pytorch_1556690124416\work\torch/csrc/generic/StorageSharing.cpp line=245 error=63 : OS call failed or operation not supported on this OS
Traceback (most recent call last):
  File "E://HRL-RE-master/code/main.py", line 103, in <module>
    p.start()
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "F:\SoftWare\Anaconda\envs\PyTorch\lib\site-packages\torch\multiprocessing\reductions.py", line 231, in reduce_tensor
    event_sync_required) = storage.share_cuda()
RuntimeError: cuda runtime error (63) : OS call failed or operation not supported on this OS at C:\w\1\s\tmp_conda_3.7_055306\conda\conda-bld\pytorch_1556690124416\work\torch/csrc/generic/StorageSharing.cpp:245
@YiYingsheng Excuse me, have you solved it?