transformer-xl
How to train models with attn_type=2 on the wiki103 training set?
I want to train a model with attn_type=2, and here is my configuration:
Experiment dir : wt103_workdir/-wt103/20190121-201645
Loading cached dataset...
- data : ../data/wikitext-103/
- dataset : wt103
- n_layer : 16
- n_head : 10
- d_head : 41
- d_embed : 410
- d_model : 410
- d_inner : 2100
- dropout : 0.1
- dropatt : 0.0
- init : normal
- emb_init : normal
- init_range : 0.1
- emb_init_range : 0.01
- init_std : 0.02
- proj_init_std : 0.01
- optim : adam
- lr : 0.00025
- mom : 0.0
- scheduler : cosine
- warmup_step : 0
- decay_rate : 0.5
- lr_min : 0.0
- clip : 0.25
- clip_nonemb : False
- max_step : 200000
- batch_size : 60
- batch_chunk : 1
- tgt_len : 150
- eval_tgt_len : 150
- ext_len : 0
- mem_len : 0
- not_tied : False
- seed : 1111
- cuda : True
- adaptive : True
- div_val : 1
- pre_lnorm : False
- varlen : False
- multi_gpu : True
- log_interval : 200
- eval_interval : 4000
- work_dir : wt103_workdir/-wt103/20190121-201645
- restart : False
- restart_dir :
- debug : False
- same_length : False
- attn_type : 2
- clamp_len : -1
- eta_min : 0.0
- gpu0_bsz : 4
- max_eval_steps : -1
- sample_softmax : -1
- patience : 0
- finetune_v2 : False
- finetune_v3 : False
- fp16 : False
- static_loss_scale : 1
- dynamic_loss_scale : False
- tied : True
- n_token : 267735
- n_all_param : 148417118
- n_nonemb_param : 38376800
But it seems to diverge. Can anyone give me some advice? Thanks very much!
| epoch 1 step 200 | 200 batches | lr 0.00025 | ms/batch 702.80 | loss 7.64 | ppl 2088.928
| epoch 1 step 400 | 400 batches | lr 0.00025 | ms/batch 621.46 | loss 7.46 | ppl 1730.912
| epoch 1 step 600 | 600 batches | lr 0.00025 | ms/batch 621.39 | loss 7.45 | ppl 1728.002
| epoch 1 step 800 | 800 batches | lr 0.00025 | ms/batch 621.19 | loss 7.45 | ppl 1717.449
| epoch 1 step 1000 | 1000 batches | lr 0.00025 | ms/batch 621.40 | loss 7.45 | ppl 1720.351
| epoch 1 step 1200 | 1200 batches | lr 0.00025 | ms/batch 620.97 | loss 7.44 | ppl 1700.753
| epoch 1 step 1400 | 1400 batches | lr 0.00025 | ms/batch 620.69 | loss 7.44 | ppl 1695.095
| epoch 1 step 1600 | 1600 batches | lr 0.00025 | ms/batch 635.92 | loss 7.44 | ppl 1711.255
| epoch 1 step 1800 | 1800 batches | lr 0.00025 | ms/batch 620.47 | loss 7.44 | ppl 1710.397
| epoch 1 step 2000 | 2000 batches | lr 0.00025 | ms/batch 619.97 | loss 7.44 | ppl 1695.020
| epoch 1 step 2200 | 2200 batches | lr 0.00025 | ms/batch 620.45 | loss 7.43 | ppl 1690.592
| epoch 1 step 2400 | 2400 batches | lr 0.00025 | ms/batch 620.02 | loss 7.44 | ppl 1702.485
| epoch 1 step 2600 | 2600 batches | lr 0.00025 | ms/batch 620.64 | loss 7.43 | ppl 1689.785
| epoch 1 step 2800 | 2800 batches | lr 0.00025 | ms/batch 619.92 | loss 7.43 | ppl 1693.790
| epoch 1 step 3000 | 3000 batches | lr 0.00025 | ms/batch 620.23 | loss 7.43 | ppl 1684.638
| epoch 1 step 3200 | 3200 batches | lr 0.00025 | ms/batch 619.65 | loss 7.42 | ppl 1666.079
| epoch 1 step 3400 | 3400 batches | lr 0.00025 | ms/batch 619.42 | loss 7.41 | ppl 1659.356
| epoch 1 step 3600 | 3600 batches | lr 0.00025 | ms/batch 619.80 | loss 7.41 | ppl 1649.053
| epoch 1 step 3800 | 3800 batches | lr 0.00025 | ms/batch 620.29 | loss 7.45 | ppl 1711.666
| epoch 1 step 4000 | 4000 batches | lr 0.00025 | ms/batch 620.61 | loss 7.42 | ppl 1661.076
| Eval 1 at step 4000 | time: 2506.77s | valid loss 7.41 | valid ppl 1654.046
| epoch 1 step 4200 | 4200 batches | lr 0.00025 | ms/batch 662.27 | loss 7.43 | ppl 1682.745
| epoch 1 step 4400 | 4400 batches | lr 0.00025 | ms/batch 619.80 | loss 7.42 | ppl 1672.333
| epoch 1 step 4600 | 4600 batches | lr 0.00025 | ms/batch 619.94 | loss 7.42 | ppl 1668.553
| epoch 1 step 4800 | 4800 batches | lr 0.00025 | ms/batch 619.89 | loss 7.42 | ppl 1669.587
| epoch 1 step 5000 | 5000 batches | lr 0.00025 | ms/batch 620.04 | loss 7.44 | ppl 1705.483
| epoch 1 step 5200 | 5200 batches | lr 0.00025 | ms/batch 619.59 | loss 7.44 | ppl 1707.168
| epoch 1 step 5400 | 5400 batches | lr 0.00025 | ms/batch 619.41 | loss 7.41 | ppl 1656.997
| epoch 1 step 5600 | 5600 batches | lr 0.00025 | ms/batch 619.81 | loss 7.43 | ppl 1682.111
| epoch 1 step 5800 | 5800 batches | lr 0.000249 | ms/batch 620.20 | loss 7.44 | ppl 1695.797
| epoch 1 step 6000 | 6000 batches | lr 0.000249 | ms/batch 619.68 | loss 7.43 | ppl 1691.197
| epoch 1 step 6200 | 6200 batches | lr 0.000249 | ms/batch 619.43 | loss 7.41 | ppl 1654.504
| epoch 1 step 6400 | 6400 batches | lr 0.000249 | ms/batch 620.16 | loss 7.43 | ppl 1688.890
| epoch 1 step 6600 | 6600 batches | lr 0.000249 | ms/batch 620.38 | loss 7.43 | ppl 1678.386
| epoch 1 step 6800 | 6800 batches | lr 0.000249 | ms/batch 619.64 | loss 7.42 | ppl 1664.799
| epoch 1 step 7000 | 7000 batches | lr 0.000249 | ms/batch 619.72 | loss 7.42 | ppl 1670.465
| epoch 1 step 7200 | 7200 batches | lr 0.000249 | ms/batch 620.02 | loss 7.42 | ppl 1667.845
| epoch 1 step 7400 | 7400 batches | lr 0.000249 | ms/batch 619.56 | loss 7.42 | ppl 1670.860
| epoch 1 step 7600 | 7600 batches | lr 0.000249 | ms/batch 620.25 | loss 7.42 | ppl 1664.089
| epoch 1 step 7800 | 7800 batches | lr 0.000249 | ms/batch 619.83 | loss 7.42 | ppl 1661.560
| epoch 1 step 8000 | 8000 batches | lr 0.000249 | ms/batch 619.85 | loss 7.43 | ppl 1689.380
| Eval 2 at step 8000 | time: 2484.77s | valid loss 7.39 | valid ppl 1616.071
| epoch 1 step 8200 | 8200 batches | lr 0.000249 | ms/batch 672.76 | loss 7.43 | ppl 1680.628
| epoch 1 step 8400 | 8400 batches | lr 0.000249 | ms/batch 620.21 | loss 7.43 | ppl 1685.528
| epoch 1 step 8600 | 8600 batches | lr 0.000249 | ms/batch 619.91 | loss 7.43 | ppl 1684.851
| epoch 1 step 8800 | 8800 batches | lr 0.000249 | ms/batch 620.02 | loss 7.44 | ppl 1699.004
| epoch 1 step 9000 | 9000 batches | lr 0.000249 | ms/batch 619.53 | loss 7.42 | ppl 1667.265
| epoch 1 step 9200 | 9200 batches | lr 0.000249 | ms/batch 620.79 | loss 7.43 | ppl 1684.868
| epoch 1 step 9400 | 9400 batches | lr 0.000249 | ms/batch 620.06 | loss 7.42 | ppl 1672.693
| epoch 1 step 9600 | 9600 batches | lr 0.000249 | ms/batch 619.62 | loss 7.43 | ppl 1689.861
| epoch 1 step 9800 | 9800 batches | lr 0.000249 | ms/batch 619.44 | loss 7.41 | ppl 1652.922
| epoch 1 step 10000 | 10000 batches | lr 0.000248 | ms/batch 620.06 | loss 7.43 | ppl 1692.675
| epoch 1 step 10200 | 10200 batches | lr 0.000248 | ms/batch 619.46 | loss 7.41 | ppl 1653.468
| epoch 1 step 10400 | 10400 batches | lr 0.000248 | ms/batch 619.62 | loss 7.41 | ppl 1651.442
| epoch 1 step 10600 | 10600 batches | lr 0.000248 | ms/batch 620.05 | loss 7.41 | ppl 1652.406
| epoch 1 step 10800 | 10800 batches | lr 0.000248 | ms/batch 619.74 | loss 7.41 | ppl 1658.664
| epoch 1 step 11000 | 11000 batches | lr 0.000248 | ms/batch 619.59 | loss 7.44 | ppl 1694.259
| epoch 1 step 11200 | 11200 batches | lr 0.000248 | ms/batch 619.55 | loss 7.42 | ppl 1672.915
| epoch 1 step 11400 | 11400 batches | lr 0.000248 | ms/batch 619.08 | loss 7.42 | ppl 1664.737
| epoch 2 step 11600 | 130 batches | lr 0.000248 | ms/batch 620.83 | loss 7.38 | ppl 1601.478
| epoch 2 step 11800 | 330 batches | lr 0.000248 | ms/batch 621.36 | loss 7.31 | ppl 1490.017
| epoch 2 step 12000 | 530 batches | lr 0.000248 | ms/batch 621.48 | loss 7.33 | ppl 1523.110
| Eval 3 at step 12000 | time: 2485.37s | valid loss 7.42 | valid ppl 1674.380
| epoch 2 step 12200 | 730 batches | lr 0.000248 | ms/batch 648.06 | loss 7.31 | ppl 1498.382
| epoch 2 step 12400 | 930 batches | lr 0.000248 | ms/batch 621.16 | loss 7.33 | ppl 1527.507
| epoch 2 step 12600 | 1130 batches | lr 0.000248 | ms/batch 621.00 | loss 7.33 | ppl 1530.506
| epoch 2 step 12800 | 1330 batches | lr 0.000247 | ms/batch 620.95 | loss 7.33 | ppl 1525.309
| epoch 2 step 13000 | 1530 batches | lr 0.000247 | ms/batch 621.32 | loss 7.33 | ppl 1527.929
| epoch 2 step 13200 | 1730 batches | lr 0.000247 | ms/batch 621.35 | loss 7.34 | ppl 1543.376
| epoch 2 step 13400 | 1930 batches | lr 0.000247 | ms/batch 621.04 | loss 7.33 | ppl 1523.908
| epoch 2 step 13600 | 2130 batches | lr 0.000247 | ms/batch 621.29 | loss 7.34 | ppl 1546.512
| epoch 2 step 13800 | 2330 batches | lr 0.000247 | ms/batch 620.99 | loss 7.34 | ppl 1545.263
| epoch 2 step 14000 | 2530 batches | lr 0.000247 | ms/batch 621.08 | loss 7.34 | ppl 1540.291
| epoch 2 step 14200 | 2730 batches | lr 0.000247 | ms/batch 620.93 | loss 7.34 | ppl 1540.285
| epoch 2 step 14400 | 2930 batches | lr 0.000247 | ms/batch 621.68 | loss 7.34 | ppl 1540.759
| epoch 2 step 14600 | 3130 batches | lr 0.000247 | ms/batch 621.22 | loss 7.32 | ppl 1512.795
| epoch 2 step 14800 | 3330 batches | lr 0.000247 | ms/batch 621.04 | loss 7.32 | ppl 1506.678
| epoch 2 step 15000 | 3530 batches | lr 0.000247 | ms/batch 621.31 | loss 7.33 | ppl 1530.028
| epoch 2 step 15200 | 3730 batches | lr 0.000246 | ms/batch 621.44 | loss 7.34 | ppl 1537.768
| epoch 2 step 15400 | 3930 batches | lr 0.000246 | ms/batch 621.56 | loss 7.33 | ppl 1532.047
| epoch 2 step 15600 | 4130 batches | lr 0.000246 | ms/batch 622.21 | loss 7.34 | ppl 1535.568
| epoch 2 step 15800 | 4330 batches | lr 0.000246 | ms/batch 621.75 | loss 7.34 | ppl 1537.776
| epoch 2 step 16000 | 4530 batches | lr 0.000246 | ms/batch 621.52 | loss 7.33 | ppl 1524.707
| Eval 4 at step 16000 | time: 2490.61s | valid loss 7.42 | valid ppl 1664.061
| epoch 2 step 16200 | 4730 batches | lr 0.000246 | ms/batch 648.01 | loss 7.33 | ppl 1531.670
| epoch 2 step 16400 | 4930 batches | lr 0.000246 | ms/batch 621.81 | loss 7.35 | ppl 1561.311
| epoch 2 step 16600 | 5130 batches | lr 0.000246 | ms/batch 621.62 | loss 7.35 | ppl 1558.448
| epoch 2 step 16800 | 5330 batches | lr 0.000246 | ms/batch 621.31 | loss 7.34 | ppl 1544.213
| epoch 2 step 17000 | 5530 batches | lr 0.000246 | ms/batch 621.27 | loss 7.33 | ppl 1521.180
| epoch 2 step 17200 | 5730 batches | lr 0.000245 | ms/batch 621.11 | loss 7.36 | ppl 1577.214
| epoch 2 step 17400 | 5930 batches | lr 0.000245 | ms/batch 620.95 | loss 7.35 | ppl 1551.961
| epoch 2 step 17600 | 6130 batches | lr 0.000245 | ms/batch 620.91 | loss 7.34 | ppl 1546.448
| epoch 2 step 17800 | 6330 batches | lr 0.000245 | ms/batch 621.03 | loss 7.34 | ppl 1534.776
| epoch 2 step 18000 | 6530 batches | lr 0.000245 | ms/batch 621.68 | loss 7.36 | ppl 1571.506
| epoch 2 step 18200 | 6730 batches | lr 0.000245 | ms/batch 621.27 | loss 7.34 | ppl 1535.712
| epoch 2 step 18400 | 6930 batches | lr 0.000245 | ms/batch 621.65 | loss 7.34 | ppl 1538.308
| epoch 2 step 18600 | 7130 batches | lr 0.000245 | ms/batch 620.88 | loss 7.34 | ppl 1541.480
| epoch 2 step 18800 | 7330 batches | lr 0.000245 | ms/batch 621.06 | loss 7.34 | ppl 1539.062
| epoch 2 step 19000 | 7530 batches | lr 0.000244 | ms/batch 621.02 | loss 7.35 | ppl 1556.423
| epoch 2 step 19200 | 7730 batches | lr 0.000244 | ms/batch 621.01 | loss 7.33 | ppl 1530.237
| epoch 2 step 19400 | 7930 batches | lr 0.000244 | ms/batch 621.38 | loss 7.35 | ppl 1560.169
| epoch 2 step 19600 | 8130 batches | lr 0.000244 | ms/batch 621.06 | loss 7.34 | ppl 1543.635
| epoch 2 step 19800 | 8330 batches | lr 0.000244 | ms/batch 621.31 | loss 7.34 | ppl 1546.412
| epoch 2 step 20000 | 8530 batches | lr 0.000244 | ms/batch 620.97 | loss 7.36 | ppl 1573.621
| Eval 5 at step 20000 | time: 2490.28s | valid loss 7.40 | valid ppl 1642.454
| epoch 2 step 20200 | 8730 batches | lr 0.000244 | ms/batch 648.18 | loss 7.35 | ppl 1552.783
| epoch 2 step 20400 | 8930 batches | lr 0.000244 | ms/batch 621.19 | loss 7.35 | ppl 1561.782
| epoch 2 step 20600 | 9130 batches | lr 0.000244 | ms/batch 621.56 | loss 7.35 | ppl 1551.505
| epoch 2 step 20800 | 9330 batches | lr 0.000243 | ms/batch 621.11 | loss 7.35 | ppl 1550.757
| epoch 2 step 21000 | 9530 batches | lr 0.000243 | ms/batch 621.23 | loss 7.36 | ppl 1576.482
| epoch 2 step 21200 | 9730 batches | lr 0.000243 | ms/batch 621.01 | loss 7.34 | ppl 1534.469
| epoch 2 step 21400 | 9930 batches | lr 0.000243 | ms/batch 621.09 | loss 7.35 | ppl 1552.626
| epoch 2 step 21600 | 10130 batches | lr 0.000243 | ms/batch 621.28 | loss 7.35 | ppl 1550.348
| epoch 2 step 21800 | 10330 batches | lr 0.000243 | ms/batch 621.55 | loss 7.35 | ppl 1555.845
| epoch 2 step 22000 | 10530 batches | lr 0.000243 | ms/batch 620.96 | loss 7.34 | ppl 1533.085
| epoch 2 step 22200 | 10730 batches | lr 0.000242 | ms/batch 620.96 | loss 7.35 | ppl 1556.160
| epoch 2 step 22400 | 10930 batches | lr 0.000242 | ms/batch 621.35 | loss 7.35 | ppl 1562.793
| epoch 2 step 22600 | 11130 batches | lr 0.000242 | ms/batch 620.82 | loss 7.35 | ppl 1563.720
| epoch 2 step 22800 | 11330 batches | lr 0.000242 | ms/batch 621.20 | loss 7.36 | ppl 1566.230
| epoch 3 step 23000 | 60 batches | lr 0.000242 | ms/batch 620.74 | loss 7.34 | ppl 1541.840
| epoch 3 step 23200 | 260 batches | lr 0.000242 | ms/batch 621.55 | loss 7.28 | ppl 1453.898
| epoch 3 step 23400 | 460 batches | lr 0.000242 | ms/batch 622.00 | loss 7.30 | ppl 1479.356
| epoch 3 step 23600 | 660 batches | lr 0.000242 | ms/batch 621.62 | loss 7.29 | ppl 1458.763
| epoch 3 step 23800 | 860 batches | lr 0.000241 | ms/batch 621.74 | loss 7.31 | ppl 1488.324
| epoch 3 step 24000 | 1060 batches | lr 0.000241 | ms/batch 622.03 | loss 7.30 | ppl 1476.624
| Eval 6 at step 24000 | time: 2490.67s | valid loss 7.44 | valid ppl 1703.611
| epoch 3 step 24200 | 1260 batches | lr 0.000241 | ms/batch 648.34 | loss 7.30 | ppl 1478.781
| epoch 3 step 24400 | 1460 batches | lr 0.000241 | ms/batch 621.87 | loss 7.30 | ppl 1476.575
| epoch 3 step 24600 | 1660 batches | lr 0.000241 | ms/batch 621.75 | loss 7.31 | ppl 1499.300
| epoch 3 step 24800 | 1860 batches | lr 0.000241 | ms/batch 621.83 | loss 7.30 | ppl 1477.016
| epoch 3 step 25000 | 2060 batches | lr 0.00024 | ms/batch 622.08 | loss 7.31 | ppl 1500.889
| epoch 3 step 25200 | 2260 batches | lr 0.00024 | ms/batch 621.62 | loss 7.31 | ppl 1495.962
| epoch 3 step 25400 | 2460 batches | lr 0.00024 | ms/batch 621.89 | loss 7.31 | ppl 1492.161
| epoch 3 step 25600 | 2660 batches | lr 0.00024 | ms/batch 621.87 | loss 7.31 | ppl 1492.371
| epoch 3 step 25800 | 2860 batches | lr 0.00024 | ms/batch 621.40 | loss 7.31 | ppl 1491.645
| epoch 3 step 26000 | 3060 batches | lr 0.00024 | ms/batch 621.84 | loss 7.30 | ppl 1484.346
| epoch 3 step 26200 | 3260 batches | lr 0.00024 | ms/batch 621.87 | loss 7.29 | ppl 1466.224
| epoch 3 step 26400 | 3460 batches | lr 0.000239 | ms/batch 621.72 | loss 7.29 | ppl 1471.563
| epoch 3 step 26600 | 3660 batches | lr 0.000239 | ms/batch 621.46 | loss 7.30 | ppl 1484.499
| epoch 3 step 26800 | 3860 batches | lr 0.000239 | ms/batch 621.93 | loss 7.31 | ppl 1492.562
| epoch 3 step 27000 | 4060 batches | lr 0.000239 | ms/batch 621.78 | loss 7.30 | ppl 1474.048
| epoch 3 step 27200 | 4260 batches | lr 0.000239 | ms/batch 621.84 | loss 7.31 | ppl 1498.078
| epoch 3 step 27400 | 4460 batches | lr 0.000239 | ms/batch 621.81 | loss 7.30 | ppl 1477.633
| epoch 3 step 27600 | 4660 batches | lr 0.000238 | ms/batch 621.69 | loss 7.31 | ppl 1491.983
| epoch 3 step 27800 | 4860 batches | lr 0.000238 | ms/batch 621.52 | loss 7.32 | ppl 1507.707
| epoch 3 step 28000 | 5060 batches | lr 0.000238 | ms/batch 621.40 | loss 7.32 | ppl 1507.557
| Eval 7 at step 28000 | time: 2492.27s | valid loss 7.44 | valid ppl 1706.144
| epoch 3 step 28200 | 5260 batches | lr 0.000238 | ms/batch 648.62 | loss 7.32 | ppl 1502.753
| epoch 3 step 28400 | 5460 batches | lr 0.000238 | ms/batch 621.67 | loss 7.31 | ppl 1487.829
| epoch 3 step 28600 | 5660 batches | lr 0.000238 | ms/batch 621.82 | loss 7.33 | ppl 1518.292
| epoch 3 step 28800 | 5860 batches | lr 0.000237 | ms/batch 621.37 | loss 7.32 | ppl 1510.123
| epoch 3 step 29000 | 6060 batches | lr 0.000237 | ms/batch 621.82 | loss 7.31 | ppl 1501.527
| epoch 3 step 29200 | 6260 batches | lr 0.000237 | ms/batch 621.80 | loss 7.31 | ppl 1488.194
| epoch 3 step 29400 | 6460 batches | lr 0.000237 | ms/batch 621.68 | loss 7.32 | ppl 1513.990
| epoch 3 step 29600 | 6660 batches | lr 0.000237 | ms/batch 621.60 | loss 7.32 | ppl 1503.472
| epoch 3 step 29800 | 6860 batches | lr 0.000237 | ms/batch 621.45 | loss 7.31 | ppl 1491.761
| epoch 3 step 30000 | 7060 batches | lr 0.000236 | ms/batch 621.66 | loss 7.31 | ppl 1500.846
| epoch 3 step 30200 | 7260 batches | lr 0.000236 | ms/batch 621.62 | loss 7.31 | ppl 1494.195
| epoch 3 step 30400 | 7460 batches | lr 0.000236 | ms/batch 621.88 | loss 7.31 | ppl 1501.704
| epoch 3 step 30600 | 7660 batches | lr 0.000236 | ms/batch 621.50 | loss 7.31 | ppl 1493.181
| epoch 3 step 30800 | 7860 batches | lr 0.000236 | ms/batch 622.06 | loss 7.31 | ppl 1493.227
| epoch 3 step 31000 | 8060 batches | lr 0.000235 | ms/batch 621.64 | loss 7.31 | ppl 1501.180
| epoch 3 step 31200 | 8260 batches | lr 0.000235 | ms/batch 621.87 | loss 7.31 | ppl 1501.493
| epoch 3 step 31400 | 8460 batches | lr 0.000235 | ms/batch 621.95 | loss 7.33 | ppl 1518.169
| epoch 3 step 31600 | 8660 batches | lr 0.000235 | ms/batch 621.71 | loss 7.31 | ppl 1502.388
| epoch 3 step 31800 | 8860 batches | lr 0.000235 | ms/batch 621.56 | loss 7.32 | ppl 1511.796
| epoch 3 step 32000 | 9060 batches | lr 0.000235 | ms/batch 621.97 | loss 7.31 | ppl 1500.025
| Eval 8 at step 32000 | time: 2492.28s | valid loss 7.44 | valid ppl 1708.262
| epoch 3 step 32200 | 9260 batches | lr 0.000234 | ms/batch 648.48 | loss 7.32 | ppl 1502.871
| epoch 3 step 32400 | 9460 batches | lr 0.000234 | ms/batch 621.60 | loss 7.33 | ppl 1527.820
| epoch 3 step 32600 | 9660 batches | lr 0.000234 | ms/batch 621.64 | loss 7.31 | ppl 1492.826
| epoch 3 step 32800 | 9860 batches | lr 0.000234 | ms/batch 621.63 | loss 7.31 | ppl 1495.967
| epoch 3 step 33000 | 10060 batches | lr 0.000234 | ms/batch 622.10 | loss 7.33 | ppl 1524.507
| epoch 3 step 33200 | 10260 batches | lr 0.000233 | ms/batch 621.53 | loss 7.30 | ppl 1485.690
| epoch 3 step 33400 | 10460 batches | lr 0.000233 | ms/batch 621.17 | loss 7.31 | ppl 1492.452
| epoch 3 step 33600 | 10660 batches | lr 0.000233 | ms/batch 621.04 | loss 7.31 | ppl 1498.407
| epoch 3 step 33800 | 10860 batches | lr 0.000233 | ms/batch 621.77 | loss 7.32 | ppl 1512.605
| epoch 3 step 34000 | 11060 batches | lr 0.000233 | ms/batch 621.23 | loss 7.32 | ppl 1504.775
| epoch 3 step 34200 | 11260 batches | lr 0.000232 | ms/batch 621.52 | loss 7.33 | ppl 1523.020
| epoch 3 step 34400 | 11460 batches | lr 0.000232 | ms/batch 620.80 | loss 7.31 | ppl 1491.825
| epoch 4 step 34600 | 190 batches | lr 0.000232 | ms/batch 621.64 | loss 7.29 | ppl 1465.376
| epoch 4 step 34800 | 390 batches | lr 0.000232 | ms/batch 621.66 | loss 7.28 | ppl 1450.339
| epoch 4 step 35000 | 590 batches | lr 0.000232 | ms/batch 621.68 | loss 7.28 | ppl 1457.213
| epoch 4 step 35200 | 790 batches | lr 0.000231 | ms/batch 621.64 | loss 7.28 | ppl 1456.289
| epoch 4 step 35400 | 990 batches | lr 0.000231 | ms/batch 621.70 | loss 7.29 | ppl 1464.555
| epoch 4 step 35600 | 1190 batches | lr 0.000231 | ms/batch 621.51 | loss 7.28 | ppl 1455.969
| epoch 4 step 35800 | 1390 batches | lr 0.000231 | ms/batch 621.76 | loss 7.29 | ppl 1460.029
| epoch 4 step 36000 | 1590 batches | lr 0.000231 | ms/batch 622.02 | loss 7.29 | ppl 1469.949
| Eval 9 at step 36000 | time: 2491.61s | valid loss 7.45 | valid ppl 1727.050
| epoch 4 step 36200 | 1790 batches | lr 0.00023 | ms/batch 648.55 | loss 7.29 | ppl 1465.296
| epoch 4 step 36400 | 1990 batches | lr 0.00023 | ms/batch 621.63 | loss 7.29 | ppl 1470.353
| epoch 4 step 36600 | 2190 batches | lr 0.00023 | ms/batch 621.60 | loss 7.29 | ppl 1466.556
| epoch 4 step 36800 | 2390 batches | lr 0.00023 | ms/batch 621.81 | loss 7.30 | ppl 1481.549
| epoch 4 step 37000 | 2590 batches | lr 0.000229 | ms/batch 621.74 | loss 7.29 | ppl 1459.671
| epoch 4 step 37200 | 2790 batches | lr 0.000229 | ms/batch 622.05 | loss 7.30 | ppl 1475.031
| epoch 4 step 37400 | 2990 batches | lr 0.000229 | ms/batch 621.83 | loss 7.29 | ppl 1471.230
| epoch 4 step 37600 | 3190 batches | lr 0.000229 | ms/batch 621.65 | loss 7.28 | ppl 1444.171
| epoch 4 step 37800 | 3390 batches | lr 0.000229 | ms/batch 621.73 | loss 7.27 | ppl 1439.950
| epoch 4 step 38000 | 3590 batches | lr 0.000228 | ms/batch 621.45 | loss 7.28 | ppl 1454.786
| epoch 4 step 38200 | 3790 batches | lr 0.000228 | ms/batch 622.01 | loss 7.30 | ppl 1479.319
| epoch 4 step 38400 | 3990 batches | lr 0.000228 | ms/batch 621.87 | loss 7.28 | ppl 1451.045
| epoch 4 step 38600 | 4190 batches | lr 0.000228 | ms/batch 622.00 | loss 7.30 | ppl 1475.214
| epoch 4 step 38800 | 4390 batches | lr 0.000227 | ms/batch 621.46 | loss 7.29 | ppl 1460.207
| epoch 4 step 39000 | 4590 batches | lr 0.000227 | ms/batch 621.24 | loss 7.29 | ppl 1462.538
| epoch 4 step 39200 | 4790 batches | lr 0.000227 | ms/batch 621.45 | loss 7.29 | ppl 1468.254
| epoch 4 step 39400 | 4990 batches | lr 0.000227 | ms/batch 621.85 | loss 7.31 | ppl 1493.102
| epoch 4 step 39600 | 5190 batches | lr 0.000227 | ms/batch 622.29 | loss 7.31 | ppl 1493.513
| epoch 4 step 39800 | 5390 batches | lr 0.000226 | ms/batch 621.24 | loss 7.29 | ppl 1459.525
| epoch 4 step 40000 | 5590 batches | lr 0.000226 | ms/batch 621.47 | loss 7.30 | ppl 1474.168
| Eval 10 at step 40000 | time: 2492.18s | valid loss 7.46 | valid ppl 1730.644
| epoch 4 step 40200 | 5790 batches | lr 0.000226 | ms/batch 648.23 | loss 7.31 | ppl 1495.160
| epoch 4 step 40400 | 5990 batches | lr 0.000226 | ms/batch 621.79 | loss 7.30 | ppl 1486.213
| epoch 4 step 40600 | 6190 batches | lr 0.000225 | ms/batch 621.62 | loss 7.29 | ppl 1462.185
| epoch 4 step 40800 | 6390 batches | lr 0.000225 | ms/batch 621.89 | loss 7.30 | ppl 1473.741
| epoch 4 step 41000 | 6590 batches | lr 0.000225 | ms/batch 621.52 | loss 7.31 | ppl 1488.068
| epoch 4 step 41200 | 6790 batches | lr 0.000225 | ms/batch 621.60 | loss 7.29 | ppl 1458.729
| epoch 4 step 41400 | 6990 batches | lr 0.000224 | ms/batch 621.76 | loss 7.30 | ppl 1480.750
| epoch 4 step 41600 | 7190 batches | lr 0.000224 | ms/batch 621.84 | loss 7.29 | ppl 1471.271
| epoch 4 step 41800 | 7390 batches | lr 0.000224 | ms/batch 621.72 | loss 7.30 | ppl 1479.743
| epoch 4 step 42000 | 7590 batches | lr 0.000224 | ms/batch 621.64 | loss 7.30 | ppl 1473.643
| epoch 4 step 42200 | 7790 batches | lr 0.000224 | ms/batch 621.37 | loss 7.28 | ppl 1447.959
| epoch 4 step 42400 | 7990 batches | lr 0.000223 | ms/batch 621.54 | loss 7.31 | ppl 1495.434
| epoch 4 step 42600 | 8190 batches | lr 0.000223 | ms/batch 621.55 | loss 7.29 | ppl 1462.527
| epoch 4 step 42800 | 8390 batches | lr 0.000223 | ms/batch 621.60 | loss 7.31 | ppl 1489.709
| epoch 4 step 43000 | 8590 batches | lr 0.000223 | ms/batch 621.65 | loss 7.30 | ppl 1484.012
| epoch 4 step 43200 | 8790 batches | lr 0.000222 | ms/batch 621.62 | loss 7.31 | ppl 1491.229
| epoch 4 step 43400 | 8990 batches | lr 0.000222 | ms/batch 621.14 | loss 7.30 | ppl 1476.093
| epoch 4 step 43600 | 9190 batches | lr 0.000222 | ms/batch 621.67 | loss 7.31 | ppl 1487.912
| epoch 4 step 43800 | 9390 batches | lr 0.000222 | ms/batch 621.81 | loss 7.30 | ppl 1474.515
| epoch 4 step 44000 | 9590 batches | lr 0.000221 | ms/batch 621.78 | loss 7.31 | ppl 1489.621
| Eval 11 at step 44000 | time: 2491.86s | valid loss 7.44 | valid ppl 1706.983
| epoch 4 step 44200 | 9790 batches | lr 0.000221 | ms/batch 648.36 | loss 7.29 | ppl 1464.462
| epoch 4 step 44400 | 9990 batches | lr 0.000221 | ms/batch 621.91 | loss 7.31 | ppl 1492.388
| epoch 4 step 44600 | 10190 batches | lr 0.000221 | ms/batch 621.52 | loss 7.29 | ppl 1465.266
| epoch 4 step 44800 | 10390 batches | lr 0.00022 | ms/batch 621.12 | loss 7.30 | ppl 1477.299
| epoch 4 step 45000 | 10590 batches | lr 0.00022 | ms/batch 620.76 | loss 7.29 | ppl 1464.611
| epoch 4 step 45200 | 10790 batches | lr 0.00022 | ms/batch 621.72 | loss 7.30 | ppl 1475.448
| epoch 4 step 45400 | 10990 batches | lr 0.00022 | ms/batch 621.22 | loss 7.31 | ppl 1493.140
| epoch 4 step 45600 | 11190 batches | lr 0.000219 | ms/batch 621.56 | loss 7.30 | ppl 1486.864
| epoch 4 step 45800 | 11390 batches | lr 0.000219 | ms/batch 621.13 | loss 7.29 | ppl 1470.398
| epoch 5 step 46000 | 120 batches | lr 0.000219 | ms/batch 621.16 | loss 7.29 | ppl 1462.640
| epoch 5 step 46200 | 320 batches | lr 0.000219 | ms/batch 622.12 | loss 7.27 | ppl 1429.701
| epoch 5 step 46400 | 520 batches | lr 0.000218 | ms/batch 622.40 | loss 7.28 | ppl 1455.915
| epoch 5 step 46600 | 720 batches | lr 0.000218 | ms/batch 621.56 | loss 7.27 | ppl 1429.803
| epoch 5 step 46800 | 920 batches | lr 0.000218 | ms/batch 621.79 | loss 7.28 | ppl 1447.379
| epoch 5 step 47000 | 1120 batches | lr 0.000217 | ms/batch 621.45 | loss 7.28 | ppl 1449.542
| epoch 5 step 47200 | 1320 batches | lr 0.000217 | ms/batch 621.70 | loss 7.28 | ppl 1444.158
| epoch 5 step 47400 | 1520 batches | lr 0.000217 | ms/batch 622.01 | loss 7.27 | ppl 1441.529
| epoch 5 step 47600 | 1720 batches | lr 0.000217 | ms/batch 621.49 | loss 7.28 | ppl 1456.560
| epoch 5 step 47800 | 1920 batches | lr 0.000216 | ms/batch 621.47 | loss 7.27 | ppl 1440.568
| epoch 5 step 48000 | 2120 batches | lr 0.000216 | ms/batch 621.64 | loss 7.29 | ppl 1461.721
| Eval 12 at step 48000 | time: 2491.63s | valid loss 7.47 | valid ppl 1752.609
From what we can see, this divergence is likely due to the fact that warmup_step is set to 0.
Specifically, when using our proposed relative positional encoding (attn_type=0), training and optimization are significantly easier and more stable. Hence, we don't need any warmup_step on wt103 at all. This is another advantage of the proposed relative positional encoding that we didn't elaborate on in the paper.
In contrast, when using the absolute positional encoding (attn_type=2), you may have to increase warmup_step to 4000 or 10000 just to make sure the model does not diverge at the beginning, especially when the model size is large (as in your case).
Please try increasing warmup_step; a sketch of the resulting schedule follows.
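For concreteness, here is a minimal sketch of a linear-warmup-into-cosine-decay schedule using the values from the config dump above (lr=0.00025, max_step=200000, eta_min=0.0) with warmup_step=4000. It mirrors the shape of the schedule those flags describe; it is illustrative, not the repo's exact scheduler code.

```python
import math

def lr_at_step(step, base_lr=0.00025, warmup_step=4000,
               max_step=200000, eta_min=0.0):
    """Linear warmup followed by cosine decay to eta_min.

    A sketch of the schedule implied by the flags above,
    not the repo's exact implementation.
    """
    if step < warmup_step:
        # Ramp linearly from 0 to base_lr so the early updates stay
        # small and the absolute positional encoding does not diverge.
        return base_lr * step / warmup_step
    # Cosine decay from base_lr down to eta_min over the remaining steps.
    progress = (step - warmup_step) / (max_step - warmup_step)
    return eta_min + (base_lr - eta_min) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With these values, the learning rate reaches 0.00025 at step 4000 and then decays smoothly toward 0 by step 200000, which matches the lr values printed in the training log above.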
Thank you for your reply. When I increase warmup_step to 4000, training seems to be normal. I have another question: in order to get the probability of each word in a sentence (doing inference instead of evaluation), how should we set tgt_len and design the data stream (for attn_type=2)? I noticed that your paper says: "During evaluation, at each step, the vanilla model also consumes a segment of the same length as in training, but only makes one prediction at the last position. Then, at the next step, the segment is shifted to the right by only one position, and the new segment has to be processed all from scratch."
Since this is not the intended use case of this repo, you will need to modify the data iterator. Sorry about that.
@HyacinthJingjing, besides the data iterator, for the PyTorch code, set ext_len=0, tgt_len=1, and mem_len=<context_length>. You'll also need a function to calculate logits (inheriting ProjectedAdaptiveLogSoftmax and calling _compute_logit is convenient); see the sketch below.
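To make that concrete, here is a minimal sketch assuming the single-cluster setup from the config in this thread (div_val=1) and the classes in the repo's pytorch/ directory. The names LogitAdaptiveSoftmax and score_sentence are hypothetical helpers, not part of the repo, and the model's forward signature should be double-checked against mem_transformer.py.

```python
import torch
from utils.proj_adaptive_softmax import ProjectedAdaptiveLogSoftmax

class LogitAdaptiveSoftmax(ProjectedAdaptiveLogSoftmax):
    """Expose full-vocabulary logits instead of only the NLL loss.

    Sketch for the single-cluster case (div_val=1, no cutoffs); the
    adaptive multi-cluster case needs the per-cluster logic from the
    parent's forward().
    """
    def logits(self, hidden):
        # _compute_logit projects hidden states onto the output
        # embedding; with one cluster everything lives at index 0.
        return self._compute_logit(hidden,
                                   self.out_layers[0].weight,
                                   self.out_layers[0].bias,
                                   self.out_projs[0])

@torch.no_grad()
def score_sentence(model, token_ids):
    """Per-token log-probabilities via one-step-at-a-time evaluation.

    `model` is assumed to be a trained MemTransformerLM built with
    tgt_len=1, ext_len=0, and mem_len=<context_length>; `token_ids`
    is a 1-D LongTensor. Hypothetical helper, not part of the repo.
    """
    model.eval()
    mems = tuple()  # start with empty memory
    log_probs = []
    for i in range(token_ids.size(0) - 1):
        data = token_ids[i].view(1, 1)        # shape (tgt_len=1, bsz=1)
        target = token_ids[i + 1].view(1, 1)
        ret = model(data, target, *mems)      # returns [nll, *new_mems]
        nll, mems = ret[0], ret[1:]
        log_probs.append(-nll.item())         # log p(target | context)
    return log_probs
```

This follows the one-prediction-per-step evaluation scheme quoted from the paper above, except that the memory carries the context forward instead of reprocessing the whole segment from scratch.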
By the way, the sample_softmax option seems buggy in the PyTorch code. @zihangdai