
AssertionError: Input not finite

Status: Open. dikkeaap opened this issue 4 years ago • 2 comments

Hi,

We recently trained a model successfully for a plant species sequenced on the MinION using an R9.4 flow cell. We have also sequenced the same plant species on the MinION with an R10.3 flow cell and successfully trained a model with those data.

We have now sequenced the same plant (again) on a PromethION R10.4 flow cell, but we run into an error when attempting to train a model:

* Taiyaki version 5.1.0
* Platform is Linux-4.15.0-38-generic-x86_64-with-debian-buster-sid
* PyTorch version 1.2.0
* CUDA version 10.0.130 on device GeForce GTX 1080 Ti
* Command line:
* "/opt/kgapps/taiyaki/bin/train_flipflop.py resume2/model_checkpoint_00018.checkpoint mapped_reads_2.hdf5 --min_sub_batch_size 48 --outdir resume3 --lr_max 0.00160 --niteration 40000 --lr_cosine_iters 30000 --overwrite --device 0
* Started on 2020-09-25 08:40:06.741154
* Loading data from mapped_reads_2.hdf5
* Per read file MD5 62e8f6baab6b7ca1d1c046bdaed7e933
* Reads not filtered by id
* Using alphabet definition: canonical alphabet ACGT and no modified bases
* Loaded 14191 reads.
* Reading network from resume2/model_checkpoint_00018.checkpoint
* Network has 10683280 parameters.
* Loaded standard (canonical bases-only) model.
* Dumping initial model
* Sampled 100000 chunks: median(mean_dwell)=9.20, mad(mean_dwell)=0.89
* Learning rate goes like cosine from lr_max to lr_min over 30000.0 iterations.
* At start, train for 200 batches at warm-up learning rate 0.0001
* Standard loss reporting from 141 validation reads held out of training. 
* Standard loss report: chunk length = 5500 & sub-batch size = 48 for 10 sub-batches. 
* Gradient L2 norm cap will be upper 0.05 quantile of the last 100 norms.
* Training
..................................................     1 0.10118 0.10477  116.30s (164.95 ksample/s 18.39 kbase/s) lr=1.00e-04  22.8% chunks filtered
..................................................     2 0.10152 0.10432  116.79s (164.35 ksample/s 18.31 kbase/s) lr=1.00e-04  23.1% chunks filtered
..................................................     3 0.10123 0.10401  110.55s (173.69 ksample/s 19.35 kbase/s) lr=1.00e-04  22.4% chunks filtered
..................................................     4 0.10249 0.10369  117.33s (163.65 ksample/s 18.22 kbase/s) lr=1.00e-04  22.5% chunks filtered
............................Traceback (most recent call last):
  File "/opt/kgapps/taiyaki/bin/train_flipflop.py", line 4, in <module>
    __import__('pkg_resources').run_script('taiyaki==5.1.0', 'train_flipflop.py')
  File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1446, in run_script
    exec(code, namespace, namespace)
  File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/taiyaki-5.1.0-py3.7-linux-x86_64.egg/EGG-INFO/scripts/train_flipflop.py", line 624, in <module>
    main()
  File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/taiyaki-5.1.0-py3.7-linux-x86_64.egg/EGG-INFO/scripts/train_flipflop.py", line 541, in main
    mod_factor_t, calc_grads = True )
  File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/taiyaki-5.1.0-py3.7-linux-x86_64.egg/EGG-INFO/scripts/train_flipflop.py", line 247, in calculate_loss
    outputs, seqs, seqlens, sharpen)
  File "taiyaki/ctc/ctc.pyx", line 88, in taiyaki.ctc.ctc.FlipFlopCRF.forward
  File "taiyaki/ctc/ctc.pyx", line 62, in taiyaki.ctc.ctc.crf_flipflop_grad
AssertionError: Input not finite
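The traceback shows the assertion firing in crf_flipflop_grad (taiyaki/ctc/ctc.pyx) while the flip-flop loss is being computed. Judging by its message, it presumably fires when the values handed to the loss contain NaN or infinite entries, which is often a sign that training has diverged. The pure-Python sketch below illustrates that kind of check; check_finite and first_non_finite are hypothetical helpers, not part of taiyaki, whose real check runs on arrays inside compiled code.

```python
import math
from typing import Iterable


def check_finite(values: Iterable[float]) -> None:
    """Raise AssertionError if any value is NaN or +/-inf.

    Hypothetical stand-in for the kind of guard that produces
    'Input not finite'; taiyaki's actual check lives in compiled code.
    """
    values = list(values)
    assert all(math.isfinite(v) for v in values), "Input not finite"


def first_non_finite(values: Iterable[float]) -> int:
    """Return the index of the first NaN/inf value, or -1 if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return -1


if __name__ == "__main__":
    check_finite([0.1, -2.5, 3.0])  # passes silently
    print(first_non_finite([0.1, float("nan"), 3.0]))
```

Logging where the first non-finite value appears, rather than only asserting, can make it easier to tell whether a single batch or the model weights as a whole have gone bad.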

If we resume from the checkpoint, we run into the same error some time later:

* Taiyaki version 5.1.0
* Platform is Linux-4.15.0-38-generic-x86_64-with-debian-buster-sid
* PyTorch version 1.2.0
* CUDA version 10.0.130 on device GeForce GTX 1080 Ti
* Command line:
* "/opt/kgapps/taiyaki/bin/train_flipflop.py resume2/model_checkpoint_00018.checkpoint mapped_reads_2.hdf5 --min_sub_batch_size 48 --outdir resume3 --lr_max 0.00160 --niteration 40000 --lr_cosine_iters 30000 --overwrite --device 0
* Started on 2020-09-29 08:43:32.712279
* Loading data from mapped_reads_2.hdf5
* Per read file MD5 62e8f6baab6b7ca1d1c046bdaed7e933
* Reads not filtered by id
* Using alphabet definition: canonical alphabet ACGT and no modified bases
* Loaded 14191 reads.
* Reading network from resume2/model_checkpoint_00018.checkpoint
* Network has 10683280 parameters.
* Loaded standard (canonical bases-only) model.
* Dumping initial model
* Sampled 100000 chunks: median(mean_dwell)=9.20, mad(mean_dwell)=0.89
* Learning rate goes like cosine from lr_max to lr_min over 30000.0 iterations.
* At start, train for 200 batches at warm-up learning rate 0.0001
* Standard loss reporting from 141 validation reads held out of training. 
* Standard loss report: chunk length = 5500 & sub-batch size = 48 for 10 sub-batches. 
* Gradient L2 norm cap will be upper 0.05 quantile of the last 100 norms.
* Training
..................................................     1 0.10303 0.09424  114.06s (168.34 ksample/s 18.76 kbase/s) lr=1.00e-04  23.7% chunks filtered
..................................................     2 0.10134 0.09381  115.19s (166.80 ksample/s 18.57 kbase/s) lr=1.00e-04  23.0% chunks filtered
..................................................     3 0.10066 0.09348  121.96s (157.53 ksample/s 17.53 kbase/s) lr=1.00e-04  23.4% chunks filtered
..................................................     4 0.10001 0.09326  115.78s (165.98 ksample/s 18.45 kbase/s) lr=1.00e-04  23.1% chunks filtered
..................................................     5 0.10257 0.09576  112.31s (170.96 ksample/s 19.06 kbase/s) lr=1.60e-03  22.9% chunks filtered
..................................................     6 0.10423 0.09548  112.72s (170.15 ksample/s 18.97 kbase/s) lr=1.60e-03  22.6% chunks filtered
..................................................     7 0.10399 0.09554  116.29s (165.05 ksample/s 18.44 kbase/s) lr=1.60e-03  22.5% chunks filtered
..................................................     8 0.10446 0.09625  115.58s (166.09 ksample/s 18.52 kbase/s) lr=1.60e-03  22.6% chunks filtered
..................................................     9 0.10420 0.09789  115.14s (166.58 ksample/s 18.57 kbase/s) lr=1.60e-03  22.7% chunks filtered
..................................................    10 0.10316 0.09563  111.75s (171.81 ksample/s 19.07 kbase/s) lr=1.60e-03  22.6% chunks filtered
..................................................    11 0.10285 0.09823  113.14s (169.61 ksample/s 18.91 kbase/s) lr=1.60e-03  22.6% chunks filtered
..................................................    12 0.10350 0.09618  113.01s (169.96 ksample/s 18.88 kbase/s) lr=1.60e-03  22.5% chunks filtered
..................................................    13 0.10436 0.09668  118.44s (162.04 ksample/s 18.04 kbase/s) lr=1.60e-03  22.4% chunks filtered
..................................................    14 0.10136 0.09609  120.23s (159.74 ksample/s 17.71 kbase/s) lr=1.60e-03  22.5% chunks filtered
..................................................    15 0.10160 0.09668  114.07s (168.49 ksample/s 18.78 kbase/s) lr=1.60e-03  22.3% chunks filtered
..................................................    16 0.10144 0.09606  121.89s (157.41 ksample/s 17.53 kbase/s) lr=1.60e-03  22.4% chunks filtered
..................................................    17 0.10180 0.09654  113.69s (168.87 ksample/s 18.75 kbase/s) lr=1.60e-03  22.4% chunks filtered
..................................................    18 0.10381 0.09578  114.91s (166.92 ksample/s 18.63 kbase/s) lr=1.60e-03  22.3% chunks filtered
..................................................    19 0.10369 0.09621  119.26s (161.01 ksample/s 17.88 kbase/s) lr=1.60e-03  22.4% chunks filtered
.................................................C    20 0.10542 0.09585  116.87s (164.28 ksample/s 18.32 kbase/s) lr=1.60e-03  22.4% chunks filtered
..................................................    21 0.10229 0.09613  112.86s (170.12 ksample/s 18.89 kbase/s) lr=1.60e-03  22.4% chunks filtered
..................................................    22 0.10319 0.09585  119.18s (161.01 ksample/s 17.88 kbase/s) lr=1.60e-03  22.4% chunks filtered
..................................................    23 0.10428 0.09592  112.53s (170.74 ksample/s 19.03 kbase/s) lr=1.60e-03  22.3% chunks filtered
..................................................    24 0.10430 0.09659  118.36s (162.13 ksample/s 18.06 kbase/s) lr=1.60e-03  22.3% chunks filtered
..................................................    25 0.10297 0.09839  114.13s (168.21 ksample/s 18.72 kbase/s) lr=1.60e-03  22.3% chunks filtered
..................................................    26 0.10191 0.09591  119.56s (160.64 ksample/s 17.88 kbase/s) lr=1.60e-03  22.4% chunks filtered
..................................................    27 0.10201 0.09587  119.00s (161.43 ksample/s 18.00 kbase/s) lr=1.59e-03  22.4% chunks filtered
..................................................    28 0.10112 0.09673  115.11s (166.81 ksample/s 18.54 kbase/s) lr=1.59e-03  22.3% chunks filtered
..................................................    29 0.10262 0.09598  116.90s (164.09 ksample/s 18.27 kbase/s) lr=1.59e-03  22.3% chunks filtered
..................................................    30 0.10112 0.09629  119.62s (160.46 ksample/s 17.91 kbase/s) lr=1.59e-03  22.4% chunks filtered
..................................................    31 0.10133 0.09634  113.04s (169.94 ksample/s 18.97 kbase/s) lr=1.59e-03  22.4% chunks filtered
..................................................    32 0.10289 0.09638  114.80s (167.16 ksample/s 18.62 kbase/s) lr=1.59e-03  22.4% chunks filtered
..................................................    33 0.10126 0.09713  118.53s (162.10 ksample/s 18.09 kbase/s) lr=1.59e-03  22.4% chunks filtered
..................................................    34 0.10132 0.09618  114.32s (167.75 ksample/s 18.63 kbase/s) lr=1.59e-03  22.4% chunks filtered
..................................................    35 0.10256 0.09735  114.83s (167.14 ksample/s 18.61 kbase/s) lr=1.59e-03  22.4% chunks filtered
..................................................    36 0.10273 0.09627  118.25s (162.45 ksample/s 18.04 kbase/s) lr=1.59e-03  22.5% chunks filtered
..................................................    37 0.10070 0.09690  119.18s (160.95 ksample/s 17.92 kbase/s) lr=1.59e-03  22.5% chunks filtered
..................................................    38 0.10240 0.09587  122.12s (157.21 ksample/s 17.53 kbase/s) lr=1.59e-03  22.5% chunks filtered
..................................................    39 0.10228 0.09652  121.98s (157.32 ksample/s 17.50 kbase/s) lr=1.59e-03  22.5% chunks filtered
.................................................C    40 0.10154 0.09753  119.72s (160.40 ksample/s 17.86 kbase/s) lr=1.59e-03  22.5% chunks filtered
..................................................    41 0.10145 0.09766  116.83s (164.52 ksample/s 18.34 kbase/s) lr=1.59e-03  22.5% chunks filtered
..................................................    42 0.10305 0.09752  116.33s (165.06 ksample/s 18.37 kbase/s) lr=1.59e-03  22.5% chunks filtered
..................................................    43 0.10309 0.09718  113.95s (168.32 ksample/s 18.77 kbase/s) lr=1.58e-03  22.5% chunks filtered
..................................................    44 0.10519 0.09719  117.84s (163.13 ksample/s 18.14 kbase/s) lr=1.58e-03  22.5% chunks filtered
..................................................    45 0.10280 0.09720  114.28s (167.86 ksample/s 18.66 kbase/s) lr=1.58e-03  22.5% chunks filtered
..................................................    46 0.10338 0.09686  117.43s (163.33 ksample/s 18.20 kbase/s) lr=1.58e-03  22.5% chunks filtered
..................................................    47 0.10158 0.09750  117.49s (163.25 ksample/s 18.12 kbase/s) lr=1.58e-03  22.5% chunks filtered
..................................................    48 0.10328 0.09947  117.06s (163.95 ksample/s 18.26 kbase/s) lr=1.58e-03  22.5% chunks filtered
..................................................    49 0.10487 0.09910  115.14s (166.81 ksample/s 18.65 kbase/s) lr=1.58e-03  22.6% chunks filtered
..................................................    50 0.10118 0.09751  120.68s (159.18 ksample/s 17.63 kbase/s) lr=1.58e-03  22.6% chunks filtered
..................................................    51 0.10375 0.09986  115.22s (166.70 ksample/s 18.58 kbase/s) lr=1.58e-03  22.6% chunks filtered
..................................................    52 0.10603 0.09855  119.27s (161.08 ksample/s 17.95 kbase/s) lr=1.58e-03  22.6% chunks filtered
..................................................    53 0.10155 0.09740  116.16s (165.29 ksample/s 18.38 kbase/s) lr=1.58e-03  22.6% chunks filtered
..................................................    54 0.10260 0.09768  112.45s (170.72 ksample/s 18.98 kbase/s) lr=1.57e-03  22.6% chunks filtered
..................................................    55 0.10155 0.09779  116.01s (165.47 ksample/s 18.41 kbase/s) lr=1.57e-03  22.6% chunks filtered
..................................................    56 0.10258 0.09721  117.70s (163.23 ksample/s 18.10 kbase/s) lr=1.57e-03  22.6% chunks filtered
..................................................    57 0.10468 0.09874  118.00s (162.63 ksample/s 18.08 kbase/s) lr=1.57e-03  22.6% chunks filtered
..................................................    58 0.10292 0.09673  120.48s (159.47 ksample/s 17.73 kbase/s) lr=1.57e-03  22.6% chunks filtered
..................................................    59 0.10120 0.09683  116.46s (164.98 ksample/s 18.34 kbase/s) lr=1.57e-03  22.6% chunks filtered
.................................................C    60 0.10291 0.09715  111.07s (172.95 ksample/s 19.24 kbase/s) lr=1.57e-03  22.6% chunks filtered
..................................................    61 0.10265 0.09764  117.70s (163.35 ksample/s 18.17 kbase/s) lr=1.57e-03  22.6% chunks filtered
..................................................    62 0.10124 0.09724  118.02s (162.72 ksample/s 18.12 kbase/s) lr=1.57e-03  22.6% chunks filtered
..................................................    63 0.10444 0.09788  116.20s (165.41 ksample/s 18.37 kbase/s) lr=1.56e-03  22.6% chunks filtered
..................................................    64 0.10290 0.09740  115.10s (166.85 ksample/s 18.55 kbase/s) lr=1.56e-03  22.6% chunks filtered
..................................................    65 0.10396 0.09741  116.88s (164.25 ksample/s 18.32 kbase/s) lr=1.56e-03  22.6% chunks filtered
..................................................    66 0.10418 0.09737  120.62s (159.13 ksample/s 17.71 kbase/s) lr=1.56e-03  22.6% chunks filtered
..................................................    67 0.10352 0.09803  113.83s (168.62 ksample/s 18.76 kbase/s) lr=1.56e-03  22.6% chunks filtered
..................................................    68 0.10091 0.09806  114.76s (167.36 ksample/s 18.64 kbase/s) lr=1.56e-03  22.6% chunks filtered
..................................................    69 0.10277 0.09752  109.89s (174.39 ksample/s 19.44 kbase/s) lr=1.56e-03  22.6% chunks filtered
..................................................    70 0.10152 0.09761  121.69s (157.95 ksample/s 17.54 kbase/s) lr=1.56e-03  22.6% chunks filtered
..................................................    71 0.10293 0.09910  112.92s (169.99 ksample/s 18.91 kbase/s) lr=1.55e-03  22.7% chunks filtered
..................................................    72 0.10342 0.09704  115.96s (165.39 ksample/s 18.41 kbase/s) lr=1.55e-03  22.7% chunks filtered
..................................................    73 0.10179 0.09838  112.01s (171.42 ksample/s 19.05 kbase/s) lr=1.55e-03  22.6% chunks filtered
..................................................    74 0.10373 0.09874  111.87s (171.61 ksample/s 19.09 kbase/s) lr=1.55e-03  22.6% chunks filtered
..................................................    75 0.10389 0.09729  110.51s (173.63 ksample/s 19.31 kbase/s) lr=1.55e-03  22.6% chunks filtered
..................................................    76 0.10115 0.09802  116.14s (165.44 ksample/s 18.41 kbase/s) lr=1.55e-03  22.6% chunks filtered
..................................................    77 0.10189 0.09826  121.15s (158.53 ksample/s 17.68 kbase/s) lr=1.55e-03  22.6% chunks filtered
..................................................    78 0.10195 0.09802  115.58s (165.89 ksample/s 18.42 kbase/s) lr=1.54e-03  22.6% chunks filtered
..................................................    79 0.10193 0.09749  120.47s (159.53 ksample/s 17.80 kbase/s) lr=1.54e-03  22.6% chunks filtered
.................................................C    80 0.10151 0.09734  121.18s (158.53 ksample/s 17.65 kbase/s) lr=1.54e-03  22.6% chunks filtered
..................................................    81 0.10123 0.09833  114.94s (166.94 ksample/s 18.59 kbase/s) lr=1.54e-03  22.6% chunks filtered
..................................................    82 0.10002 0.09774  122.28s (157.09 ksample/s 17.51 kbase/s) lr=1.54e-03  22.6% chunks filtered
..................................................    83 0.10148 0.09776  120.81s (158.98 ksample/s 17.69 kbase/s) lr=1.54e-03  22.6% chunks filtered
..................................................    84 0.10369 0.09914  118.36s (162.27 ksample/s 18.08 kbase/s) lr=1.54e-03  22.7% chunks filtered
..................................................    85 0.10256 0.09831  118.93s (161.22 ksample/s 17.93 kbase/s) lr=1.53e-03  22.7% chunks filtered
..................................................    86 0.10094 0.09856  117.98s (162.84 ksample/s 18.06 kbase/s) lr=1.53e-03  22.7% chunks filtered
..................................................    87 0.10262 0.09856  115.67s (165.98 ksample/s 18.48 kbase/s) lr=1.53e-03  22.7% chunks filtered
..................................................    88 0.10268 0.09754  114.59s (167.54 ksample/s 18.67 kbase/s) lr=1.53e-03  22.6% chunks filtered
..................................................    89 0.10122 0.09729  115.43s (166.36 ksample/s 18.49 kbase/s) lr=1.53e-03  22.6% chunks filtered
..................................................    90 0.10119 0.09826  117.85s (162.79 ksample/s 18.10 kbase/s) lr=1.53e-03  22.6% chunks filtered
..................................................    91 0.10304 0.09846  113.75s (168.66 ksample/s 18.74 kbase/s) lr=1.52e-03  22.6% chunks filtered
..................................................    92 0.10238 0.09824  112.37s (171.00 ksample/s 19.04 kbase/s) lr=1.52e-03  22.6% chunks filtered
..................................................    93 0.10246 0.09978  113.78s (168.81 ksample/s 18.80 kbase/s) lr=1.52e-03  22.6% chunks filtered
..................................................    94 0.10248 0.09794  116.51s (164.81 ksample/s 18.42 kbase/s) lr=1.52e-03  22.6% chunks filtered
..................................................    95 0.10194 0.09909  116.15s (165.32 ksample/s 18.43 kbase/s) lr=1.52e-03  22.6% chunks filtered
..................................................    96 0.10665 0.10063  112.00s (171.39 ksample/s 19.09 kbase/s) lr=1.51e-03  22.6% chunks filtered
..................................................    97 0.10250 0.09883  119.84s (160.29 ksample/s 17.83 kbase/s) lr=1.51e-03  22.6% chunks filtered
..................................................    98 0.10183 0.09980  121.37s (158.07 ksample/s 17.59 kbase/s) lr=1.51e-03  22.6% chunks filtered
..................................................    99 0.10106 0.09903  123.88s (155.04 ksample/s 17.33 kbase/s) lr=1.51e-03  22.6% chunks filtered
.................................................C   100 0.10287 0.09871  116.61s (164.78 ksample/s 18.32 kbase/s) lr=1.51e-03  22.6% chunks filtered
..................................................   101 0.09959 0.09830  115.93s (165.56 ksample/s 18.41 kbase/s) lr=1.51e-03  22.6% chunks filtered
..................................................   102 0.10232 0.10040  116.30s (165.17 ksample/s 18.37 kbase/s) lr=1.50e-03  22.6% chunks filtered
..................................................   103 0.10204 0.09922  113.61s (168.76 ksample/s 18.80 kbase/s) lr=1.50e-03  22.6% chunks filtered
..................................................   104 0.10118 0.10029  115.30s (166.43 ksample/s 18.55 kbase/s) lr=1.50e-03  22.6% chunks filtered
..................................................   105 0.10316 0.09928  119.12s (161.12 ksample/s 17.96 kbase/s) lr=1.50e-03  22.7% chunks filtered
..................................................   106 0.10159 0.09964  114.67s (167.49 ksample/s 18.65 kbase/s) lr=1.50e-03  22.6% chunks filtered
..................................................   107 0.10175 0.09931  115.64s (165.94 ksample/s 18.48 kbase/s) lr=1.49e-03  22.6% chunks filtered
..................................................   108 0.10043 0.09866  125.75s (152.72 ksample/s 17.02 kbase/s) lr=1.49e-03  22.7% chunks filtered
..................................................   109 0.10020 0.09867  118.72s (161.77 ksample/s 18.02 kbase/s) lr=1.49e-03  22.7% chunks filtered
..................................................   110 0.10340 0.09991  116.44s (164.87 ksample/s 18.40 kbase/s) lr=1.49e-03  22.7% chunks filtered
..................................................   111 0.10187 0.09888  120.00s (160.10 ksample/s 17.83 kbase/s) lr=1.49e-03  22.7% chunks filtered
..................................................   112 0.10131 0.09886  116.21s (165.18 ksample/s 18.35 kbase/s) lr=1.48e-03  22.7% chunks filtered
..................................................   113 0.10127 0.09878  119.95s (159.88 ksample/s 17.79 kbase/s) lr=1.48e-03  22.7% chunks filtered
..................................................   114 0.10092 0.09867  116.38s (164.89 ksample/s 18.39 kbase/s) lr=1.48e-03  22.7% chunks filtered
..................................................   115 0.10235 0.09870  117.45s (163.42 ksample/s 18.19 kbase/s) lr=1.48e-03  22.7% chunks filtered
..................................................   116 0.10141 0.09837  118.39s (162.21 ksample/s 18.07 kbase/s) lr=1.47e-03  22.7% chunks filtered
..................................................   117 0.10130 0.09781  121.45s (158.15 ksample/s 17.60 kbase/s) lr=1.47e-03  22.7% chunks filtered
..................................................   118 0.09917 0.09845  121.79s (157.63 ksample/s 17.54 kbase/s) lr=1.47e-03  22.7% chunks filtered
..................................................   119 0.10108 0.09849  118.65s (161.84 ksample/s 17.97 kbase/s) lr=1.47e-03  22.7% chunks filtered
.................................................C   120 0.09964 0.09825  121.11s (158.32 ksample/s 17.58 kbase/s) lr=1.47e-03  22.7% chunks filtered
..................................................   121 0.10007 0.09785  115.92s (165.46 ksample/s 18.38 kbase/s) lr=1.46e-03  22.7% chunks filtered
..................................................   122 0.09985 0.09845  114.33s (168.07 ksample/s 18.72 kbase/s) lr=1.46e-03  22.7% chunks filtered
..................................................   123 0.09904 0.09813  120.89s (158.90 ksample/s 17.63 kbase/s) lr=1.46e-03  22.7% chunks filtered
..................................................   124 0.10266 0.09908  117.09s (163.89 ksample/s 18.23 kbase/s) lr=1.46e-03  22.7% chunks filtered
..................................................   125 0.10281 0.09849  118.34s (162.25 ksample/s 18.06 kbase/s) lr=1.45e-03  22.7% chunks filtered
..................................................   126 0.10141 0.09834  116.42s (164.98 ksample/s 18.35 kbase/s) lr=1.45e-03  22.7% chunks filtered
..................................................   127 0.09954 0.09840  117.82s (162.89 ksample/s 18.10 kbase/s) lr=1.45e-03  22.7% chunks filtered
..................................................   128 0.09923 0.09771  115.27s (166.61 ksample/s 18.57 kbase/s) lr=1.45e-03  22.7% chunks filtered
..................................................   129 0.10268 0.09790  110.59s (173.61 ksample/s 19.28 kbase/s) lr=1.45e-03  22.7% chunks filtered
..................................................   130 0.09932 0.09760  115.02s (166.87 ksample/s 18.58 kbase/s) lr=1.44e-03  22.7% chunks filtered
..................................................   131 0.10253 0.09847  113.70s (168.84 ksample/s 18.83 kbase/s) lr=1.44e-03  22.7% chunks filtered
..................................................   132 0.09926 0.09810  114.94s (166.99 ksample/s 18.55 kbase/s) lr=1.44e-03  22.7% chunks filtered
..................................................   133 0.10006 0.09842  120.04s (160.03 ksample/s 17.81 kbase/s) lr=1.44e-03  22.7% chunks filtered
..................................................   134 0.10196 0.09928  115.63s (166.08 ksample/s 18.52 kbase/s) lr=1.43e-03  22.7% chunks filtered
..................................................   135 0.09793 0.09739  119.16s (161.16 ksample/s 17.89 kbase/s) lr=1.43e-03  22.7% chunks filtered
..................................................   136 0.10120 0.09817  113.37s (169.33 ksample/s 18.89 kbase/s) lr=1.43e-03  22.7% chunks filtered
..................................................   137 0.10041 0.09765  110.80s (173.22 ksample/s 19.23 kbase/s) lr=1.43e-03  22.7% chunks filtered
......Traceback (most recent call last):
  File "/opt/kgapps/taiyaki/bin/train_flipflop.py", line 4, in <module>
    __import__('pkg_resources').run_script('taiyaki==5.1.0', 'train_flipflop.py')
  File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1446, in run_script
    exec(code, namespace, namespace)
  File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/taiyaki-5.1.0-py3.7-linux-x86_64.egg/EGG-INFO/scripts/train_flipflop.py", line 624, in <module>
    main()
  File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/taiyaki-5.1.0-py3.7-linux-x86_64.egg/EGG-INFO/scripts/train_flipflop.py", line 541, in main
    mod_factor_t, calc_grads = True )
  File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/taiyaki-5.1.0-py3.7-linux-x86_64.egg/EGG-INFO/scripts/train_flipflop.py", line 247, in calculate_loss
    outputs, seqs, seqlens, sharpen)
  File "taiyaki/ctc/ctc.pyx", line 88, in taiyaki.ctc.ctc.FlipFlopCRF.forward
  File "taiyaki/ctc/ctc.pyx", line 62, in taiyaki.ctc.ctc.crf_flipflop_grad
AssertionError: Input not finite
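Both runs report "Gradient L2 norm cap will be upper 0.05 quantile of the last 100 norms." One reading of that line is an adaptive clipping scheme: keep a rolling window of the last 100 gradient norms and cap each new gradient at the value that only the top 5% of that window exceeds. The sketch below assumes that reading; the class name AdaptiveNormCap and its simple nearest-rank quantile are illustrative choices, not taiyaki's code.

```python
import math
from collections import deque
from typing import List


class AdaptiveNormCap:
    """Clip gradients to an upper quantile of recently observed L2 norms.

    Hypothetical sketch: window=100 and upper_quantile=0.05 mirror the log
    line, but the quantile estimator here is a simple nearest-rank one.
    """

    def __init__(self, window: int = 100, upper_quantile: float = 0.05) -> None:
        self.norms = deque(maxlen=window)  # rolling history of finite norms
        self.q = 1.0 - upper_quantile      # e.g. the 0.95 quantile

    def cap(self) -> float:
        """Current norm cap; infinite until any norms have been recorded."""
        if not self.norms:
            return math.inf
        ranked = sorted(self.norms)
        idx = min(len(ranked) - 1, math.ceil(self.q * len(ranked)) - 1)
        return ranked[max(idx, 0)]

    def clip(self, grad: List[float]) -> List[float]:
        """Rescale grad if its L2 norm exceeds the cap from previous norms."""
        norm = math.sqrt(sum(g * g for g in grad))
        limit = self.cap()  # cap is based on previously recorded norms only
        if math.isfinite(norm):
            self.norms.append(norm)  # keep NaN/inf out of the history
        if norm > limit and norm > 0:
            scale = limit / norm
            return [g * scale for g in grad]
        return list(grad)
```

In this sketch a gradient containing NaN has a NaN norm, which compares false against any cap, so clipping alone cannot repair a gradient that is already non-finite; it only limits the size of finite spikes.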

Do you have any idea what is going on here, and what we are doing wrong?

Posted by dikkeaap, Oct 09 '20 11:10

This is an area of active research internally. Currently, the best solution/workaround is to decrease --lr_max and increase --niteration (and possibly --lr_cosine_iters).
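The logs describe the schedule these flags control: 200 warm-up batches at 0.0001, then a cosine decay "from lr_max to lr_min" over --lr_cosine_iters iterations. The sketch below assumes the standard cosine-annealing formula with the cosine phase starting after warm-up; learning_rate is a hypothetical helper, and lr_min = 1e-4 is an assumption, although with it the sketch appears to reproduce the lr= values printed in the second log to the precision shown.

```python
import math


def learning_rate(it: int, lr_max: float, lr_min: float, cosine_iters: int,
                  warmup_iters: int = 200, warmup_lr: float = 1e-4) -> float:
    """Constant warm-up, then cosine decay from lr_max down to lr_min.

    Assumed form: the cosine phase starts after the warm-up batches and
    stays at lr_min once cosine_iters iterations have elapsed.
    """
    if it < warmup_iters:
        return warmup_lr
    t = min(it - warmup_iters, cosine_iters)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / cosine_iters))


if __name__ == "__main__":
    # Settings from the failing run vs. a gentler, hypothetical alternative
    for lr_max, iters in [(0.0016, 30000), (0.0008, 60000)]:
        rates = [round(learning_rate(i, lr_max, 1e-4, iters), 6) for i in (0, 200, 5000, 20000)]
        print(lr_max, iters, rates)
```

Lowering lr_max shrinks every post-warm-up step, and lengthening the cosine period makes the decay slower, so raising --niteration alongside it keeps the total amount of training comparable.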

Posted by marcus1487, Oct 09 '20 21:10

I just generated the same exception using the example training set:

  https://s3-eu-west-1.amazonaws.com/ont-research/taiyaki_walkthrough.tar.gz

while following the taiyaki walkthrough instructions:

  https://github.com/nanoporetech/taiyaki/blob/master/docs/walkthrough.rst

though it took 556 iterations before it failed.

FWIW.

Posted by SCDealy, Nov 02 '20 20:11