deepmind-research nfnets: training error

Open purvang3 opened this issue 4 years ago • 1 comments

First of all, thank you for publishing nfnets. I have started digging deep into the implementation, and I have some questions.

Unfortunately, I am not able to run experiment.py; I am getting the following error. I am running on just one GPU for testing. [Screenshot: Screen Shot 2021-02-18 at 6 40 53 PM]

When I run test.py using fake data, it works without any error.

Thank you

Feb 19 '21 02:02 purvang3

With a bit of digging around I managed to get past the error above.

Add the following line to experiment.py:

if __name__ == '__main__':
  FLAGS(sys.argv) # <- add this line
  flags.mark_flag_as_required('config')
  platform.main(Experiment, sys.argv[1:])

Add the following lines in the definition of get_config(), in experiment.py:

  config.save_checkpoint_interval = 60
  config.eval_specific_checkpoint_dir = ''
  config.checkpoint_dir = '/path/' # <- add this (modify /path/ appropriately)
  config.train_checkpoint_all_hosts = True # <- and this

  return config

Run experiment.py with the --config argument, as follows:

python nfnets/experiment.py --config nfnets/experiment.py
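The screenshot is not reproduced here, so the exact error is unknown. One plausible reason the FLAGS(sys.argv) line helps is that absl refuses to hand out a flag value before the flags have been parsed. Here is a minimal sketch of that behaviour, assuming absl-py is installed. The names demo_flags and read_config_flag are hypothetical and not part of nfnets or jaxline, and the flag is defined as a plain string here purely for illustration.

```python
from absl import flags

# Hypothetical private FlagValues, so this sketch never touches the global FLAGS.
demo_flags = flags.FlagValues()
flags.DEFINE_string(
    'config', None, 'Path to a config file.', flag_values=demo_flags)


def read_config_flag(argv):
    """Parses argv (program name first) and returns the value of --config."""
    demo_flags(argv)  # Same call shape as FLAGS(sys.argv) in the workaround.
    return demo_flags.config


# Accessing a flag before parsing raises UnparsedFlagAccessError.
try:
    demo_flags.config
    parsed_early = True
except flags.UnparsedFlagAccessError:
    parsed_early = False
```

Calling read_config_flag(['prog', '--config=nfnets/experiment.py']) parses the list and then returns the flag value, which is roughly what the added line does for the real flags before platform.main runs.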

The published version of deepmind/jaxline is outdated, perhaps?

PS: Even with this workaround, training halts with a TypeError, but that's yet another issue...

Apr 23 '21 05:04 nss-ysasaki
