deepmind-research

nfnets: training error

Open purvang3 opened this issue 4 years ago • 1 comments

First of all, thank you for publishing nfnets. I have started digging into the implementation, and I have some questions.

Unfortunately, I am not able to run experiment.py; I get the following error. I am running on just one GPU for testing. [Screenshot: Screen Shot 2021-02-18 at 6 40 53 PM]

When I run test.py using fake data, it works without any error.

Thank you

purvang3, Feb 19 '21 02:02

With a bit of digging around I managed to get past the error above.

  1. Add the following line to experiment.py:
if __name__ == '__main__':
  FLAGS(sys.argv)  # <- add this line
  flags.mark_flag_as_required('config')
  platform.main(Experiment, sys.argv[1:])
  2. Add the following lines in the definition of get_config(), in experiment.py:
  config.save_checkpoint_interval = 60
  config.eval_specific_checkpoint_dir = ''
  config.checkpoint_dir = '/path/'  # <- add this (modify /path/ appropriately)
  config.train_checkpoint_all_hosts = True  # <- and this

  return config
  3. Run experiment.py with --config argument, as follows:
python nfnets/experiment.py --config nfnets/experiment.py
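
The two fields added in get_config() can also be written as a small helper that fills them in only when they are absent. This is a minimal sketch, not part of the repository: `ensure_checkpoint_fields` is a hypothetical name, and `types.SimpleNamespace` stands in for the real config object so the snippet runs without jaxline installed. It assumes the real config supports attribute access in the same way.

```python
from types import SimpleNamespace


def ensure_checkpoint_fields(config, checkpoint_dir):
    """Set the two checkpoint fields from the workaround if they are missing.

    Hypothetical helper; existing values on `config` are left untouched.
    """
    defaults = {
        'checkpoint_dir': checkpoint_dir,
        'train_checkpoint_all_hosts': True,
    }
    for name, value in defaults.items():
        if not hasattr(config, name):
            setattr(config, name, value)
    return config


# Stand-in for the object built inside get_config() (assumption).
config = SimpleNamespace(
    save_checkpoint_interval=60,
    eval_specific_checkpoint_dir='',
)
config = ensure_checkpoint_fields(config, '/path/')  # modify /path/ appropriately
print(config.checkpoint_dir, config.train_checkpoint_all_hosts)
```

Filling in only the missing fields means the helper should stay harmless if a later jaxline release starts defining them itself.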

The published version of deepmind/jaxline is outdated, perhaps?
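
One way to check which jaxline release is actually installed, as a minimal sketch using only the standard library. It assumes the package was installed under the distribution name `jaxline`; `installed_version` is a hypothetical helper name, and an install made directly from a git checkout may report a version string that does not match any published release.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# 'jaxline' is the assumed distribution name; None means it is not installed.
print('jaxline version:', installed_version('jaxline'))
```

Comparing this value with the version stated in the repository's setup files would show whether the two have drifted apart.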

PS: Even with this workaround, training halts with a TypeError, but that's yet another issue...

nss-ysasaki, Apr 23 '21 05:04