deepmind-research
nfnets: training error
First of all, thank you for publishing nfnets. I have started digging into the implementation, and I have some questions.
Unfortunately, I am not able to run experiment.py; I am getting the following error. I am running on just one GPU for testing.
When I run test.py using fake data, it works without any error.
Thank you
With a bit of digging around I managed to get past the error above.
- Add the following line to experiment.py:
if __name__ == '__main__':
  FLAGS(sys.argv)  # <- add this line
  flags.mark_flag_as_required('config')
  platform.main(Experiment, sys.argv[1:])
- Add the following lines in the definition of get_config(), in experiment.py:
  config.save_checkpoint_interval = 60
  config.eval_specific_checkpoint_dir = ''
  config.checkpoint_dir = '/path/'  # <- add this (modify /path/ appropriately)
  config.train_checkpoint_all_hosts = True  # <- and this
  return config
- Run experiment.py with the --config argument, as follows:
python nfnets/experiment.py --config nfnets/experiment.py
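In case it looks odd that the same file is passed to --config: as far as I can tell, the config flag takes a path to a Python file and calls the get_config() function defined in it, and experiment.py happens to define one. Below is a rough, stdlib-only sketch of that idea. The load_config helper, the temporary file, and the SimpleNamespace object are stand-ins I made up for illustration; this is not the actual jaxline or ml_collections code.

```python
import importlib.util
import os
import tempfile

# Hypothetical stand-in for a config file such as nfnets/experiment.py.
# SimpleNamespace replaces the real config object purely for illustration.
CONFIG_SOURCE = '''
from types import SimpleNamespace

def get_config():
    config = SimpleNamespace()
    config.save_checkpoint_interval = 60
    config.eval_specific_checkpoint_dir = ''
    config.checkpoint_dir = '/path/'  # placeholder, as in the steps above
    config.train_checkpoint_all_hosts = True
    return config
'''


def load_config(path):
    """Imports the file at `path` as a module and returns its get_config()."""
    spec = importlib.util.spec_from_file_location('user_config', path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.get_config()


# Write the stand-in config to a temporary file and load it by path,
# similar in spirit to passing `--config path/to/file.py`.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, 'experiment.py')
    with open(config_path, 'w') as f:
        f.write(CONFIG_SOURCE)
    config = load_config(config_path)

print(config.checkpoint_dir, config.train_checkpoint_all_hosts)
```

If this picture is right, the two fields added in get_config() above are simply read from the returned config object at startup, which would explain why they have to be present before training begins.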
The published version of deepmind/jaxline is outdated, perhaps?
PS: Even with this workaround, training halts with a TypeError, but that's yet another issue...