char-rnn

Should init_from parameter start train.lua from 1 if multiple checkpoints exist?

Open davidlfox opened this issue 10 years ago • 5 comments

Here is what I did (roughly the commands sketched below):

  • ran train.lua for a couple of hours, generating 10 checkpoints
  • stopped the process
  • ran train.lua again with the init_from parameter pointing at the latest checkpoint file

When the process restarted, it started counting up from 1. Is init_from not meant as a "pause" kind of parameter for cases where I need to shut down my box between training runs? Or is the counting just incorrect?
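For reference, a minimal sketch of that workflow, assuming the default checkpoint directory cv/; the checkpoint filename is a placeholder, since the real name depends on the savefile prefix, epoch, and validation loss:

```sh
# initial training run; checkpoints are written to cv/ by default
th train.lua -data_dir data/tinyshakespeare -checkpoint_dir cv

# ...stop the process (Ctrl+C, power off the box, etc.)...

# restart, initializing the network from the latest checkpoint
# (cv/lm_lstm_epochXX_YY.t7 is a placeholder for the newest file in cv/)
th train.lua -data_dir data/tinyshakespeare -init_from cv/lm_lstm_epochXX_YY.t7
```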

davidlfox avatar Jun 14 '15 19:06 davidlfox

I did see #33, but it doesn't specifically mention the displayed iteration count.

davidlfox avatar Jun 14 '15 19:06 davidlfox

Hi @davidlfox, what you're observing is the current intended behavior, hence the name init_from rather than resume_from. The issue is that resuming precisely would be a bit tricky (e.g. the state of the optimizer would have to be saved in each checkpoint too), and one might not necessarily want this. Hmm, I'm not sure about this. Do you have a strong use case for exactly resuming?

EDIT: I agree that this should probably exist. Thinking about the API.
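A rough sketch of what saving and restoring that extra state could look like, assuming train.lua's rmsprop-style training loop; the checkpoint fields iteration and optim_state are hypothetical additions, and the surrounding names (protos, vocab, optim_state, savefile) follow train.lua, so treat this as a sketch rather than a drop-in patch:

```lua
-- when writing a checkpoint: also store the optimizer state and iteration count.
-- optim.rmsprop keeps its running averages inside the state table it is given,
-- so saving that table is enough to capture the optimizer state.
local checkpoint = {}
checkpoint.protos = protos
checkpoint.opt = opt
checkpoint.vocab = vocab
checkpoint.iteration = i              -- hypothetical field: global iteration counter
checkpoint.optim_state = optim_state  -- hypothetical field: rmsprop state table
torch.save(savefile, checkpoint)

-- when resuming via -init_from: restore them instead of starting fresh
local start_iteration = 1
if string.len(opt.init_from) > 0 then
    local loaded = torch.load(opt.init_from)
    protos = loaded.protos
    if loaded.iteration then start_iteration = loaded.iteration + 1 end
    if loaded.optim_state then optim_state = loaded.optim_state end
end
```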

karpathy avatar Jun 14 '15 21:06 karpathy

I don't know if it's strong, but my use case is just as I described: powering off a box in the middle of a long training.

davidlfox avatar Jun 15 '15 19:06 davidlfox

+1 for having a resume_from feature (I have the same use-case: powering off a box in the middle of training).

R-Gerard avatar Nov 09 '15 03:11 R-Gerard

A flag to enable the full saving of state might be the best route.
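Something along these lines, where -save_optim_state is a hypothetical flag name and cmd is the torch.CmdLine parser train.lua already uses:

```lua
-- hypothetical flag: opt in to saving full optimizer state with each checkpoint
cmd:option('-save_optim_state', 0, 'if 1, also store optimizer state in each checkpoint so training can be resumed exactly')

-- later, when a checkpoint is written
if opt.save_optim_state == 1 then
    checkpoint.optim_state = optim_state  -- see the sketch in the earlier comment
end
```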

whackashoe avatar Dec 14 '15 10:12 whackashoe