sample-factory
[question] How to run evals during training?
I am trying to figure out how I could run quick evals during training.
I saw an old issue about creating an eval worker using the current infrastructure, but that would most likely take me quite some time to implement.
I was wondering if there was a quick and dirty way I could do this?
For instance is there a way I could access the policy from within the environment, and thus run evals in the reset function?
I could also run a bash loop that alternates between the training script and the eval script, but reinitializing the training script every 1M steps, although automated, would probably be quite a slowdown.
Thanks!
Hi @nathanlct, I'd say the easiest way is to modify the actor worker class to make one of them "special" in some way.
I.e. you can make actor worker #0 into the evaluation worker: in `actor_worker.py`, just check if `worker_idx == 0`, add whatever special environment settings you need, and disable sending worker #0 experience for learning. I think a hacky implementation can be put together in a few hours :) I am happy to help if you run into issues!
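For illustration, a rough sketch of what that hack might look like (the attribute and method names here are approximate and should be checked against the actual `ActorWorker` code in your version of sample-factory; `env_eval_mode` is a made-up config flag):

```python
# Illustrative sketch only -- names are approximate, check the real ActorWorker
# in sample_factory/algorithms/appo/actor_worker.py.
import copy


class ActorWorker:
    def __init__(self, cfg, worker_idx, *args, **kwargs):
        self.worker_idx = worker_idx
        self.is_eval_worker = worker_idx == 0  # make worker #0 the evaluation worker

        if self.is_eval_worker:
            # tweak the config before the VectorEnvRunner is created so the envs
            # get whatever special evaluation settings you need
            cfg = copy.deepcopy(cfg)
            cfg.env_eval_mode = True  # hypothetical flag that your env factory reads

        self.cfg = cfg
        # ... rest of the original __init__ (creates the VectorEnvRunner, etc.) ...

    def _enqueue_complete_rollouts(self, complete_rollouts):
        # method name is approximate; the point is to skip the code path that
        # puts rollouts into the learner queues when this is the eval worker
        if self.is_eval_worker:
            return
        # ... original code that sends rollouts to self.learner_queues ...
```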
Also, I'm considering this feature for Sample Factory 2.0 since it was requested more than once!
@alex-petrenko Thank you for the answer! That's very appreciated.
I am trying to put something together but am still struggling to understand how the code works.
So from what you said, I'm understanding the following:
- if `worker_idx == 0`, modify `cfg` in the `__init__` before the `VectorEnvRunner` is created, so that the environment will be configured the way I want it for my reward
- if `worker_idx == 0`, comment out the line where data is put into the `learner_queues`
And I guess I should keep the rest of the structure as is? The other option (since I don't need performance for eval, just a few seconds of simulating one single env every couple million steps) would be to create my own instance of my Env in that eval `ActorWorker`, but then I would need a way to directly access the policy and query it, which I'm not sure plays well with the current code structure.
Assuming I'm keeping the structure, I have two follow-up questions on point 1.
- Would you have some insight into how I should go about reporting a couple of metrics/plots at the end of my environment? Should my environment export metrics in the info dict at the very last step of the horizon, or can I have access to my environment(s) from the worker? I am seeing some things happening with a `report_queue` so I suppose that's what I should use, though I don't fully understand how it works.
- What if I want to evaluate my env with different settings (i.e. with several different `cfg`s)? Of course I could implement that in the env, but I was wondering if you thought of something easier that I could have misunderstood.
Thanks! I'll keep investigating this in the meantime but I figured I'd better ask first before losing a lot of time, in case this is trivial for you.
Also realizing now that since I barely need any performance for eval, another solution could be to spawn a subprocess (at the start of training) that runs my eval (enjoy) script every n seconds, and that script logs values to wandb. So I guess that's another hacky way to create an eval worker haha
Hi @nathanlct! Sorry for the delay.
> I am trying to put something together but am still struggling to understand how the code works. So from what you said, I'm understanding the following:
>
> - if `worker_idx == 0`, modify `cfg` in the `__init__` before the `VectorEnvRunner` is created, so that the environment will be configured the way I want it for my reward
> - if `worker_idx == 0`, comment out the line where data is put into the `learner_queues`
Yes, this is roughly what I meant :) You might need to do a bit of extra work to have separate evaluation summaries on your tensorboard / wandb.
> And I guess I should keep the rest of the structure as is? The other option (since I don't need performance for eval, just a few seconds of simulating one single env every couple million steps) would be to create my own instance of my Env in that eval `ActorWorker`, but then I would need a way to directly access the policy and query it, which I'm not sure plays well with the current code structure.
This is also an option. But consider that by using the existing architecture you can evaluate on more environments/episodes, thus getting more accurate estimates with smaller variance. Performance is always good! :)
> Assuming I'm keeping the structure, I have two follow-up questions on point 1.
>
> Would you have some insight into how I should go about reporting a couple of metrics/plots at the end of my environment? Should my environment export metrics in the info dict at the very last step of the horizon, or can I have access to my environment(s) from the worker? I am seeing some things happening with a `report_queue` so I suppose that's what I should use, though I don't fully understand how it works.
Workers communicate by sending messages to queues. Summaries are reported in the main loop, but your evaluation worker is a separate process, so you will indeed need to send a message to the report queue. This function processes the messages: https://github.com/alex-petrenko/sample-factory/blob/6671a11cede229d37ea4f88cc17d5b6fb2494fb1/sample_factory/algorithms/appo/appo.py#L520 It is a bit hacky but should be very straightforward to modify!
I think the easiest way to figure everything out is to run the code under a debugger and set a breakpoint to see what is being written into the queue. I suggest that you return a bunch of values in `info['episode_extra_stats']` (see here: https://github.com/alex-petrenko/sample-factory/blob/6671a11cede229d37ea4f88cc17d5b6fb2494fb1/sample_factory/algorithms/appo/actor_worker.py#L204). I believe anything that you pass there on the last step of your episode will automatically be added to Tensorboard/Wandb.
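As a concrete (hypothetical) example, a small wrapper around your env could collect whatever you want to track and put it into `info['episode_extra_stats']` on the terminal step; the wrapper and metric names below are made up, and it assumes the old gym 4-tuple `step` API used by sample-factory 1.x:

```python
import gym


class EvalStatsWrapper(gym.Wrapper):
    """Hypothetical wrapper that reports custom eval metrics via episode_extra_stats."""

    def __init__(self, env):
        super().__init__(env)
        self._success = 0.0  # made-up per-episode metric

    def reset(self, **kwargs):
        self._success = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._success = max(self._success, info.get('success', 0.0))
        if done:
            # anything placed here on the last step of the episode is picked up by
            # the actor worker and added to the Tensorboard/Wandb summaries
            info['episode_extra_stats'] = dict(eval_success=self._success)
        return obs, reward, done, info
```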
> What if I want to evaluate my env with different settings (i.e. with several different `cfg`s)? Of course I could implement that in the env, but I was wondering if you thought of something easier that I could have misunderstood.
That's a good question... I think I would either create multiple envs on the same eval worker with different cfgs, or you can have multiple eval workers if you want. If you modify the summary keys in `episode_extra_stats` for each configuration, you should be able to easily report all of your configurations. For example, here https://github.com/alex-petrenko/sample-factory/blob/0bfa7e0bedde2419b56fe12ea72ea73f4b1149b7/sample_factory/envs/dmlab/wrappers/reward_shaping.py#L36 we use these stats to report results on different levels in DMLab-30.
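Continuing the hypothetical wrapper sketched above, the terminal-step reporting could simply be keyed by a made-up `eval_cfg_name` attribute that identifies which eval settings this env instance was built with:

```python
# Continuation of the EvalStatsWrapper sketch: inside step(), on the terminal step,
# namespace the stats by the evaluation configuration of this env instance.
if done:
    prefix = f'eval_{self.eval_cfg_name}'  # hypothetical, e.g. 'eval_easy' / 'eval_hard'
    info['episode_extra_stats'] = {f'{prefix}_success': self._success}
```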
> Also realizing now that since I barely need any performance for eval, another solution could be to spawn a subprocess (at the start of training) that runs my eval (enjoy) script every n seconds, and that script logs values to wandb. So I guess that's another hacky way to create an eval worker haha
I guess you can; you could even completely separate it from the main training and just load the latest checkpoint from the filesystem. Hacky indeed, probably slow, but if it works, it works! :)
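If you go that route, a minimal sketch could look something like this (the script name, arguments, and interval are placeholders for your own enjoy/eval setup):

```python
import subprocess
import time

# Placeholder command and interval: substitute your own enjoy/eval script and args.
EVAL_CMD = ['python', 'my_enjoy_script.py', '--experiment', 'my_experiment']
EVAL_INTERVAL_SECONDS = 600

while True:
    # the eval script loads the latest checkpoint from the experiment directory
    # and logs whatever it computes (e.g. to wandb) on its own
    subprocess.run(EVAL_CMD, check=False)
    time.sleep(EVAL_INTERVAL_SECONDS)
```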