verifiers
add GymEnv
Description
Adds GymEnv, an optional Environment subclass that runs classic reset/step simulator loops. This lets you plug Gym-style environments directly into verifiers (including custom simulators defined inside this repo) without first converting them into a static dataset.
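For readers unfamiliar with the reset/step pattern, here is a minimal sketch of the kind of simulator loop this refers to. It does not use `GymEnv` or any verifiers API. `GuessNumberEnv`, `make_binary_search_policy`, and `run_episode` are hypothetical names invented for illustration, and the five-value `step` return follows the Gymnasium convention, which is an assumption about what "Gym-style" means here.

```python
import random
from typing import Any, Callable, Optional


class GuessNumberEnv:
    """Toy text simulator with a Gym-style reset/step interface (hypothetical)."""

    def __init__(self, low: int = 1, high: int = 10, max_turns: int = 5) -> None:
        self.low, self.high, self.max_turns = low, high, max_turns
        self.target = low
        self.turns = 0

    def reset(self, seed: Optional[int] = None, **kwargs: Any):
        # Sample a fresh hidden state; the seed makes episodes reproducible.
        self.target = random.Random(seed).randint(self.low, self.high)
        self.turns = 0
        return f"Guess a number between {self.low} and {self.high}.", {}

    def step(self, action: str):
        # Returns (observation, reward, terminated, truncated, info).
        self.turns += 1
        guess = int(action)
        if guess == self.target:
            return "Correct!", 1.0, True, False, {}
        hint = "higher" if guess < self.target else "lower"
        truncated = self.turns >= self.max_turns
        return f"Try {hint}.", 0.0, False, truncated, {}


def make_binary_search_policy(low: int, high: int) -> Callable[[str], str]:
    """Stand-in for a model: narrows the range based on the last hint."""
    state = {"lo": low, "hi": high, "last": low}

    def policy(observation: str) -> str:
        if "higher" in observation:
            state["lo"] = state["last"] + 1
        elif "lower" in observation:
            state["hi"] = state["last"] - 1
        state["last"] = (state["lo"] + state["hi"]) // 2
        return str(state["last"])

    return policy


def run_episode(env: Any, policy: Callable[[str], str], **reset_kwargs: Any) -> float:
    """Classic loop: reset once, then step until the episode ends."""
    observation, _ = env.reset(**reset_kwargs)
    total_reward, done = 0.0, False
    while not done:
        observation, reward, terminated, truncated, _ = env.step(policy(observation))
        total_reward += reward
        done = terminated or truncated
    return total_reward
```

In this sketch, the keyword arguments passed to `run_episode` play the role that a dataset row's `info` plays when it is forwarded to `reset(**info)`.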
GymEnv supports:
- Homogeneous mode (`env_cls`): one env class with optional `env_kwargs`; dataset rows feed directly into `reset(**info)`.
- Heterogeneous mode (`env_registry`): a registry of env classes; each dataset row specifies `info.env_type` and optional `info.env_kwargs` to select/configure the env per rollout.
- Custom mode (subclass + `_make_env`): full control over environment construction.
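The difference between the homogeneous and heterogeneous modes can be sketched as a small factory function. This is an illustration under stated assumptions, not the PR's implementation: `EchoEnv`, `CountdownEnv`, and `make_env_from_row` are hypothetical, and only the names `env_cls`, `env_kwargs`, `env_registry`, `env_type`, and `info` come from the description above.

```python
from typing import Any, Dict, Optional, Type


class EchoEnv:
    """Hypothetical env: repeats the action back with a prefix."""

    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix

    def reset(self, **kwargs: Any):
        return self.prefix + "ready", {}

    def step(self, action: str):
        return self.prefix + action, 0.0, True, False, {}


class CountdownEnv:
    """Hypothetical env: terminates after `start` steps."""

    def __init__(self, start: int = 3) -> None:
        self.start = start
        self.remaining = start

    def reset(self, **kwargs: Any):
        self.remaining = self.start
        return str(self.remaining), {}

    def step(self, action: str):
        self.remaining -= 1
        return str(self.remaining), 0.0, self.remaining <= 0, False, {}


def make_env_from_row(
    row: Dict[str, Any],
    env_cls: Optional[Type] = None,
    env_kwargs: Optional[Dict[str, Any]] = None,
    env_registry: Optional[Dict[str, Type]] = None,
) -> Any:
    """Build one env instance for a rollout (assumed selection logic)."""
    if env_cls is not None:
        # Homogeneous: every row uses the same class and constructor kwargs.
        return env_cls(**(env_kwargs or {}))
    if env_registry is not None:
        # Heterogeneous: the row's info picks and configures the class.
        info = row.get("info", {})
        env_type = info["env_type"]
        if env_type not in env_registry:
            raise KeyError(f"unknown env_type: {env_type!r}")
        return env_registry[env_type](**info.get("env_kwargs", {}))
    raise ValueError("provide env_cls or env_registry")
```

Custom mode would correspond to replacing this selection logic entirely, which is what overriding `_make_env` in a subclass is described as allowing.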
Additional features:
- automatic dataset generation when none is provided (so RL training “just works”),
- optional evaluation override via a user-supplied `eval_runner`.
GymEnv is fully opt-in and does not affect existing environments.
Type of Change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Test improvement
Testing
- [x] All existing tests pass when running `uv run pytest` locally.
- [x] New tests have been added to cover the changes
Checklist
- [x] My code follows the style guidelines of this project as outlined in AGENTS.md
- [ ] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] Any dependent changes have been merged and published
Additional Notes
- Dataset & trainer integration
  `Environment` requires at least one of `dataset` or `eval_dataset`. `GymEnv` now handles this automatically:
  - If you provide a dataset or eval dataset explicitly, they are used as-is.
  - If you provide no datasets and keep `auto_dummy_eval=True`:
    - We auto-build a training dataset of length `num_train_episodes` via `_build_auto_dataset(...)`.
    - In homogeneous mode (`env_cls` set):
      - `dataset =` this auto dataset.
      - `eval_dataset =` a 1-row dummy eval set, preserving “episodes mode” behavior where `num_examples` maps cleanly to `rollouts_per_example`.
    - In heterogeneous mode (`env_registry` set):
      - We auto-generate rows where each `info` contains a valid `env_type` (round-robin over registry keys) and default `env_kwargs={}`.
      - This auto dataset is used for both `dataset` and `eval_dataset`, ensuring all rows map cleanly into actual env instances.
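The auto-dataset rules above can be sketched in plain Python. This is an assumed reading of the description, not the code behind `_build_auto_dataset(...)`: the function name `build_auto_datasets`, the `example_id` field, and the use of plain lists of dicts instead of a dataset object are all hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Any]


def build_auto_datasets(
    num_train_episodes: int,
    registry_keys: Optional[Sequence[str]] = None,
) -> Tuple[List[Row], List[Row]]:
    """Return (dataset, eval_dataset) rows when the user supplied neither."""
    keys = list(registry_keys or [])
    rows: List[Row] = []
    for i in range(num_train_episodes):
        if keys:
            # Heterogeneous: assign env types round-robin over the registry keys.
            info = {"env_type": keys[i % len(keys)], "env_kwargs": {}}
        else:
            # Homogeneous: the single env class needs no per-row selection.
            info = {}
        rows.append({"example_id": i, "info": info})

    if keys:
        # Heterogeneous: reuse the same rows so every eval row maps to a real env.
        return rows, list(rows)
    # Homogeneous: a 1-row dummy eval set.
    return rows, [{"example_id": 0, "info": {}}]
```

For example, `build_auto_datasets(5, ["a", "b"])` yields training rows whose `env_type` values cycle `a, b, a, b, a`, while `build_auto_datasets(4)` yields four training rows and a single dummy eval row.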
TBD
- Demonstrate an end-to-end RL training run using `RLTrainer` + `GymEnv` (homogeneous and heterogeneous cases) to validate.
nice! would you wanna do the PR on top of the trajectories branch? https://github.com/PrimeIntellect-ai/verifiers/pull/549
there's a bunch of new things which should make native gym-style rollouts much easier, especially for training.
also would be ideal if we still had a notion of a dataset, as this is how trainers typically expect to work with verifiers envs
can be generated at init with a given number of hidden state samples (as in textarena_env)
yes! will do it on top of it. I'll just wait for the other to be finished