cmunley1

Results 17 comments of cmunley1

Shared the branch and we will test the PR above. Thanks @ZhiyuLi-Nvidia

Thanks both. We are testing this.

Unable to reproduce this customer issue so far.

The latest issue using zhiyul/oom_repro_w_cpu_profiler_optional_rm_data is: training ran for 32 steps then crashed with: ``` (MegatronPolicyWorker[rank=34] pid=1840232, ip=10.5.33.3) [2025-11-25 15:20:28,463 E 1840232 1840232] logging.cc:118: Unhandled exception: N3c105ErrorE. what(): could not...

Colab error without this feature: ``` (Gym) /content/Gym# ng_run "+config_paths=[resources_servers/reasoning_gym/configs/resources_only.yaml]" Starting Ray cluster... 2025-12-19 19:16:42,045 INFO worker.py:2004 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 /content/Gym/.venv/lib/python3.12/site-packages/ray/_private/worker.py:2052: FutureWarning:...

resolved issue using this feature: ``` (Gym) /content/Gym# git checkout remotes/origin/cmunley1/colab Note: switching to 'remotes/origin/cmunley1/colab'. You are in 'detached HEAD' state. You can look around, make experimental changes and commit...

testing `ng_run` with flag on regular (brev) node looks fine: ``` (gym-test-colab) ubuntu@brev-64egbpb2a:~/gym-test-colab$ ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" +uv_pip_set_python=false Starting Ray cluster... 2025-12-19 19:11:13,738 INFO worker.py:2004 -- Started a local Ray instance. View...

another test outside of colab: ``` (gym-test-colab) ubuntu@brev-64egbpb2a:~/gym-test-colab$ ng_run "+config_paths=[resources_servers/instruction_following[2025-12-19 19:33:35,726 E 3584055 3584848] rpc_client.h:201: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either...

might remove berman agent before merging