Eugene Selivonchyk
This is what I got after running gmonitor for the first time: ``` gmonitor -s Failed to connect to Mir: Failed to connect to server socket: No such file or...
feature: configure IF_NAME to be used by SMDATAPARALLEL. Forward the SageMaker network_if_name to SMDATAPARALLEL as an environment variable over the MPI launch. Issue #, if available: Description of changes:...
After auto-tune, I consistently get a BatchPrefillWithPagedKVCacheRun crash with SIGFPE on a B200 (SM10). The same configuration worked on an H200. verl=0.5.0 vllm==0.10.2 flashinfer=0.3.1 ``` Ray worker └─ ray._private.workers.default_worker:main_loop └─ ray._private.function_manager:actor_method_executor └─ verl.single_controller.ray.base:func...