fsiino-nvidia
This implements the `ng_status` command, which lists all running servers on the system and pings each one as a health check.
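As a rough illustration of the "list servers and ping for health" idea, here is a minimal sketch using only the Python standard library. It is not the actual `ng_status` implementation: the `ping` and `status_report` helpers, the name-to-URL mapping, and the "healthy"/"unreachable" labels are all hypothetical, chosen only to show the shape of such a check.

```python
import urllib.error
import urllib.request


def ping(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


def status_report(servers: dict[str, str]) -> dict[str, str]:
    """Map each server name to 'healthy' or 'unreachable' (labels are assumptions)."""
    return {
        name: "healthy" if ping(url) else "unreachable"
        for name, url in servers.items()
    }
```

A caller would pass whatever name-to-URL mapping it has discovered for the running servers; how `ng_status` actually discovers them is not described here.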
We've noticed some confusion about the Hugging Face upload flow as devs contribute their datasets. Notes: - HF slug: some thought it was `nemo-gym`; it should be an alphanumeric...
**Describe the bug** External users cannot train using NeMo Gym because GitLab integration is hardcoded as a requirement. The DatasetConfig validator enforces that all train/validation datasets must have a gitlab_identifier,...
This change adds a cleaner way to stop specific running servers via `ng_stop`.
**Describe the bug** The median value in dataset metrics (train_data_utils.py) produces different results on each run, even with identical input data. This causes validation failures when comparing metrics files. The...
https://nvidia.slack.com/archives/C08TG7CLEGY/p1766191655660079 Initially, in #290, `response_class=PlainTextResponse` was added to the `/global_config_dict_yaml` endpoint of the HeadServer in an attempt to debug parsing server info for the `ng_status` command. This led...