fsiino-nvidia

Results: 6 issues by fsiino-nvidia

This implements the `ng_status` command, which lists all running servers on the system and pings each one for a health check.

core-infra
Usability

We've noticed some lack of clarity in the Hugging Face upload flow as devs contribute their datasets. Notes: - HF slug: some thought it was `nemo-gym`; it should be an alphanumeric...

documentation

**Describe the bug** External users cannot train using NeMo Gym because GitLab integration is hardcoded as a requirement. The DatasetConfig validator enforces that all train/validation datasets must have a gitlab_identifier,...

core-infra

This change adds a cleaner way to stop specific running servers via `ng_stop`.

core-infra
Usability

**Describe the bug** The median value in dataset metrics (train_data_utils.py) produces different results on each run, even with identical input data. This causes validation failures when comparing metrics files. The...

core-infra

https://nvidia.slack.com/archives/C08TG7CLEGY/p1766191655660079 Initially, in #290, `response_class=PlainTextResponse` was added to the `/global_config_dict_yaml` endpoint of the HeadServer as an attempt to debug parsing server info for the `ng_status` command. This led...