Clarify the Hugging Face dataset upload flow

Open fsiino-nvidia opened this issue 2 weeks ago • 0 comments

We've noticed that the Hugging Face upload flow is unclear to devs contributing their datasets. Notes:

HF collection slug (hf_collection_slug): some thought it was nemo-gym, but it should be an alphanumeric string such as 68d1e0902765fbacc937bb4f. We should provide it or guide the user on how to find it.
Outdated validation keys/props when submitting a training split.
Confusion about passing the resource_config_path.
Confusion about what dataset_name to pass (if any).
Overall, which parameters to pass in and when.
Which resource_config_fpath to pass in when the dataset is a blend of other datasets?

With this becoming an increasingly common workflow, we should write an official tutorial to make the process more discoverable and detailed.

Example excerpt:

Uploading local datasets to HF. Ensure your HF token has permissions for reading and writing to collections!

ng_upload_dataset_to_hf \
    +hf_organization=nvidia \
    +hf_collection_name="NeMo Gym" \
    +hf_collection_slug=68d1e0902765fbacc937bb4f \
    +dataset_name=workplace_assistant \
    +input_jsonl_fpath=data/workplace_assistant/train.jsonl \
    +resource_config_path=resources_servers/workplace_assistant/configs/workplace_assistant.yaml \
    +split=train
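
One way to guide users to the right hf_collection_slug value is a small helper that pulls the trailing ID out of a collection reference. The sketch below is hypothetical and not part of NeMo Gym. It assumes that Hugging Face collection URLs end in `<title>-<id>`, and that the ID is a 24-character lowercase hex string like the one in the command above. The URL in the usage line is illustrative only.

```shell
# Hypothetical helper (not part of NeMo Gym): print the trailing collection ID
# from a Hugging Face collection URL, a full "<owner>/<title>-<id>" slug,
# or a bare ID.
extract_collection_id() {
    _ref="${1%/}"        # drop a trailing slash, if any
    _id="${_ref##*-}"    # keep everything after the last hyphen
    # Assumption: the ID is 24 lowercase hex characters, as in the example above.
    if printf '%s' "$_id" | grep -Eq '^[0-9a-f]{24}$'; then
        printf '%s\n' "$_id"
    else
        echo "no collection ID found in: $1" >&2
        return 1
    fi
}

# Usage with an illustrative URL (assumed, not confirmed by this issue):
extract_collection_id "https://huggingface.co/collections/nvidia/nemo-gym-68d1e0902765fbacc937bb4f"
```

A title-only value such as nemo-gym fails the check, which catches the mistake described in the notes before the upload command runs.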

Dec 12 '25 20:12 fsiino-nvidia