Clarify the Hugging Face dataset upload flow
We've noticed a lack of clarity in the Hugging Face upload flow as devs contribute their datasets. Notes:
- hf slug. Some thought it was `nemo-gym`; it should be an alphanumeric string. We should provide it or guide the user on how to find it.
- outdated validation keys/props when submitting a training split
- confusion about passing the `resource_config_path`
- confusion about what `dataset_name` to pass (if at all)
- overall what parameters to pass in and when
- what `resource_config_fpath` to pass in when the dataset is a blend of other datasets?
With this becoming an increasingly common workflow, we should write an official tutorial to make the process more discoverable and detailed.
Example excerpt:
Uploading local datasets to HF. Ensure your HF token has permissions for reading and writing to collections!
ng_upload_dataset_to_hf \
+hf_organization=nvidia \
+hf_collection_name="NeMo Gym" \
+hf_collection_slug=68d1e0902765fbacc937bb4f \
+dataset_name=workplace_assistant \
+input_jsonl_fpath=data/workplace_assistant/train.jsonl \
+resource_config_path=resources_servers/workplace_assistant/configs/workplace_assistant.yaml \
+split=train
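
On the slug point above, one possible way to guide users is a small helper for the tutorial. This is a rough sketch, not an existing NeMo Gym utility: `collection_id_from_slug` is a hypothetical name, and it assumes that the value expected by `+hf_collection_slug` is the trailing ID of the collection URL (`https://huggingface.co/collections/<owner>/<title>-<id>`) and that IDs look like the 24-character hex value in the excerpt. The full slug used in the usage note is illustrative, assembled from the organization and ID shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of NeMo Gym): extract the trailing ID from a
# Hugging Face collection URL or an "<owner>/<title>-<id>" slug.
# Assumption: the ID is 24 lowercase hex characters, like the example above.
collection_id_from_slug() {
    # Drop everything up to and including the last "-".
    # If there is no "-", the input is kept unchanged.
    id="${1##*-}"

    # Reject anything containing a non-hex character (e.g. "nemo-gym" -> "gym").
    case "$id" in
        ''|*[!0-9a-f]*)
            echo "no collection ID found in: $1" >&2
            return 1
            ;;
    esac

    # Reject IDs of the wrong length.
    if [ "${#id}" -ne 24 ]; then
        echo "unexpected ID length in: $1" >&2
        return 1
    fi

    printf '%s\n' "$id"
}
```

Usage would look like `+hf_collection_slug="$(collection_id_from_slug 'nvidia/nemo-gym-68d1e0902765fbacc937bb4f')"`. Passing `nemo-gym` on its own fails with an error instead of silently reaching the upload step.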