datasets
Document better that loading a dataset passing its name does not use the local script
As reported by @TrentBrick here https://github.com/huggingface/datasets/issues/4725#issuecomment-1191858596, it could be clearer that loading a dataset by passing its name does not use its (modified) local script.
What he did:
- he installed `datasets` from source
- he locally modified the `datasets/the_pile/the_pile.py` loading script
- he tried to load it, but using `load_dataset("the_pile")` instead of `load_dataset("datasets/the_pile")`
  - as explained here https://github.com/huggingface/datasets/issues/4725#issuecomment-1191040245:
    - the former does not use the local script, but instead it downloads a copy of `the_pile.py` from our GitHub, caches it locally (inside `~/.cache/huggingface/modules`) and uses that.
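To make the difference between the two calls concrete, here is a minimal sketch of a check one could run before calling `load_dataset`. The helper `find_local_script` is hypothetical and is not part of `datasets`; it only approximates the idea that a path to a script, or to a directory containing a same-named script, is treated as local, under the assumption that the current working directory is the root of the cloned repository. The library's actual resolution logic may differ.

```python
import os
from typing import Optional


def find_local_script(path: str) -> Optional[str]:
    """Hypothetical helper (not part of `datasets`).

    Returns the absolute path of the local loading script that `path`
    appears to point to, or None if no such script exists relative to
    the current working directory.
    """
    # Case 1: the argument is itself a path to an existing .py script.
    if path.endswith(".py") and os.path.isfile(path):
        return os.path.abspath(path)

    # Case 2: the argument is a directory holding a script with the same
    # base name, e.g. "datasets/the_pile" -> "datasets/the_pile/the_pile.py".
    name = os.path.basename(os.path.normpath(path))
    candidate = os.path.join(path, name + ".py")
    if os.path.isfile(candidate):
        return os.path.abspath(candidate)

    # Otherwise nothing local matches. Per the issue, a bare name such as
    # "the_pile" then leads to a downloaded and cached copy of the script.
    return None


if __name__ == "__main__":
    for arg in ("the_pile", "datasets/the_pile"):
        found = find_local_script(arg)
        print(arg, "->", found or "no local script found at this path")
```

If the helper returns `None` for the string you intend to pass, your local edits are probably not the ones being executed, and passing the path to the script directory (as in `load_dataset("datasets/the_pile")`) is the form the thread recommends.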
He suggests adding a clearer explanation about this, perhaps under Installation > source.
CC: @stevhliu
Thanks for the feedback!
I think since this issue is closely related to loading, I can add a clearer explanation under Load > local loading script.
That makes sense, but I think having a line about it under the "source" header here https://huggingface.co/docs/datasets/installation#source would be useful. My mental model of `pip install -e .` does not include the fact that the source files aren't actually being used.
Thanks for sharing your perspective. I think the `load_dataset` function is the only one that pulls from GitHub, and since this use case is very specific, I don't think we need to include such a broad clarification in the Installation section.
Feel free to check out the linked PR and let me know if it needs any additional explanation 😊