
Document better that loading a dataset passing its name does not use the local script

Open albertvillanova opened this issue 3 years ago • 3 comments

As reported by @TrentBrick here https://github.com/huggingface/datasets/issues/4725#issuecomment-1191858596, the docs could make it clearer that loading a dataset by passing its name does not use the (modified) local copy of its loading script.

What he did:

  • he installed datasets from source
  • he locally modified the datasets/the_pile/the_pile.py loading script
  • he tried to load it, but used load_dataset("the_pile") instead of load_dataset("datasets/the_pile")
    • as explained here https://github.com/huggingface/datasets/issues/4725#issuecomment-1191040245:
      • the former does not use the local script; instead, it downloads a copy of the_pile.py from our GitHub, caches it locally (inside ~/.cache/huggingface/modules) and uses that.
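
To make the difference concrete, here is a minimal sketch. The helper `local_script_for` is hypothetical and only a simplified model of the behaviour described above, not the library's actual resolution code; it assumes you run it from the root of the source checkout, and it only checks whether a given argument points at a loading script on disk.

```python
import os


def local_script_for(path):
    """Return the local loading script that `path` points to, or None.

    Hypothetical helper for illustration only: a simplified model of the
    behaviour described in this issue, not the code used by `datasets`.
    """
    # Case 1: `path` is itself a Python loading script on disk.
    if path.endswith(".py") and os.path.isfile(path):
        return path
    # Case 2: `path` is a directory that contains <dirname>/<dirname>.py.
    name = os.path.basename(os.path.normpath(path))
    candidate = os.path.join(path, name + ".py")
    if os.path.isfile(candidate):
        return candidate
    # Otherwise nothing local matches, so a bare name would be resolved remotely.
    return None


# The two calls from the report, run from the root of the source checkout:
#   load_dataset("the_pile")           -> no local match, a downloaded copy is used
#   load_dataset("datasets/the_pile")  -> points at the modified local script
```

From the repository root, `local_script_for("datasets/the_pile")` would find the modified script, while `local_script_for("the_pile")` would return `None`, which is the situation in which the cached copy under ~/.cache/huggingface/modules ends up being used.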

He suggests adding a clearer explanation of this, perhaps under Installation > source.

CC: @stevhliu

albertvillanova avatar Jul 22 '22 06:07 albertvillanova

Thanks for the feedback!

I think since this issue is closely related to loading, I can add a clearer explanation under Load > local loading script.

stevhliu avatar Jul 22 '22 15:07 stevhliu

That makes sense, but I think having a line about it under the "source" header here https://huggingface.co/docs/datasets/installation#source would be useful. My mental model of pip install -e . does not include the fact that the source files aren't actually being used.

TrentBrick avatar Jul 25 '22 19:07 TrentBrick

Thanks for sharing your perspective. I think the load_dataset function is the only one that pulls from GitHub, and since this use-case is very specific, I don't think we need to include such a broad clarification in the Installation section.

Feel free to check out the linked PR and let me know if it needs any additional explanation 😊

stevhliu avatar Aug 01 '22 20:08 stevhliu