Add function to retrieve example dataset paths
Done:
- [x] Closes #924
- [x] I am familiar with the contributing guidelines
- [x] Tests added
- [x] New code is fully documented. Includes numpydoc compliant docstrings, examples, and comments where necessary.
- [x] Pull request is nearly complete and ready for detailed review.
- [x] Updates entries in
docs/sphinx/source/referencefor API changes. ~About this one, there isn't atools.rstfile, so I believe it doesn't need referencing.~
I'll wait for milestone assignment:
- [x] Adds description and name entries in the appropriate "what's new" file in
docs/sphinx/source/whatsnewfor all changes. Includes link to the GitHub Issue with:issue:`num`or this Pull Request with:pull:`num`. Includes contributor name and/or GitHub username (link with:ghuser:`user`). - [x] Maintainer: Appropriate GitHub Labels (including
remote-data) and Milestone are assigned to the Pull Request and linked Issue.
Adds a get_example_dataset_path() function to pvlib.tools that returns a filepath to a dataset bundled with pvlib. See linked issue above.
Here is the documentation link [UPDATED 2023-06-18].
This issue is related to #1056. However, I haven't addressed any of it since in first place I don't know which purpose each file has. Tweaking this new function would leverage work put into changing the structure.
Here is the documentation link [UPDATED].
This should be ok now. Please have a look at the docs. I've created a table to leverage work on the other issue. Maybe it's better to position it on another paragraph.
Edit: oh, and I ended up adding the parameter, since it won't really impact the use, someone more experienced may be able to organize the data folder.
Good to know about importlib.resources.files! It does require python 3.9 though.
The function name get_test_dataset_path implies that it's only for tests, yet it's used for all 3 of those items.
I suggested including test in the function name in an attempt to prevent confusing it for something like an iotools.get_ function; the thought was to try to make it clear that it's limited to only a select set of special files, not a general purpose data tool. But if we're looking to draw a line between test (as in pvlib.tests) and other types of data then I can see that it could be confusing.
But I don't see why one function for three types of data files might cause problems later on. If all the function does is figure_out_where_pvlib_is() + path_parameter, what's it matter whether the file is LinkeTurbidities.h5 or a testing file?
There's an importlib-resources backport for older python. We already use this approach for importlib-metadata for python < 3.8. This realpython example might be helpful too.
I think your original list of pros to a function are still valid. There the focus is on a function for example code and maybe user onboarding that separates public resources from package structure. If the user-facing function only does that then it might be a local win. On the other hand, it can also be good to propagate standard python patterns in example code so it's less clear that it's a global win. I think the internal code should keep it simple and stick to importlib-resources.
I understand the issue. We can limit the scope of this function until #1056 is addressed. I agree with @wholmgren criteria on that this PR should only satisfy user needs. I will revert unrelated changes.
I suggested including test in the function name in an attempt to prevent confusing it for something like an
iotools.get_function
@kandersolar what is your opinion on renaming it to get_example_dataset_path (or get_pvlib_example_dataset_path, get_bundled_dataset_path, maybe consider other synonym of 'bundled')?
I'd like to bring attention to two points:
- Importlib imports the whole module with
importlib.resources.files('pvlib'). I don't know how much overhead that introduces compared to using__path__[0]or__file__,but I expect it to be a little higher. Not much, but just to consider the micro optimization. - Tests are being shipped with pvlib when it's installed. Is that on purpose?
This should be fine to go. Renamed to get_example_dataset_path and use cases are only in the examples.
Importlib imports the whole module with importlib.resources.files('pvlib'). I don't know how much overhead that introduces compared to using
__path__[0]or__file__, but I expect it to be a little higher. Not much, but just to consider the micro optimization.
It seems to use the same sys.modules cache as import does, so as long as we are assuming pvlib is already imported, or will be imported later, then I don't think there is any significant penalty.
Tests are being shipped with pvlib when it's installed. Is that on purpose?
Yes, see #473. There is a non-pvlib discussion here on whether including tests in sdists and wheels is a good idea in general (tl;dr opinions vary).
That's good to know @kandersolar !
Closing this due to inactivity. If there's interest in the future, feel free to ping me.