[ENH] Replace os.path with Path for dataset directory handling (#3088)
[ENH] Replace os.path with Path for dataset directory handling (#3088)
What does this implement/fix?
This PR modernizes the dataset path-handling logic across the datasets module by replacing legacy os.path usage with pathlib.Path.
Updated files:
aeon/datasets/_data_loaders.pyaeon/datasets/_single_problem_loaders.pyaeon/datasets/dataset_collections.py
Summary of changes
- Added
from pathlib import Pathto each updated file. - Replaced:
os.path.dirname(aeon.__file__)os.path.join(...)
- With:
Path(aeon.__file__).resolve().parent / "datasets"
How this solves issue #3088
The previous implementation relied on os.path.dirname(aeon.__file__), which may resolve to unexpected paths under editable installs, virtual environments, or certain packaging scenarios.
By using Path(aeon.__file__).resolve().parent, the resolved path always points to the actual installed module directory.
This stabilizes dataset directory discovery and prevents issues where datasets are loaded from or written to incorrect locations.
Testing
- All dataset-loading tests passed (
pytest). - Verified dataset loading, regression/classification loading, and forecasting loading manually.
- Pre-commit hooks were run; formatting and imports were corrected automatically.
Does your contribution introduce a new dependency?
No, pathlib is part of the Python standard library.
Any other comments?
This improves cross-platform reliability and aligns the codebase with modern Python best practices.
PR checklist
- [x] I will use
@all-contributorsafter merge. - [x] PR title begins with
[ENH].
Thank you for contributing to aeon
I have added the following labels to this PR based on the title: [ enhancement ]. I have added the following labels to this PR based on the changes made: [ datasets ]. Feel free to change these if they do not properly represent the PR.
The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.
If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.
Don't hesitate to ask questions on the aeon Slack channel if you have any.
PR CI actions
These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.
- [ ] Run
pre-commitchecks for all files - [ ] Run
mypytypecheck tests - [ ] Run all
pytesttests and configurations - [ ] Run all notebook example tests
- [ ] Run numba-disabled
codecovtests - [ ] Stop automatic
pre-commitfixes (always disabled for drafts) - [ ] Disable numba cache loading
- [ ] Regenerate expected results for testing
- [ ] Push an empty commit to re-run CI checks
The CI failures is common to all the PRs. It would be fixed.
Was there an issue with this PR?
Was there an issue with this PR?
Hi, thanks for following up @MatthewMiddlehurst
I closed the previous PR because, after reviewing it again, I realised that the change I made (replacing os.path with Path) did not actually fix the underlying issue raised in #3088
It only modernised the path code but did not change the dataset storage location.
The real problem is that downloaded TSC datasets are currently saved inside the package directory (site-packages/aeon/datasets/data/), which is not ideal because package directories should be treated as read-only and can cause permission issues across environments.
For the proper fix, I’m planning to move the default dataset storage location to a user-writable directory using platformdirs.
On Windows this resolves to something like - C:\Users\satwiksps\AppData\Local\aeon\aeon-toolkit\datasets\ which avoids polluting the installation directory.
Before I start the new PR, could you please confirm that user_data_dir("aeon", "aeon-toolkit") / "datasets" is the preferred default location for TSC dataset downloads ?
I’ll proceed as soon as I have your confirmation.
It would be better to edit your current PR rather than create a new one for changes like this. Rather than AppData it should probably be wherever the code is run from right? i.e. if you run code/my_code.py it should create code/aeon_data/my_data.ts or something. Up for discussion.
If you don't mind answering, why did you initially only change the method to find the path? Just curious whether it was an problem with the issue.
It would be better to edit your current PR rather than create a new one for changes like this. Rather than AppData it should probably be wherever the code is run from right? i.e. if you run
code/my_code.pyit should createcode/aeon_data/my_data.tsor something. Up for discussion.
I’m happy to continue with the existing PR instead of opening a new one. I closed it only because I thought the fix was incomplete and possibly misleading.
About the default location:
I’m open to storing the data relative to the working directory (e.g. ./aeon_data/...) as you suggested. My initial thought was to use a user-writable platform directory (e.g. AppData/Local/aeon/... on Windows, ~/.local/share/aeon/... on Linux) since many ML libraries (e.g., HuggingFace, torchtext) follow that pattern, but using a project-local folder like ./aeon_data also has advantages for reproducibility and portability.
I’m happy to implement this and I’ll update the PR accordingly.
If you don't mind answering, why did you initially only change the method to find the path? Just curious whether it was an problem with the issue.
Initially, I interpreted the issue as primarily a problem with how the dataset directory was being resolved, not where it should live. The use of os.path.dirname(aeon.__file__) sometimes resolves to unexpected paths under editable instals, venvs, or when the package is symlinked.
So my first instinct was to fix the path resolution method to reduce those inconsistencies. Only after digging deeper into the datasets module did I realize the real problem is not about resolution, but about location.
So my first PR only improved the mechanics of path construction, not the logic of choosing the correct base directory. Once I connected that the issue is actually about selecting a new default storage root, I realized the initial fix wasn’t addressing the underlying problem so I stepped back to rethink the design.
Interesting to hear that other packages put it in AppData. May be worth asking on Slack if there is a preference? I think for now if you want to edit the PR just do it locally as its an improvement over the current method, and we can change it later if necessary.