Improve file path validation
Can this be expensive? Are we sure it won't be making syscalls underneath all these normalize, as_posix, etc. calls?
In this PR I am also replacing pathlib.Path with pathlib.PurePath. The main difference is that PurePath does not resolve files and does not perform any real filesystem operations; it is pure string manipulation only.
https://docs.python.org/3/library/pathlib.html
Path classes are divided between pure paths, which provide purely computational operations without I/O, and concrete paths, which inherit from pure paths but also provide I/O operations.
So it should not make any syscalls.
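To make the pure vs. concrete distinction tangible, here is a small standard-library sketch. It assumes POSIX-style paths and is not datachain code; it only shows that parsing and normalizing can be done on strings alone.

```python
import posixpath
from pathlib import PurePosixPath

raw = "./foo/../bar/../file.ext"

# PurePosixPath only parses the string; it never touches the filesystem.
pure = PurePosixPath(raw)
print(pure.as_posix())  # the leading "./" is dropped, ".." segments are kept
print(pure.parts)

# posixpath.normpath is also pure string manipulation and collapses "..".
print(posixpath.normpath(raw))

# I/O methods such as exists() live on concrete paths only.
print(hasattr(pure, "exists"))
```

Note that PurePosixPath by itself does not collapse ".." segments, so a separate normalization step is still needed.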
I'm worried it could slow down massive bulk operations, such as listing creation, for the sake of a very rarely needed use case.
Very good point IMO.
Is there anything we can do as a very basic quick test, e.g. starts with ., .. or /? Something really quick that would tell us whether we need to run a more complicated test?
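One possible shape for such a fast path, as a rough sketch only. The helper names `is_trivially_clean` and `normalize` are hypothetical and not part of datachain; the idea is to skip the full normalization whenever a cheap segment scan finds nothing suspicious.

```python
import posixpath


def is_trivially_clean(path: str) -> bool:
    """Cheap check: True only if the path is certainly already normalized."""
    # Empty, absolute, or directory-like paths always go to the full check.
    if not path or path.startswith("/") or path.endswith("/"):
        return False
    # Any empty, "." or ".." segment means the path is not normalized.
    return not any(seg in ("", ".", "..") for seg in path.split("/"))


def normalize(path: str) -> str:
    """Return the path unchanged on the fast path, else normalize it."""
    if is_trivially_clean(path):
        return path
    return posixpath.normpath(path)
```

Whether this is actually faster than always normalizing would need to be measured; for short paths the difference may be negligible.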
$ python -m timeit -s "from datachain import File" "File.validate_path('./foo/../bar/../file.ext')"
100000 loops, best of 5: 3.14 usec per loop
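The same kind of measurement can be scripted with the timeit module. This sketch times posixpath.normpath as a stand-in, since it does not import datachain, so its numbers are not comparable to the File.validate_path result above.

```python
import timeit

number = 100_000

# Time a pure-string normalization on the same example path.
total_seconds = timeit.timeit(
    "normpath('./foo/../bar/../file.ext')",
    setup="from posixpath import normpath",
    number=number,
)

per_call_usec = total_seconds / number * 1e6
print(f"{per_call_usec:.2f} usec per call")
```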
I am more worried about modifying the file path (see tests in this PR).
Is it OK to modify the path (e.g. dir/../file.ext -> file.ext), or should we raise an exception if the path is not normalized?
I'd prefer the second option, to be honest, but it can break the user experience. At the same time, modifying the path is also not good for the user experience.
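For reference, the two options being weighed could look roughly like this. Both function names are hypothetical, and posixpath.normpath is used as an assumed stand-in for whatever normalization the PR ends up with.

```python
import posixpath


def normalize_path(path: str) -> str:
    """Option 1: silently rewrite the path into its normalized form."""
    return posixpath.normpath(path)


def require_normalized(path: str) -> str:
    """Option 2: reject any path that is not already normalized."""
    if posixpath.normpath(path) != path:
        raise ValueError(f"path is not normalized: {path!r}")
    return path


print(normalize_path("dir/../file.ext"))

try:
    require_normalized("dir/../file.ext")
except ValueError as exc:
    print(exc)
```

Option 1 changes what the user stored; option 2 surfaces the problem but makes existing datasets with unusual paths fail.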
Is this OK to modify path
tbh, i think that's fine. why not? File is meant to be an actual precise object more or less / vs some random paths
Also, it looks like we should allow absolute paths, since this is how it works now for the file:// source.
I am going to remove the check for absolute paths, but ideally we should also check the source and allow an absolute path only if source.startswith("file://"), and disallow it in all other cases.
Another option is to move the root directory (even /) into the source field and always use relative paths in the path field for all sources, but this is not backward-compatible and would require migrations.
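A minimal sketch of the source-aware check described above. The function name `check_absolute_path` is hypothetical, and the rule itself is only the idea from this comment, not what the PR implements.

```python
def check_absolute_path(source: str, path: str) -> None:
    """Allow absolute paths only for local (file://) sources."""
    if path.startswith("/") and not source.startswith("file://"):
        raise ValueError(
            f"absolute path {path!r} is not allowed for source {source!r}"
        )


# Local source: absolute path is accepted.
check_absolute_path("file://", "/tmp/data/file.ext")

# Cloud source: only relative paths pass.
check_absolute_path("s3://bucket", "dir/file.ext")
```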
My only concern is that the user ends up with a normalized path in the dataset even if they created the File signal with some random weird path. On the other hand, you're right: this is "an actual precise object", and normalizing the path looks like the only option.
a completely alternative option - allow any path at all, don't validate
validate in prefetch instead (our code that deals with these files)
let users also deal with any custom logic that would like to put there
wdyt?
One more sanity check we can do is to verify that the path does not end with a trailing slash (/), which represents a directory, whereas the File model is for file objects.
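Such a check is cheap to express. In the sketch below the helper name is hypothetical, and treating an empty path or a final "." / ".." segment as directory-like is my own assumption beyond the trailing-slash rule.

```python
def looks_like_directory(path: str) -> bool:
    """True if the path can only refer to a directory, not a file object."""
    if path == "" or path.endswith("/"):
        return True
    # Assumption: a final "." or ".." segment also denotes a directory.
    last_segment = path.rsplit("/", 1)[-1]
    return last_segment in (".", "..")


for candidate in ("dir/file.ext", "dir/", ".", "dir/.."):
    print(candidate, looks_like_directory(candidate))
```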
Yes, this is what my concern was about when I was talking about "user experience".
I think this looks like a good option.
Should I add something like a normpath method/property to the File model, and check that we use this new normalized path everywhere we work with physical files in our codebase?
yes, kinda use your validation logic
but also we need to put safeguards in UDFs, right? For local paths. E.g. do we even need prefetch if file is local?
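A rough sketch of what "keep the stored path as-is, normalize only when touching physical files" could look like. `FileSketch` is a hypothetical stand-in for the File model, not the real class, and the rule that a normalized path must not escape its source via ".." is an assumed example of a safeguard.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class FileSketch:
    """Stand-in for the File model; only the fields needed here."""

    source: str
    path: str  # stored exactly as the user provided it

    @property
    def normpath(self) -> str:
        """Normalized path for code that works with physical files."""
        normalized = posixpath.normpath(self.path)
        # Assumed safeguard: never let the path climb above the source root.
        if normalized == ".." or normalized.startswith("../"):
            raise ValueError(f"path escapes the source: {self.path!r}")
        return normalized


f = FileSketch(source="s3://bucket", path="dir/../file.ext")
print(f.path)      # unchanged user input
print(f.normpath)  # what prefetch/cache code would use
```

With this shape the dataset keeps whatever the user wrote, and validation errors only appear at the point where a file is actually read.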
E.g. do we even need prefetch if file is local?
We don't need prefetch or cache (? different disks/file systems?) if file is local, but I think this is a subject for separate PR?
if file is local, but I think this is a subject for separate PR?
yes, unless it is just easier to do that vs other types of safeguards (e.g. deny any local operations for prefetch / cache for now).
in reality it might be needed though in local mode - e.g. slow NAS mounted on some volume
Yes, exactly. This is still a local file, but cache/prefetch might be useful in this case. The user can control this with settings, so maybe we should leave it as is for now.
What will users see when they run UDF with some bad files after this change?
They will see an exception:
datachain.lib.file.FileError: Error in file gs://datachain-test-vlad/.: path must not be a directory
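For illustration, here is how a UDF author might catch and report such a failure. `FileErrorSketch` is a hypothetical stand-in that only reproduces the message format shown above; the real datachain.lib.file.FileError constructor may take different arguments.

```python
class FileErrorSketch(Exception):
    """Stand-in exception that mimics the message format shown above."""

    def __init__(self, message: str, source: str, path: str):
        self.source = source
        self.path = path
        super().__init__(f"Error in file {source}/{path}: {message}")


def open_checked(source: str, path: str) -> str:
    """Hypothetical read step that rejects directory-like paths."""
    if path in ("", ".", "..") or path.endswith("/"):
        raise FileErrorSketch("path must not be a directory", source, path)
    return f"{source}/{path}"


try:
    open_checked("gs://datachain-test-vlad", ".")
except FileErrorSketch as exc:
    error_text = str(exc)
    print(error_text)
```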
There is another issue, not directly related: https://github.com/iterative/datachain/issues/1125