datachain Improve file path validation

Validate File.path at the point of use (caching, download). Add more validation cases to the tests.

May 20 '25 02:05 dreadatour

can it be expensive? are we sure it won't be doing some syscalls underneath all these normalize, as_posix - etc

In this PR I am also replacing pathlib.Path with pathlib.PurePath. The main difference is that PurePath does not resolve files and performs no real filesystem operations; it is pure string manipulation only.

https://docs.python.org/3/library/pathlib.html

Path classes are divided between pure paths, which provide purely computational operations without I/O, and concrete paths, which inherit from pure paths but also provide I/O operations.

So it should not make any syscalls.
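
A minimal sketch of that distinction, using only the standard library (the sample path is illustrative and not taken from the PR). It suggests that pure paths only parse the string, and that collapsing `..` needs `posixpath.normpath`, which is also string-only:

```python
import posixpath
from pathlib import PurePosixPath

raw = "./foo//bar/../file.ext"  # illustrative input

# PurePosixPath only parses the string: it drops "." segments and duplicate
# slashes, but keeps ".." because it never consults the filesystem.
pure = PurePosixPath(raw).as_posix()

# posixpath.normpath is also pure string manipulation, and it does collapse "..".
normalized = posixpath.normpath(raw)

print(pure)
print(normalized)
```

Pure path classes also do not expose I/O methods such as `resolve()` or `stat()`, so they cannot touch the filesystem by accident.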

I'm worried it can slow down some bulk massive operations - like listing creation for some very rarely needed use case

Very good point IMO.

May 20 '25 02:05 dreadatour

Very good point IMO.

anything we can do to do a very basic quick test? e.g. starts with . or .. or /? etc ... something really quick that would tell if we need to do a more complicated test?

May 20 '25 03:05 shcheklein

anything we can do to do a very basic quick test? e.g. starts with . or .. or /? etc ... something really quick that would tell if we need to do a more complicated test?

$ python -m timeit -s "from datachain import File" "File.validate_path('./foo/../bar/../file.ext')"
100000 loops, best of 5: 3.14 usec per loop
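
For reference, the quick pre-check suggested above could look roughly like the sketch below. `needs_normalization` and `fast_normalize` are hypothetical names, not part of datachain, and at about 3 usec per call the full validation may already be cheap enough that this is unnecessary:

```python
import posixpath


def needs_normalization(path: str) -> bool:
    """Cheap pre-check (hypothetical helper).

    Returns True if `path` might change under posixpath.normpath.
    False positives are fine (e.g. ".hidden/file"): they only cost one
    extra normpath call.
    """
    if not path or path.startswith(("/", ".")) or path.endswith("/"):
        return True
    # Remaining patterns normpath would rewrite: duplicate slashes and
    # "." / ".." segments in the middle or at the end of the path.
    return (
        "//" in path
        or "/./" in path
        or "/../" in path
        or path.endswith(("/.", "/.."))
    )


def fast_normalize(path: str) -> str:
    """Skip normpath for paths that already look normalized."""
    return posixpath.normpath(path) if needs_normalization(path) else path
```

Usage: `fast_normalize("dir/file.ext")` returns the string unchanged without calling `normpath`, while `fast_normalize("./foo/../bar/../file.ext")` falls through to the full normalization.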

May 20 '25 03:05 dreadatour

I am more worried about modifying the file path (see the tests in this PR).

Is this OK to modify path (e.g. dir/../file.ext -> file.ext), or should we raise an exception if the path is not normalized? I'd prefer the second option, to be honest, but it can break the user experience. At the same time, modifying the path is also not good for the user experience.
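
To make the two options concrete, here is a hedged sketch. Both function names are hypothetical, and `posixpath.normpath` stands in for whatever normalization the PR actually uses:

```python
import posixpath


def normalize_path(path: str) -> str:
    """Option 1: silently rewrite the path into its normalized form."""
    return posixpath.normpath(path)


def require_normalized(path: str) -> str:
    """Option 2: reject any path that is not already normalized."""
    if posixpath.normpath(path) != path:
        raise ValueError(f"path is not normalized: {path!r}")
    return path
```

With option 1, `dir/../file.ext` is stored as `file.ext`; with option 2, the same input raises and the user has to fix it.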

May 20 '25 03:05 dreadatour

Is this OK to modify path

tbh, i think that's fine. why not? File is meant to be an actual precise object more or less / vs some random paths

May 20 '25 03:05 shcheklein

Also, it looks like we should allow absolute paths, since that is how it works now for file:// sources.

I am going to remove the check for absolute paths, but ideally we should also check source: allow an absolute path only if source.startswith("file://") and disallow it in all other cases.
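
A rough sketch of that ideal check, assuming `source` and `path` are plain strings. `check_absolute_path` is a hypothetical helper, not something in this PR:

```python
def check_absolute_path(source: str, path: str) -> None:
    """Hypothetical check: absolute paths are allowed only for file:// sources."""
    if path.startswith("/") and not source.startswith("file://"):
        raise ValueError(
            f"absolute path {path!r} is not allowed for source {source!r}"
        )


# Absolute path on a local source: accepted.
check_absolute_path("file://", "/data/images/cat.jpg")
# Relative path on a cloud source: accepted.
check_absolute_path("s3://bucket", "images/cat.jpg")
```

An absolute path combined with a non-local source, for example `check_absolute_path("s3://bucket", "/images/cat.jpg")`, would raise `ValueError` under this sketch.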

Another option is to move the root directory (even /) into the source field and always use relative paths in the path field for all sources, but this is not backward compatible and would require migrations.

May 20 '25 03:05 dreadatour

Is this OK to modify path

tbh, i think that's fine. why not? File is meant to be an actual precise object more or less / vs some random paths

My only concern is that users end up with a normalized path in the dataset even when they create a File signal with some random weird path. On the other hand, you're right: this is "an actual precise object", and normalizing the path looks like the only option.

May 20 '25 03:05 dreadatour

a completely alternative option - allow any path at all, don't validate

validate in prefetch instead (our code that deals with these files)

let users also deal with any custom logic that they would like to put there

wdyt?

May 20 '25 03:05 shcheklein

One more sanity check we can do is to verify that the path does not end with a trailing slash (/): a trailing slash denotes a directory, and the File model is for file objects.
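
Something along these lines, as a sketch. `looks_like_directory` is a hypothetical name, and treating a final `.` or `..` segment as a directory is my assumption on top of the trailing-slash rule:

```python
def looks_like_directory(path: str) -> bool:
    """Hypothetical check: True if `path` cannot refer to a file object."""
    # A trailing slash denotes a directory.
    if path.endswith("/"):
        return True
    # Assumption: a final "." or ".." segment also refers to a directory.
    return path.rsplit("/", 1)[-1] in (".", "..")
```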

May 20 '25 03:05 dreadatour

a completely alternative option - allow any path at all, don't validate

validate in prefetch instead (our code that deals with these files)

let users also deal with any custom logic that they would like to put there

Yes, this is what my concern was about when I was talking about "user experience".

wdyt?

I think this looks like a good option.

Should I add something like normpath method/property to the File model and check if we are using this new normalized path everywhere we are working with physical files in our codebase?
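
Roughly what I have in mind, as a sketch only. `FileSketch` is a hypothetical stand-in and not the actual File model, and rejecting paths that escape the source root is an assumption about what the validation should do:

```python
import posixpath
from dataclasses import dataclass


@dataclass
class FileSketch:
    """Hypothetical stand-in for the File model (not datachain's class)."""

    source: str
    path: str  # stored exactly as the user supplied it, never modified

    @property
    def normpath(self) -> str:
        """Normalized path, computed and validated only when it is used."""
        normalized = posixpath.normpath(self.path)
        # Assumption: a path that escapes the source root is invalid.
        if normalized == ".." or normalized.startswith("../"):
            raise ValueError(f"path escapes the source root: {self.path!r}")
        return normalized


f = FileSketch(source="s3://bucket", path="dir/../file.ext")
print(f.path)      # the user's original value is untouched
print(f.normpath)  # the value our own code (cache, download) would use
```

This keeps whatever the user put into the dataset intact, while the code that touches physical files reads only `normpath`.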

May 20 '25 03:05 dreadatour

Should I add something like normpath method/property to the File model and check if we are using this new normalized path everywhere we are working with physical files in our codebase?

yes, kinda use your validation logic

but also we need to put safeguards in UDFs, right? For local paths. E.g. do we even need prefetch if file is local?

May 20 '25 03:05 shcheklein

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`6ef2a12`
Status:	✅ Deploy successful!
Preview URL:	https://c9e99731.datachain-documentation.pages.dev
Branch Preview URL:	https://better-file-path-validation.datachain-documentation.pages.dev

May 20 '25 16:05 cloudflare-workers-and-pages[bot]

E.g. do we even need prefetch if file is local?

We don't need prefetch or cache (? different disks/file systems?) if file is local, but I think this is a subject for separate PR?

May 20 '25 16:05 dreadatour

if file is local, but I think this is a subject for separate PR?

yes, unless it is just easier to do that vs other types of safeguards (e.g. deny any local operations for prefetch / cache for now).

in reality it might be needed though in local mode - e.g. slow NAS mounted on some volume

May 20 '25 16:05 shcheklein

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 88.63%. Comparing base (49eb59a) to head (6ef2a12). Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1110      +/-   ##
==========================================
+ Coverage   88.61%   88.63%   +0.01%     
==========================================
  Files         148      148              
  Lines       12864    12881      +17     
  Branches     1810     1814       +4     
==========================================
+ Hits        11400    11417      +17     
  Misses       1039     1039              
  Partials      425      425

Flag	Coverage Δ
datachain	`88.56% <100.00%> (+0.01%)`	:arrow_up:

Files with missing lines	Coverage Δ
src/datachain/cache.py	`88.75% <100.00%> (ø)`
src/datachain/client/fsspec.py	`92.72% <100.00%> (+0.02%)`	:arrow_up:
src/datachain/client/local.py	`97.33% <100.00%> (ø)`
src/datachain/lib/arrow.py	`98.77% <100.00%> (ø)`
src/datachain/lib/file.py	`90.86% <100.00%> (+0.36%)`	:arrow_up:
src/datachain/lib/tar.py	`100.00% <ø> (ø)`
src/datachain/lib/webdataset.py	`94.44% <100.00%> (ø)`

May 21 '25 04:05 codecov[bot]

in reality it might be needed though in local mode - e.g. slow NAS mounted on some volume

Yes, exactly. This is still a local file, but cache/prefetch might be useful in this case. Users can control this with settings, so maybe we should leave it as is for now.

May 21 '25 05:05 dreadatour

What will users see when they run UDF with some bad files after this change?

They will see an exception:

datachain.lib.file.FileError: Error in file gs://datachain-test-vlad/.: path must not be a directory

There is another issue, not directly related: https://github.com/iterative/datachain/issues/1125

May 29 '25 15:05 dreadatour