clearml icon indicating copy to clipboard operation
clearml copied to clipboard

Relative path is incorrect when using `Dataset.add_external_files` for local file without the `file://` prefix

Open geometrikal opened this issue 1 year ago • 9 comments

Describe the bug

Dataset.add_external_file will allow adding external files from a source url without the file:// prefix, e.g. /path/to/my/file.csv. It will create the link correctly in the metadat.json by prepending file:// however it will remove characters from the file name when creating the relative link.

To reproduce

dataset = Dataset.create(
        dataset_project="tes",
        dataset_name="test"
)
dataset.add_external_files("/mnt/f/0123456789.txt")
dataset.finalize(auto_upload=True)
print(dataset.list_files())
# ['789.txt']

Expected behaviour

Either throw an exception or correctly create the relative path

Environment

  • Server type self hosted
  • ClearML SDK Version 1.16.3
  • ClearML Server Version 1.16.1-499 • API: 2.30
  • Python Version 3.10
  • OS Linux

geometrikal avatar Aug 14 '24 02:08 geometrikal

Hi @geometrikal ! Thanks for reporting this. We will get back to you once this is fixed.

Hi @eugen-ajechiloae-clearml i would like to work on this. Can i pick it up. I am new to the repo, can you please help with some guidance on what has to be done here.

d-vignesh avatar Sep 04 '24 15:09 d-vignesh

Hi @d-vignesh ! Yes, all contributions are welcome, and no-one picked this one up as far as I'm aware. It looks like when adding a file using add_external_files, the file name will be stripped of the first len("file://") characters. I I believe that we don't really support paths that are not prefixed by file:// at all, so, in this function: https://github.com/allegroai/clearml/blob/b1e39e68973182d66b43edc502b58d955b5bc779/clearml/datasets/dataset.py#L3322 I would add a check: if the file is a local path without file:// then get the absolute path and add the file:// prefix. This should be tested using absolute paths, relative paths, on both Windows and Linux/MacOS.

Hi @eugen-ajechiloae-clearml can be please clarify on below doubt, when a source_url without file:// prefix is provided,

  1. the code defaults to file:// as base url and hence the metadata is having the expected name value
  2. during constructing the relative path
link = remote_object.get("name")
relative_path = link[len(source_url):]
if not relative_path:
     relative_path = source_url.split("/")[-1]

for example, if the source_url is /Users/docs/file1.txt, the link value will be file://Users/docs/file1.txt. Hence the relative path will be assigned as le1.txt in line 2. Instead if we use relative_path = source_url.split("/")[-1] always we should be getting the correct relative path. What scenario is the line relative_path = link[len(source_url):] handling in this case? image

d-vignesh avatar Sep 06 '24 17:09 d-vignesh

Hi @d-vignesh ! There are 2 cases when adding external files: you can either add a "directory" or a direct path. In case the source URL is the direct path, we fallback to relative_path = source_url.split("/")[-1]

thank you @eugen-ajechiloae-clearml , got it. Please verify my understanding, that user can give both absolute path /User/project/file1.txt or relative path ./file1.txt if in project directory. In relative path case we need to get the absolute path as /User/project/file1.txt . Then finally append the file:// prefix. This modified path will then be used as the source_url.

d-vignesh avatar Sep 09 '24 16:09 d-vignesh

Assuming the above understanding is right, the below logic is working for all scenarios

source_url = '../data/file2.txt'
parsed = urlparse(source_url)
if parsed.scheme == "":
    abs_path = os.path.abspath(source_url)
    source_url = f"file://{abs_path}"
print(source_url)

Adding this check at the beginning of the add_external_file function seems to resolve the issue. Am i missing any case? image

d-vignesh avatar Sep 09 '24 17:09 d-vignesh

Hi @eugen-ajechiloae-clearml , created a PR for the fix https://github.com/allegroai/clearml/pull/1326. Can you please have a look at it.

d-vignesh avatar Sep 15 '24 16:09 d-vignesh

Hey @geometrikal! Just letting you know that this issue has been resolved in v2.0.0. Let us know if there are any issues :)

pollfly avatar May 22 '25 11:05 pollfly