`api.files_server` without a port causes pipeline to fail

Open lastsecondsave opened this issue 9 months ago • 4 comments

Describe the bug

I'm trying to run a pipeline with this step:

pipeline.add_function_step(
    name="some_work",
    task_type=TaskTypes.data_processing,
    function=some_work,
    function_kwargs={"x": ["y", "z"]},
)

The function itself is not relevant; it is never called. When an agent picks up the step, it fails with:

2024-05-07 16:13:07,837 - clearml.storage - WARNING - Failed getting object size: ValueError('Failed getting object :443/Project/.pipelines/Test/Test%20%252321.82408167353a45778399f48c543be2a5/artifacts/some_work.name/some_work.name.pkl (404): NOT FOUND')
2024-05-07 16:13:08,016 - clearml.storage - ERROR - Could not download https://files.clearml.xxxxx.net:443/Project/.pipelines/Test/Test%20%252321.82408167353a45778399f48c543be2a5/artifacts/some_work.name/some_work.name.pkl , err: Failed getting object :443/Project/.pipelines/Test/Test%20%252321.82408167353a45778399f48c543be2a5/artifacts/some_work.name/some_work.name.pkl (404): NOT FOUND 
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/code/some_work.py", line 34, in <module>
    kwargs[k] = parent_task.artifacts[artifact_name].get(deserialization_function=None)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/clearml/binding/artifacts.py", line 171, in get
    local_file = self.get_local_copy(raise_on_error=True, force_download=force_download)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/clearml/binding/artifacts.py", line 244, in get_local_copy
    raise ValueError(
ValueError: Could not retrieve a local copy of artifact some_work.name, failed downloading https://files.clearml.xxxxx.net:443/Project/.pipelines/Test/Test%20%252321.82408167353a45778399f48c543be2a5/artifacts/some_work.name/some_work.name.pkl
2024-05-07 18:13:24
Process failed, exit code 1

The URL in the error is correct, and the file can be downloaded with a browser.

To reproduce

This is how ClearML is deployed in our environment:

api {
  web_server: https://clearml.xxxxx.net
  api_server: https://api.clearml.xxxxx.net
  files_server: https://files.clearml.xxxxx.net
}

The problem above disappears if I explicitly set the port for the files_server:

files_server: https://files.clearml.xxxxx.net:443

My wild guess is that here you create an object's name from the URL, which makes the port part of the name (you can see it in the log). And here you reconstruct the URL, which will probably look like https://files.clearml.xxxxx.net/:443/.... That URL is what causes the 404.
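
To illustrate the guess above, here is a minimal Python sketch. It is not ClearML's actual code: `object_name_from_url` is a hypothetical helper, and the artifact path is shortened for readability. It only assumes that the object name is derived by stripping the configured `files_server` prefix from the stored URL.

```python
def object_name_from_url(base_url: str, url: str) -> str:
    """Hypothetical helper: derive an object name by stripping the base URL prefix."""
    # A plain string prefix check passes even when the stored URL carries an
    # explicit port that the configured base URL lacks.
    if not url.startswith(base_url):
        raise ValueError(f"{url} is not under {base_url}")
    return url[len(base_url):]


# Config of the second agent: no explicit port.
base = "https://files.clearml.xxxxx.net"
# URL registered by the first agent: explicit default port (path shortened).
stored = "https://files.clearml.xxxxx.net:443/Project/artifacts/some_work.name.pkl"

name = object_name_from_url(base, stored)
# The port leaks into the object name, as in the "Failed getting object :443/..." log line.
print(name)

# Rebuilding the URL from the base and the object name yields a path that does not exist.
rebuilt = base + "/" + name.lstrip("/")
print(rebuilt)
```

If the real code behaves anything like this, the object name starts with `:443/`, which would match the warning in the log.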

Expected behaviour

URLs without ports should not cause any issues.

Environment

  • Server type: self hosted
  • ClearML SDK Version: 1.15.1
  • ClearML Server Version: 1.14.0-431
  • Python Version: 3.10
  • OS: Linux

lastsecondsave, May 08 '24 13:05

Hi @lastsecondsave, I'm not sure I understand. The error you show mentions port 443, which means the port appears in the URL registered for this artifact (it's explicitly written in the DB). So the problem seems to be that the URLs you're trying to access include port 443, while the files_server setting does not. This raises the question of how these URLs were created. The only way I can think of is that at some point the clearml.conf configuration file did contain files_server with the port, and the files were uploaded at that point...

jkhenning, May 08 '24 15:05

They were created by another agent. Agent one has port 443 explicitly set, and the pipeline starts on it:

pipeline.start(queue="one")

The task is scheduled on agent two, whose config has no port in it:

pipeline.add_function_step(
    execution_queue="two",
    ...)

My assumption was that the first agent had nothing to do with the problem, since it already had a "workaround". It seems yesterday's debugging session was accidentally caused by whoever initially configured it.

Still, the presence or absence of the default ports should not cause anomalies.

lastsecondsave, May 08 '24 16:05

"Still, the presence or absence of the default ports should not cause anomalies"

I'll have to disagree on the last one. In general, it's possible to have several different services on different ports, which is why the SDK uses the exact service endpoint to look for configured credentials. I'm not sure why different clients (agent, SDK) should have different endpoints defined (with or without a port). You should simply decide on one and use it consistently.

jkhenning, May 09 '24 06:05

It's not possible to have different services on https://example.com and https://example.com:443, right? The scheme already implies the default port. Given that's how the whole internet works, nobody will expect this to cause any issues. And it's not like your error messages help in this case. At the very least, check whether the addresses match exactly and report a clear error when they don't. My case may be a bit dumb, but your product should be foolproof.
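
A rough sketch of such a check, using only Python's standard `urllib.parse`. The helper names (`canonical_origin`, `same_endpoint`, `describe_mismatch`) are hypothetical and not part of the ClearML SDK. The assumption is simply that a missing port means the scheme's default (80 for http, 443 for https).

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports implied by the scheme when the URL omits one.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return (scheme, host, port) with a missing port filled in from the scheme."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def same_endpoint(a: str, b: str) -> bool:
    """True if both URLs address the same scheme, host and effective port."""
    return canonical_origin(a) == canonical_origin(b)


def describe_mismatch(configured: str, artifact_url: str) -> Optional[str]:
    """Return a readable message when the two URLs are not written identically."""
    if (urlsplit(configured).scheme == urlsplit(artifact_url).scheme
            and urlsplit(configured).netloc == urlsplit(artifact_url).netloc):
        return None  # spelled the same way, nothing to report
    if same_endpoint(configured, artifact_url):
        return (f"files_server '{configured}' and artifact URL '{artifact_url}' "
                "point to the same endpoint but differ in how the default port is written")
    return f"artifact URL '{artifact_url}' does not belong to files_server '{configured}'"


print(same_endpoint("https://example.com", "https://example.com:443"))
print(describe_mismatch("https://files.clearml.xxxxx.net",
                        "https://files.clearml.xxxxx.net:443/Project/file.pkl"))
```

With a check like this, the client could either treat the two spellings as equivalent or fail early with a message that names the actual mismatch instead of a 404.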

lastsecondsave, May 09 '24 11:05