codeflare-sdk icon indicating copy to clipboard operation
codeflare-sdk copied to clipboard

job.status and job.log returns error for certain domain name pattern

Open tedhtchang opened this issue 2 years ago • 4 comments

The job.status() and job.logs() Are failing, they're showing

gaierror                                  Traceback (most recent call last)
File /opt/app-root/lib64/python3.8/site-packages/urllib3/connection.py:174, in HTTPConnection._new_conn(self)
    173 try:
--> 174     conn = connection.create_connection(
    175         (self._dns_host, self.port), self.timeout, **extra_kw
    176     )
    178 except SocketTimeout:
....
ConnectionError: Failed to connect to Ray at address: http://ray/.

Cause: In torchx The app_handle is a string. On my cluster it looks like ray://torchx/ray-dashboard-mnisttest-default.tedchang-codeflare-test-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.us-south.containers.appdomain.cloud-mnisttest-trwmgjb2s041jc

This method uses a regx that not parse the app_id(mnisttest-trwmgjb2s041jc) and ray dashboard(ray-dashboard-mnisttest-default.tedchang-codeflare-test-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.us-south.containers.appdomain.cloud) property ":\d+-|.com-|.org-|.net-|.gov-|.io-"

/opt/app-root/lib64/python3.8/site-packages/torchx/schedulers/ray_scheduler.py

330         def _parse_app_id(self, app_id: str) -> Tuple[str, str]:
331             # find index of '-' in the first :\d+-
332             m = re.search(r":\d+-|.com-|.org-|.net-|.gov-|.io-" , app_id)
333             if m:
334                 sep = m.span()[1]
335                 addr = app_id[: sep-1]
336                 app_id = app_id[sep:]
337                 return addr, app_id
338 ~~~~~~~~~~~~~~~~
339             addr, _, app_id = app_id.partition("-")
340             return addr, app_id

something like this worked for my cluster but may not be robust ":\d+-|.com-|.org-|.net-|.gov-|.io-|.cloud-"

tedhtchang avatar Jun 08 '23 16:06 tedhtchang

@KPostOffice @jbusche @MichaelClifford Our _parse_app_id looks different from the one in the upstream. Is it a fork which we can make it work for now ?

tedhtchang avatar Jun 08 '23 16:06 tedhtchang

The one upstream was more broken than the one in our custom version. We can definitely make changes, I'm not sure how to make it more generic exactly. Maybe something including an reverse parse for . since I think the name cannot have periods in it, so the from last period to the - must indicate the top level domain.

KPostOffice avatar Jun 08 '23 18:06 KPostOffice

@KPostOffice Reverse parsing sounds like a great idea!. Where is the custom torchx repo located ? I will see if can create a PR there.

tedhtchang avatar Jun 09 '23 21:06 tedhtchang

Thanks for pointing this out @tedhtchang. I recall we ran into this issue before, and we could not determine a regex that provide stable results, so we added a list of reasonable suffixes.

Likely worth revisiting to see if we can implement a general solution.

In the mean time a quick fix would be just to add .cloud- to our fork of torchx (https://github.com/project-codeflare/torchx/tree/OCP)

MichaelClifford avatar Jun 12 '23 14:06 MichaelClifford