codeflare-sdk
codeflare-sdk copied to clipboard
job.status and job.log returns error for certain domain name pattern
The job.status() and job.logs() Are failing, they're showing
gaierror Traceback (most recent call last)
File /opt/app-root/lib64/python3.8/site-packages/urllib3/connection.py:174, in HTTPConnection._new_conn(self)
173 try:
--> 174 conn = connection.create_connection(
175 (self._dns_host, self.port), self.timeout, **extra_kw
176 )
178 except SocketTimeout:
....
ConnectionError: Failed to connect to Ray at address: http://ray/.
Cause:
In torchx The app_handle is a string. On my cluster it looks like
ray://torchx/ray-dashboard-mnisttest-default.tedchang-codeflare-test-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.us-south.containers.appdomain.cloud-mnisttest-trwmgjb2s041jc
This method uses a regx that not parse the app_id(mnisttest-trwmgjb2s041jc) and ray dashboard(ray-dashboard-mnisttest-default.tedchang-codeflare-test-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.us-south.containers.appdomain.cloud) property
":\d+-|.com-|.org-|.net-|.gov-|.io-"
/opt/app-root/lib64/python3.8/site-packages/torchx/schedulers/ray_scheduler.py
330 def _parse_app_id(self, app_id: str) -> Tuple[str, str]:
331 # find index of '-' in the first :\d+-
332 m = re.search(r":\d+-|.com-|.org-|.net-|.gov-|.io-" , app_id)
333 if m:
334 sep = m.span()[1]
335 addr = app_id[: sep-1]
336 app_id = app_id[sep:]
337 return addr, app_id
338 ~~~~~~~~~~~~~~~~
339 addr, _, app_id = app_id.partition("-")
340 return addr, app_id
something like this worked for my cluster but may not be robust
":\d+-|.com-|.org-|.net-|.gov-|.io-|.cloud-"
@KPostOffice @jbusche @MichaelClifford Our _parse_app_id looks different from the one in the upstream. Is it a fork which we can make it work for now ?
The one upstream was more broken than the one in our custom version. We can definitely make changes, I'm not sure how to make it more generic exactly. Maybe something including an reverse parse for . since I think the name cannot have periods in it, so the from last period to the - must indicate the top level domain.
@KPostOffice Reverse parsing sounds like a great idea!. Where is the custom torchx repo located ? I will see if can create a PR there.
Thanks for pointing this out @tedhtchang. I recall we ran into this issue before, and we could not determine a regex that provide stable results, so we added a list of reasonable suffixes.
Likely worth revisiting to see if we can implement a general solution.
In the mean time a quick fix would be just to add .cloud- to our fork of torchx (https://github.com/project-codeflare/torchx/tree/OCP)