
[Metaflow UI] stdout and stderr logs timeout/fail to load

Open martinbattentive opened this issue 8 months ago • 3 comments

In the Metaflow UI, the stdout/stderr panes no longer load: the requests that fetch them fail with a 504 Gateway Timeout.

[Screenshot, 2023-10-17 at 2:36:28 PM]

Example url being requested by UI for stderr logs: /api/flows/<flow_name>/runs/59510/steps/start/tasks/539228/logs/err?attempt_id=0&_limit=500&_page=1&_order=-row

I believe the issue is caused by a very expensive join query in get_task_by_request(self, request) in ui_backend_service/api/log.py. Looking at the code, this function call and its underlying join query seem unnecessary: the UI already passes every parameter needed to uniquely identify the task in the Task table directly, including the attempt.

martinbattentive, Oct 17 '23 21:10
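
For illustration, here is a minimal sketch of the identifiers carried by the log URL quoted in the report above. It is not code from metaflow-service: the function name parse_log_request, the returned key names, and the flow name HelloFlow are all hypothetical, and it only shows that the path and query string together name the flow, run, step, task, stream, and attempt.

```python
from urllib.parse import urlsplit, parse_qs


def parse_log_request(url):
    """Extract task identifiers from a UI log URL (hypothetical helper).

    Returns a dict of identifiers, or None if the URL does not have the
    shape /api/flows/<flow>/runs/<run>/steps/<step>/tasks/<task>/logs/<stream>
    or lacks an attempt_id query parameter.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Odd positions hold the fixed keywords, even positions the values.
    if (
        len(segments) != 11
        or segments[0] != "api"
        or segments[1::2] != ["flows", "runs", "steps", "tasks", "logs"]
    ):
        return None
    query = parse_qs(parts.query)
    if "attempt_id" not in query:
        return None
    return {
        "flow_id": segments[2],
        "run_number": segments[4],
        "step_name": segments[6],
        "task_id": segments[8],
        "stream": segments[10],  # "out" or "err"
        "attempt_id": int(query["attempt_id"][0]),
    }


example = (
    "/api/flows/HelloFlow/runs/59510/steps/start/tasks/539228/logs/err"
    "?attempt_id=0&_limit=500&_page=1&_order=-row"
)
print(parse_log_request(example))
```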

@saikonen Draft PR https://github.com/Netflix/metaflow-service/pull/394/files

martinbattentive, Oct 20 '23 00:10

@saikonen @savingoyal My initial belief was incorrect. The problem was actually caused by the log CacheAsyncClient and/or CacheAsyncServer getting into a bad state in which it would fetch the logs internally but never return them, so the list of pending streams kept growing. Restarting the ui_backend service resolved the issue.

martinbattentive, Oct 25 '23 22:10
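
A generic way to keep a fetch that never returns from hanging a request until the gateway gives up is to bound the wait. The sketch below uses only the standard asyncio library; it is not the metaflow-service cache client, and fetch_logs_with_timeout and its parameters are hypothetical names chosen for the example.

```python
import asyncio


async def fetch_logs_with_timeout(fetch, timeout_s):
    """Await fetch() but give up after timeout_s seconds (hypothetical helper).

    fetch is any zero-argument coroutine function standing in for a cache
    lookup. Returns its result, or None if it did not finish in time.
    """
    try:
        # wait_for cancels the inner awaitable when the timeout expires,
        # so a stuck fetch does not linger as a pending task.
        return await asyncio.wait_for(fetch(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def stuck_fetch():
    # Simulates a client that never returns: waits on an event nobody sets.
    await asyncio.Event().wait()
    return ["unreachable"]


async def healthy_fetch():
    return ["log line 1", "log line 2"]


if __name__ == "__main__":
    print(asyncio.run(fetch_logs_with_timeout(stuck_fetch, 0.1)))
    print(asyncio.run(fetch_logs_with_timeout(healthy_fetch, 0.1)))
```

With a bound like this the caller can return an explicit error instead of waiting for a 504, though it would not by itself repair whatever put the cache into the bad state.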