metaflow-service
[Metaflow UI] stdout and stderr logs time out or fail to load
In the Metaflow UI, the stdout/stderr panes no longer load; the requests for them return a 504 Gateway Timeout.
Example URL requested by the UI for stderr logs: /api/flows/<flow_name>/runs/59510/steps/start/tasks/539228/logs/err?attempt_id=0&_limit=500&_page=1&_order=-row
I believe the issue is caused by a very expensive join query in `async def get_task_by_request(self, request):` in ui_backend_service/api/log.py. Looking at the code, this function call and its underlying join query seem unnecessary, since the UI already passes every parameter needed to uniquely identify the task in the Task table directly, including the attempt.
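To illustrate what the request already carries, here is a minimal sketch that pulls the task identifiers out of a log URL shaped like the example above. It is not code from metaflow-service: the function name `task_key_from_url`, the key names (`flow_id`, `run_number`, and so on), and the assumption that stdout uses a matching `/logs/out` path are all hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Pattern modeled on the example URL in this issue. The "out" alternative
# is an assumption; only the "err" path appears in the report.
LOG_PATH = re.compile(
    r"^/api/flows/(?P<flow_id>[^/]+)/runs/(?P<run_number>[^/]+)"
    r"/steps/(?P<step_name>[^/]+)/tasks/(?P<task_id>[^/]+)"
    r"/logs/(?P<stream>out|err)$"
)


def task_key_from_url(url):
    """Return the identifiers found in a log request URL (hypothetical helper)."""
    parts = urlsplit(url)
    match = LOG_PATH.match(parts.path)
    if match is None:
        raise ValueError("not a task log URL: " + url)
    key = match.groupdict()
    # attempt_id arrives as a query parameter rather than a path segment.
    attempt = parse_qs(parts.query).get("attempt_id", [None])[0]
    key["attempt_id"] = int(attempt) if attempt is not None else None
    return key


if __name__ == "__main__":
    print(task_key_from_url(
        "/api/flows/MyFlow/runs/59510/steps/start/tasks/539228/logs/err"
        "?attempt_id=0&_limit=500&_page=1&_order=-row"
    ))
```

Flow, run, step, task, and attempt are all present in the request, which is the basis for the suggestion that a direct lookup might suffice.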
@saikonen Draft PR https://github.com/Netflix/metaflow-service/pull/394/files
@saikonen @savingoyal My initial belief was incorrect. The problem was actually caused by the log CacheAsyncClient and/or CacheAsyncServer getting into a bad state in which it fetched the logs internally but never returned them, so the list of pending streams kept growing. Restarting the ui_backend service resolved the issue.
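For anyone hitting the same symptom, one generic way to keep a stuck fetch from hanging until the gateway gives up is to bound the wait on the caller's side. The sketch below uses only `asyncio.wait_for` from the standard library; `fetch_with_timeout`, `stuck_fetch`, and `healthy_fetch` are hypothetical stand-ins and do not reflect the actual CacheAsyncClient interface.

```python
import asyncio


async def fetch_with_timeout(fetch, timeout_s):
    """Await fetch(), returning None if it does not finish in timeout_s seconds."""
    try:
        return await asyncio.wait_for(fetch(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The pending call is cancelled by wait_for, so it does not pile up.
        return None


async def stuck_fetch():
    # Hypothetical stand-in for a cache call that never returns.
    await asyncio.Event().wait()


async def healthy_fetch():
    # Hypothetical stand-in for a cache call that returns log lines.
    return ["line 1", "line 2"]


if __name__ == "__main__":
    print(asyncio.run(fetch_with_timeout(stuck_fetch, 0.05)))
    print(asyncio.run(fetch_with_timeout(healthy_fetch, 0.05)))
```

A timeout like this would only surface the bad state sooner as an explicit error; it would not fix whatever put the cache client or server into that state.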