airflow-client-python

TaskInstances: "get instance batch" times out after 1 minute with status code 504 (~1000 records)

Open litalkat opened this issue 1 year ago • 4 comments

Is there a way to increase this timeout? Does pulling data from other Airflow instances affect it? I tried using _request_timeout=180, but it didn't help.

litalkat avatar Sep 22 '23 06:09 litalkat

Hey @litalkat, we're likely going to need a little more info for this. Is there any stack trace with the error or any indication of a line number we can use to track it down?

ferruzzi avatar Sep 22 '23 14:09 ferruzzi

@ferruzzi Sure. The request is for a specific Airflow deployment. In general, the deployment runs ~100K Airflow tasks per day, and this process has already run at least 5000 times in the last 3 months. I had already encountered this "504 Service Exception" response when I made an API call for a long date range, but I saw that it could handle 7000-9000 records in the response. In the last two days I have been getting this 504 exception for almost any call (even for a range of 3 minutes -> ~100 tasks in the response).

It's important to mention that I am monitoring a lot of other Airflow instances at the same time, but I didn't see anything about a rate limit.

The exception is returned EXACTLY 1 minute after the API call is sent. The traceback looks like:
(504) Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'x-powered-by': 'Express', .....
HTTP response body: Error accured while trying to proxy: some private env name/api/v1/dags/~/dagRuns/~/taskInstances/list

The request uses airflow_client.get_task_instances_batch(ListTaskInstancesForm(...),_check_return_type=False)

litalkat avatar Sep 23 '23 14:09 litalkat

@ferruzzi Any updates, or ideas on how to solve it?

Yaadto avatar Sep 27 '23 12:09 Yaadto

This is due to the webserver timeout. More info here: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#web-server-master-timeout.

Maybe we need to improve the performance of this particular endpoint, but if you have a really huge task instance table, this is kind of expected, I believe.
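
If raising the timeout is not an option, one client-side workaround that might help is to split a long date range into smaller windows and issue one batch request per window. Below is a minimal sketch of that idea, not something confirmed in this thread: `split_range` and `fetch_in_windows` are hypothetical helper names, and `fetch` stands for any callable you write that wraps `get_task_instances_batch` with a form filtered to the given window. It may not help if even very small ranges time out.

```python
from datetime import datetime, timedelta
from typing import Any, Callable, Iterator, List, Tuple


def split_range(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        # The last window is clipped so it never runs past `end`.
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


def fetch_in_windows(
    fetch: Callable[[datetime, datetime], List[Any]],
    start: datetime,
    end: datetime,
    step: timedelta,
) -> List[Any]:
    """Call `fetch` once per window and concatenate the results.

    `fetch` is a hypothetical wrapper around the real API call. If its
    filters are inclusive on both ends, records that fall exactly on a
    window boundary may be returned twice and need de-duplication.
    """
    results: List[Any] = []
    for window_start, window_end in split_range(start, end, step):
        results.extend(fetch(window_start, window_end))
    return results
```

Each request then touches fewer rows, which may keep it under the one-minute limit, at the cost of more round trips.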

pierrejeambrun avatar Oct 08 '23 20:10 pierrejeambrun