airflow
Adding Pagination functionality
Working to add a pagination feature to detailed DAG logs - WIP
Continues and solves the feature requested in this Issue
This is meant to solve https://github.com/apache/airflow/issues/33625, right?
Interesting, I've never seen that. It does sound related; however, this is something that @RNHTTR implemented, and it seems the API endpoint was created from scratch. I'll talk to him to see whether there are any similarities and whether it is related to this problem.
Yes, I think so, but specifically for task instance logs served from blob storage.
Let me highlight some complexities here.
When reading logs for a running task, the logs could be in various locations:
- locally
- blob storage
- served from triggerer
- served from worker
- read from stdout (k8s exec)
And it could be those things in various combinations e.g. if there was a deferral, or if logs were shipped using the task context logger.
The log reading process reads from all of these sources, interleaves the log lines, and dedupes.
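The interleave-and-dedupe step described above could be sketched roughly as follows. This is a simplified stand-in, not Airflow's actual log reader: it assumes each source yields `(timestamp, line)` pairs and that identical pairs are duplicates.

```python
from datetime import datetime


def interleave_and_dedupe(*sources: list[tuple[datetime, str]]) -> list[str]:
    """Merge log lines from several sources (local files, blob storage,
    triggerer/worker log servers, ...) by timestamp, dropping duplicate
    (timestamp, line) entries that were shipped via more than one path."""
    merged = sorted(
        (entry for source in sources for entry in source),
        key=lambda entry: entry[0],
    )
    seen: set[tuple[datetime, str]] = set()
    result: list[str] = []
    for entry in merged:
        if entry not in seen:
            seen.add(entry)
            result.append(entry[1])
    return result
```

For example, a line present both in the worker's local log and in S3 would appear only once in the merged output.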
@dstandish I believe this would only apply to completed tasks? I don't think running tasks serve logs from blob storage
I don't see anything here that suggests this conditionally enables pagination only when the task is done (though that might be a reasonable thing to consider).
Re the other part of your question: if a task is deferred, the triggerer logs may be read from the log server on the triggerer, while the worker logs from the first phase of execution would already be sitting in S3.
@RNHTTR and I have discussed doing state checks so this API only runs when the task has finished executing, failed, or been deferred (basically, when the logs have terminated).
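That state check could look something like the sketch below. The state names here are hypothetical stand-ins loosely mirroring Airflow's `TaskInstanceState` values, not the actual implementation:

```python
# States in which the log file can no longer grow (hypothetical names,
# loosely mirroring Airflow's TaskInstanceState values).
TERMINATED_LOG_STATES = {"success", "failed", "deferred"}


def pagination_allowed(ti_state: str) -> bool:
    """Gate the paginated log endpoint: only serve pages once the task
    has finished, failed, or been deferred, i.e. the log is stable."""
    return ti_state in TERMINATED_LOG_STATES
```

A running task would fall back to the existing non-paginated read path.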
I would love to get your advice on the workflow for this @uranusjr
What I am thinking right now is creating an interface that abstracts the specifics of each storage service. It would cover each storage type we support (S3, GCS, Azure), so when our API calls TaskLogReader.get_log_content, it wouldn't need to worry about which storage backend it's interacting with; it would be generic. Is that a good idea, or is there a more efficient one? What I currently have is a separate API endpoint that reads only from S3Hook for now. Should we keep this endpoint, or should I try to reduce the number of requests we're making and just pass the size through the response of get_logs? To simplify my question: what's a good approach for obtaining file_size in the frontend?
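The storage-abstraction idea could be sketched roughly like this. All class and method names below are hypothetical illustrations of the proposal, not the actual Airflow API; the in-memory backend stands in for a real S3/GCS/Azure implementation:

```python
from abc import ABC, abstractmethod


class LogStorageBackend(ABC):
    """Hypothetical interface hiding the storage service behind one API,
    so callers never care whether the log lives in S3, GCS, or Azure."""

    @abstractmethod
    def get_log_content(self, path: str, offset: int, limit: int) -> bytes:
        """Return at most `limit` bytes of the log, starting at `offset`."""

    @abstractmethod
    def get_file_size(self, path: str) -> int:
        """Return the total size of the log file in bytes."""


class InMemoryBackend(LogStorageBackend):
    """Stand-in for a real blob-storage implementation, for illustration."""

    def __init__(self, files: dict[str, bytes]):
        self.files = files

    def get_log_content(self, path: str, offset: int, limit: int) -> bytes:
        return self.files[path][offset:offset + limit]

    def get_file_size(self, path: str) -> int:
        return len(self.files[path])
```

A generic reader would then accept any `LogStorageBackend` and stay agnostic of the concrete service.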
My initial thought is to fold get_log_page_number into the get_logs method, which would remove the initial HEAD request. It seems redundant to have both get_logs and a request specifically to obtain the log size when we could just return file_size in the get_logs response. However, I'm not sure whether that's efficient.
Another issue is that there are currently errors in the providers related to the newly added offset/limit parameters. That's something we'll also have to look into.
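Folding the size into the paginated response might look like this minimal sketch. The field names (`content`, `file_size`, `continuation_token`) are assumptions for illustration, not Airflow's actual response schema:

```python
def get_logs(log_bytes: bytes, offset: int, limit: int) -> dict:
    """Return one page of log content together with the total file size,
    so the frontend can compute the page count without a separate
    HEAD-style size request."""
    page = log_bytes[offset:offset + limit]
    next_offset = offset + len(page)
    return {
        "content": page.decode(),
        "file_size": len(log_bytes),
        # None signals the last page; otherwise the offset of the next page.
        "continuation_token": next_offset if next_offset < len(log_bytes) else None,
    }
```

The frontend would request the first page, read `file_size` from that same response, and follow `continuation_token` until it is `None`.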
I'll first fix the other tests that are breaking, i.e. the provider tests (after I added the offset/limit functionality).
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.