airflow
Adding Pagination functionality
Working to add a pagination feature to detailed DAG logs - WIP
Continues and solves the feature requested in this Issue
This is meant to solve https://github.com/apache/airflow/issues/33625, right?
Interesting, I've never seen that. It does sound related; however, this is something that @RNHTTR implemented, and it seems the API endpoint was created from scratch. I'll talk to him to see whether there are any similarities and whether it is related to this problem.
Yes, I think so, but specifically for task instance logs served from blob storage.
Let me highlight some complexities here.
When reading logs for a running task, the logs could be in various locations:
- locally
- blob storage
- served from triggerer
- served from worker
- read from stdout (k8s exec)
And it could be those things in various combinations e.g. if there was a deferral, or if logs were shipped using the task context logger.
The log reading process reads from all of these sources, interleaves the log lines, and dedupes.
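The interleave-and-dedupe step described above could be sketched roughly as follows. This is a simplified stand-in, not Airflow's actual log reader: it assumes each source yields `(timestamp, line)` pairs and that identical pairs are duplicates.

```python
from datetime import datetime


def interleave_and_dedupe(*sources: list[tuple[datetime, str]]) -> list[str]:
    """Merge log lines from several sources (local files, blob storage,
    triggerer/worker log servers, ...) by timestamp, dropping duplicate
    (timestamp, line) entries that were shipped via more than one path."""
    merged = sorted(
        (entry for source in sources for entry in source),
        key=lambda entry: entry[0],
    )
    seen: set[tuple[datetime, str]] = set()
    result: list[str] = []
    for entry in merged:
        if entry not in seen:
            seen.add(entry)
            result.append(entry[1])
    return result
```

For example, a line present both in the worker's local log and in S3 would appear only once in the merged output.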
@dstandish I believe this would only apply to completed tasks? I don't think running tasks serve logs from blob storage
I don't see anything here that suggests this conditionally enables pagination only when the task is done (though that might be a reasonable thing to consider).
Re the other part of your question: if a task is deferred, the triggerer logs may be read from the log server on the triggerer, while the worker logs from the first phase of execution would already be sitting in S3.
@RNHTTR and I have discussed doing state checks so this API only runs when the task has finished executing, failed, or been deferred (basically, when the logs have terminated).
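That state check could look something like the sketch below. The state names here are hypothetical stand-ins loosely mirroring Airflow's `TaskInstanceState` values, not the actual implementation:

```python
# States in which the log file can no longer grow (hypothetical names,
# loosely mirroring Airflow's TaskInstanceState values).
TERMINATED_LOG_STATES = {"success", "failed", "deferred"}


def pagination_allowed(ti_state: str) -> bool:
    """Gate the paginated log endpoint: only serve pages once the task
    has finished, failed, or been deferred, i.e. the log is stable."""
    return ti_state in TERMINATED_LOG_STATES
```

A running task would fall back to the existing non-paginated read path.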
I would love to get your advice on the workflow for this @uranusjr
What I am thinking right now is creating an interface that abstracts the specifics of each storage service. It would cover each storage type we support (S3, GCS, Azure), so when our API calls TaskLogReader.get_log_content, it wouldn't need to worry about which storage backend it's interacting with; it would be generic. Is that a good idea, or is there a more efficient one? What I currently have is a separate API endpoint that reads only from S3Hook for now. Should we keep this endpoint, or should I try to reduce the number of requests we're making and just pass the size through the response of get_logs? To simplify my question: what's a good approach for obtaining file_size in the frontend?
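The storage-abstraction idea could be sketched roughly like this. All class and method names below are hypothetical illustrations of the proposal, not the actual Airflow API; the in-memory backend stands in for a real S3/GCS/Azure implementation:

```python
from abc import ABC, abstractmethod


class LogStorageBackend(ABC):
    """Hypothetical interface hiding the storage service behind one API,
    so callers never care whether the log lives in S3, GCS, or Azure."""

    @abstractmethod
    def get_log_content(self, path: str, offset: int, limit: int) -> bytes:
        """Return at most `limit` bytes of the log, starting at `offset`."""

    @abstractmethod
    def get_file_size(self, path: str) -> int:
        """Return the total size of the log file in bytes."""


class InMemoryBackend(LogStorageBackend):
    """Stand-in for a real blob-storage implementation, for illustration."""

    def __init__(self, files: dict[str, bytes]):
        self.files = files

    def get_log_content(self, path: str, offset: int, limit: int) -> bytes:
        return self.files[path][offset:offset + limit]

    def get_file_size(self, path: str) -> int:
        return len(self.files[path])
```

A generic reader would then accept any `LogStorageBackend` and stay agnostic of the concrete service.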
My initial thought is to fold get_log_page_number into the get_logs method, which would remove the initial HEAD request. It seems redundant to have both get_logs and a request specifically to obtain the log size when we could just return file_size in the get_logs response. However, I'm not sure whether that's efficient.
Another issue is that there are currently errors in the providers related to the newly added offset/limit parameters. That's something we'll also have to look into.
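Folding the size into the paginated response might look like this minimal sketch. The field names (`content`, `file_size`, `continuation_token`) are assumptions for illustration, not Airflow's actual response schema:

```python
def get_logs(log_bytes: bytes, offset: int, limit: int) -> dict:
    """Return one page of log content together with the total file size,
    so the frontend can compute the page count without a separate
    HEAD-style size request."""
    page = log_bytes[offset:offset + limit]
    next_offset = offset + len(page)
    return {
        "content": page.decode(),
        "file_size": len(log_bytes),
        # None signals the last page; otherwise the offset of the next page.
        "continuation_token": next_offset if next_offset < len(log_bytes) else None,
    }
```

The frontend would request the first page, read `file_size` from that same response, and follow `continuation_token` until it is `None`.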
I'll first fix the other tests that are breaking, i.e. the provider tests (after I added the offset/limit functionality).
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.