Dataset List View

Open blag opened this issue 2 years ago • 2 comments

This PR fleshes out the dataset list view, displaying the last update datetime of each dataset.

The endpoint can also sort by last_dataset_update (latest first or oldest first) or by dataset URI (ascending or descending). By default, datasets that have never been updated are listed first, followed by the rest in descending order of update time (i.e., nulls first, then latest first). I put the nulls at the top because someone investigating why a DAG is not being triggered by a dataset will likely want never-updated datasets to be as easy to find as possible. If this is a contentious choice, I'm happy to revert it.
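A minimal sketch of the ordering described above, in plain Python rather than the actual query code. The `DatasetRow` type, the `order_datasets` helper, and the `-` prefix for descending order are all illustrative assumptions, as is keeping never-updated datasets first when sorting oldest first (the description only specifies the default).

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class DatasetRow(NamedTuple):
    """Hypothetical stand-in for one row of the dataset list."""
    uri: str
    last_dataset_update: Optional[datetime]


def order_datasets(rows: List[DatasetRow],
                   order_by: str = "-last_dataset_update") -> List[DatasetRow]:
    """Order rows by URI or last update; a leading '-' means descending."""
    descending = order_by.startswith("-")
    field = order_by.lstrip("-")

    if field == "uri":
        return sorted(rows, key=lambda r: r.uri, reverse=descending)
    if field != "last_dataset_update":
        raise ValueError(f"unsupported order_by: {order_by!r}")

    # Nulls first: datasets with no update events go to the top.
    # Assumption: this holds for both sort directions.
    never_updated = [r for r in rows if r.last_dataset_update is None]
    updated = sorted(
        (r for r in rows if r.last_dataset_update is not None),
        key=lambda r: r.last_dataset_update,
        reverse=descending,
    )
    return never_updated + updated
```

With the default argument, a never-updated dataset sorts ahead of every dataset that has events, and the remaining datasets follow from most to least recently updated.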

@jedcunningham did almost all of the work on this; I just added some final touches.

Before

dataset_list_before

After

dataset_list_after



blag avatar Sep 12 '22 23:09 blag

Could you include before/after screenshots in the PR desc please?

ashb avatar Sep 13 '22 18:09 ashb

Updated description with screenshots and giving credit where credit is due.

blag avatar Sep 13 '22 21:09 blag

Sorting is looking good!

Although there seems to be an issue with the datasets list endpoint. For some reason, for one of my datasets the total update count is way off. It says there are 288 events, but there are only 24 associated dag runs and 24 events returned from the dataset events API:

Screen Shot 2022-09-22 at 8 52 18 AM Screen Shot 2022-09-22 at 8 52 34 AM

bbovenzi avatar Sep 22 '22 13:09 bbovenzi

Actually, I don't think the total updates count is even useful. I'd say let's just remove it entirely and have this PR just be about adding the last-updated-at info. Or put it into draft until I talk to some users and come up with a better solution.

bbovenzi avatar Sep 23 '22 16:09 bbovenzi

@bbovenzi I fixed the issues with the aggregate counts (on PostgreSQL) and expanded the tests a little bit to include more DAG <-> dataset relationships to exercise those SQL paths. Still need to fix things on...the rest of the supported database backends.
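For readers following along: the thread does not state what caused the 288-versus-24 mismatch, but an inflated aggregate like that is what a join fan-out typically produces. The sketch below reproduces the effect with SQLite, assuming simplified, hypothetical table names and a hypothetical 12 consuming DAGs (chosen only because 24 x 12 = 288); it is not the actual Airflow query or schema.

```python
import sqlite3


def count_events(distinct: bool) -> int:
    """Count one dataset's events while also joining its consuming DAGs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset (id INTEGER PRIMARY KEY, uri TEXT);
        CREATE TABLE dataset_event (id INTEGER PRIMARY KEY, dataset_id INTEGER);
        CREATE TABLE dag_ref (dataset_id INTEGER, dag_id TEXT);
        """
    )
    conn.execute("INSERT INTO dataset VALUES (1, 's3://bucket/key')")
    # 24 events for the dataset, matching the number reported in the thread.
    conn.executemany(
        "INSERT INTO dataset_event (dataset_id) VALUES (?)", [(1,)] * 24
    )
    # 12 consuming DAGs (hypothetical figure).
    conn.executemany(
        "INSERT INTO dag_ref VALUES (1, ?)", [(f"dag_{i}",) for i in range(12)]
    )

    # Each event row is repeated once per dag_ref row, so a plain COUNT
    # multiplies the two; COUNT(DISTINCT ...) counts each event only once.
    aggregate = "COUNT(DISTINCT e.id)" if distinct else "COUNT(e.id)"
    row = conn.execute(
        f"""
        SELECT {aggregate}
        FROM dataset d
        LEFT JOIN dataset_event e ON e.dataset_id = d.id
        LEFT JOIN dag_ref r ON r.dataset_id = d.id
        WHERE d.id = 1
        """
    ).fetchone()
    conn.close()
    return row[0]
```

Counting distinct event IDs is one way to avoid the multiplication; aggregating events in a subquery before joining the DAG relationships is another.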

blag avatar Sep 24 '22 01:09 blag

I changed last updated to its own column that the user can sort by.

Screen Shot 2022-09-27 at 9 37 00 AM

bbovenzi avatar Sep 27 '22 13:09 bbovenzi

I haven't had a chance to move to 2.4.0 yet. Is there a way to filter datasets by tags, potentially? I normally tag different dags with the project/client name. It would be cool to filter for a single "project" (consisting of multiple DAGs) and see all of the datasets and how they interconnect.

ldacey avatar Oct 04 '22 00:10 ldacey

> I haven't had a chance to move to 2.4.0 yet. Is there a way to filter datasets by tags, potentially?

No, we haven't implemented tagging datasets yet. That's a good idea though.

> see all of the datasets and how they interconnect

This page does have a good dataset/dag dependency visualization, so definitely check that out once you upgrade to 2.4!

blag avatar Oct 04 '22 00:10 blag

> I haven't had a chance to move to 2.4.0 yet. Is there a way to filter datasets by tags, potentially? I normally tag different dags with the project/client name.

After this PR merges, I plan to add more searching and filtering. Adding tags could be a good idea.

> It would be cool to filter for a single "project" (consisting of multiple DAGs) and see all of the datasets and how they interconnect.

Yes, that is also on my to-do list. Right now the page can get too busy very quickly.

bbovenzi avatar Oct 04 '22 13:10 bbovenzi

Nice, that would be great. A lot of our clients are completely distinct - if I am showing a demo or screenshot for "Client A", I do not want to show "Client B" at all, and I might want to see how all datasets interconnect for "Client A". There might be other approaches, but I have used tags to filter my DAGs UI view historically.

ldacey avatar Oct 05 '22 16:10 ldacey