airflow
Dataset List View
This PR fleshes out the dataset list view, displaying the last update datetime of each dataset.
The endpoint can also sort by last_dataset_update (latest first or oldest first) or by the dataset URI (ascending or descending). By default, datasets without any updates are displayed first, followed by the remaining datasets from most to least recently updated (i.e., nulls first, then descending by update time). I pushed the nulls to the top because I think that is more useful for people who are investigating why a DAG is not being triggered by a dataset: datasets that have never been updated should be as easy to find as possible. However, if this is a contentious choice, I'm happy to revert that change.
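To make the default ordering concrete, here is a minimal pure-Python sketch of "nulls first, then descending by update time". It is not the code in this PR: `DatasetRow`, `sort_datasets`, and the leading `-` for descending order are assumptions made for illustration, and keeping never-updated datasets first in both sort directions is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class DatasetRow:
    # Hypothetical stand-in for one entry in the dataset list response.
    uri: str
    last_dataset_update: Optional[datetime]


def sort_datasets(rows: List[DatasetRow], order_by: str = "-last_dataset_update") -> List[DatasetRow]:
    """Sort rows by field name; a leading '-' (assumed convention) means descending."""
    descending = order_by.startswith("-")
    field = order_by[1:] if descending else order_by

    if field == "uri":
        return sorted(rows, key=lambda r: r.uri, reverse=descending)
    if field != "last_dataset_update":
        raise ValueError(f"unsupported sort field: {field!r}")

    # Assumption: datasets that were never updated (None) always come first.
    never_updated = [r for r in rows if r.last_dataset_update is None]
    updated = sorted(
        (r for r in rows if r.last_dataset_update is not None),
        key=lambda r: r.last_dataset_update,
        reverse=descending,
    )
    return never_updated + updated


rows = [
    DatasetRow("b", datetime(2022, 9, 1)),
    DatasetRow("a", None),
    DatasetRow("c", datetime(2022, 9, 3)),
]
default_order = [r.uri for r in sort_datasets(rows)]
```

With the default argument, the never-updated dataset `a` is listed first, followed by `c` and then `b`.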
@jedcunningham did almost all of the work on this; I just added some final touches.
Before
After
Could you include before/after screenshots in the PR desc please?
Updated description with screenshots and giving credit where credit is due.
Sorting is looking good!
Although there seems to be an issue with the datasets list endpoint. For some reason, for one of my datasets the total update count is way off. It says there are 288 events, but there are only 24 associated dag runs and 24 events returned from the dataset events API:


Actually, I don't think the total updates count is even useful. I'd say let's just remove it entirely and have this PR just be about adding the last-updated-at info. Or put it into draft until I talk to some users and come up with a better solution.
@bbovenzi I fixed the issues with the aggregate counts (on PostgreSQL) and expanded the tests a little bit to include more DAG <-> dataset relationships to exercise those SQL paths. Still need to fix things on...the rest of the supported database backends.
I changed last updated to its own column that the user can sort by.
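The thread does not say what caused the inflated total. One common way an aggregate count can come out too high is join fan-out: counting events while also joining a second one-to-many relation multiplies the rows. The sketch below shows that effect with the standard-library `sqlite3` module. The table names and the three consuming DAGs are hypothetical, and this is not presented as the actual bug or the fix in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema; not Airflow's real tables.
conn.executescript(
    """
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, uri TEXT);
    CREATE TABLE dataset_event (id INTEGER PRIMARY KEY, dataset_id INTEGER);
    CREATE TABLE dataset_consumer (dataset_id INTEGER, dag_id TEXT);
    """
)
conn.execute("INSERT INTO dataset VALUES (1, 's3://bucket/example')")
# 24 events for the dataset, and 3 DAGs that consume it.
conn.executemany("INSERT INTO dataset_event (dataset_id) VALUES (?)", [(1,)] * 24)
conn.executemany(
    "INSERT INTO dataset_consumer VALUES (1, ?)",
    [(f"dag_{i}",) for i in range(3)],
)

# Joining both one-to-many tables yields 24 * 3 rows, so a plain COUNT is inflated.
naive_count = conn.execute(
    """
    SELECT COUNT(e.id)
    FROM dataset d
    LEFT JOIN dataset_event e ON e.dataset_id = d.id
    LEFT JOIN dataset_consumer c ON c.dataset_id = d.id
    GROUP BY d.id
    """
).fetchone()[0]

# Counting distinct event ids removes the duplicates introduced by the second join.
distinct_count = conn.execute(
    """
    SELECT COUNT(DISTINCT e.id)
    FROM dataset d
    LEFT JOIN dataset_event e ON e.dataset_id = d.id
    LEFT JOIN dataset_consumer c ON c.dataset_id = d.id
    GROUP BY d.id
    """
).fetchone()[0]

print(naive_count, distinct_count)
```

Tests that cover datasets with several related DAGs, like the ones mentioned above, are the kind that would surface this sort of multiplication.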

I haven't had a chance to move to 2.4.0 yet. Is there a way to filter datasets by tags, potentially? I normally tag different dags with the project/client name. It would be cool to filter for a single "project" (consisting of multiple DAGs) and see all of the datasets and how they interconnect.
> I haven't had a chance to move to 2.4.0 yet. Is there a way to filter datasets by tags, potentially?

No, we haven't implemented tagging datasets yet. That's a good idea though.

> see all of the datasets and how they interconnect

This page does have a good dataset/dag dependency visualization, so definitely check that out once you upgrade to 2.4!
> I haven't had a chance to move to 2.4.0 yet. Is there a way to filter datasets by tags, potentially? I normally tag different dags with the project/client name.

After this PR merges, I plan to add more searching and filtering. Adding tags could be a good idea.

> It would be cool to filter for a single "project" (consisting of multiple DAGs) and see all of the datasets and how they interconnect.

Yes, that is also on my to-do list. Right now the page can get too busy very quickly.
Nice, that would be great. A lot of our clients are completely distinct - if I am showing a demo or screenshot for "Client A", I do not want to show "Client B" at all, and I might want to see how all datasets interconnect for "Client A". There might be other approaches, but I have used tags to filter my DAGs UI view historically.