ucx
[FEATURE]: Build and display dataset lineage to partition/schedule code migrations more effectively
Is there an existing issue for this?
- [X] I have searched the existing issues
Problem statement
Table mapping does not solve everything: chances are there will still be errors after migration. Since HMS lineage is already available, UCX should aim to merge it with dependencyGraph.
```mermaid
flowchart TD
    storage_path -->|reads| view
    storage_path -->|reads| table
    storage_path -->|reads| notebook
    storage_path -->|reads| py_file
    storage_path -->|reads| redash_query
    storage_path -->|reads| pipeline
    table --> view
    view --> table
    table -->|reads| notebook
    notebook -->|writes| table
    table -->|reads| pipeline
    pipeline -->|writes| table
    table -->|reads| py_file
    py_file -->|writes| table
    table -->|reads| redash_query
    redash_query -->|writes| table
    redash_query --> dashboard
    dashboard --> warehouse
    table -->|reads| lakeview_dashboard
    lakeview_dashboard --> warehouse
    notebook --> pipeline
    pipeline --> job
    notebook --> job
    wheel --> job
    py_file --> job
    py_file --> git_repo
    py_file --> wheel
    notebook --> git_repo
    git_repo -->|?| job
    cluster_policy --> cluster
    cluster_policy --> job
    job --> cluster
    warehouse -.-> cluster
```
Proposed Solution
Merge HMS lineage with dependencyGraph. Although lineage coverage depends on the DBR version, the merge should start with the highest runtime and then backfill anything that is not captured using other means, such as static lineage parsing or a Spark listener.
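A minimal sketch of the "start with one source, then backfill" idea, in Python. The `Edge` type, the `merge_lineage` function, and the `"runtime"` / `"static"` labels are hypothetical names for illustration, not existing UCX APIs; the sketch assumes both sources can be reduced to the same `(source, target, kind)` shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical lineage edge, e.g. table:a -reads-> notebook:n."""
    source: str
    target: str
    kind: str  # "reads" or "writes"


def merge_lineage(runtime_edges, static_edges):
    """Return a dict mapping each edge to the source that first reported it.

    Runtime-captured edges are processed first, so statically parsed
    edges only backfill what the runtime lineage did not capture.
    """
    merged = {}
    for origin, edges in (("runtime", runtime_edges), ("static", static_edges)):
        for edge in edges:
            merged.setdefault(edge, origin)  # first origin wins
    return merged
```

Recording the origin of each edge would make it possible to see how much of the graph had to be backfilled.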
scope:
| asset | has_owner_user | listing speed | AST analysis required |
|---|---|---|---|
| storage path | no | slow (via AST analysis) | yes |
| view | no | medium, via tables.scala | yes |
| table | no | medium, via tables.scala | no |
| pipeline | yes | fast | ** yes |
| notebook | yes | slow, via workflow linter | yes |
| wheel | no | slow, via linter | yes |
| job | yes | fast | no |
| cluster | yes | fast | no |
| cluster_policy | yes | fast | no |
| git_repo | no | - | no |
| py_file | no | slow, via workflow linter | yes |
| redash query | yes | medium | yes |
| redash dashboard | yes | medium | no |
| lakeview dashboard | yes | fast | yes |
| warehouse | yes | fast | no |
Optionally, we can create multiple copies of the same graph with starting points from a single table to show full migration scope.
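One way to derive such a per-table copy could be a breadth-first walk over downstream edges. This is only a sketch: `migration_scope` and the `type:name` node identifiers are hypothetical, and edges are assumed to be plain `(upstream, downstream)` pairs.

```python
from collections import deque


def migration_scope(edges, start):
    """Return every asset reachable downstream from `start`, including `start`."""
    adjacency = {}
    for upstream, downstream in edges:
        adjacency.setdefault(upstream, set()).add(downstream)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:  # the seen-set also guards against cycles
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

The walk tolerates cycles such as `table --> view --> table`, which appear in the flowchart above at the asset-type level.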
Outcome:
- show dashboard with `Asset (link)`, `Asset type`, `Owner`, `Failures` columns, sorted by their required order. Filter on owner.
- topologically sort all assets in order to schedule migration
- (optionally) process graph to group islands of dependencies in order to parallelise migration or find opportunities
- new table: lineage
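The ordering and grouping outcomes could be sketched with the standard library alone. `migration_order` and `islands` are hypothetical helper names; the sketch assumes edges are `(upstream, downstream)` pairs and that upstream assets are migrated first, which is an assumption rather than something this issue specifies.

```python
from graphlib import CycleError, TopologicalSorter


def migration_order(edges):
    """Topologically sort assets so that upstream assets come first.

    Raises graphlib.CycleError if the asset-level graph has a cycle.
    """
    predecessors = {}
    for upstream, downstream in edges:
        predecessors.setdefault(upstream, set())
        predecessors.setdefault(downstream, set()).add(upstream)
    return list(TopologicalSorter(predecessors).static_order())


def islands(edges):
    """Group assets into weakly connected components (independent islands)."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen, groups = set(), []
    for start in neighbours:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

Each island could then be sorted and migrated independently of the others. Because a read/write pair on the same table forms a cycle, a real implementation would need to decide how to break or collapse cycles before sorting; the sketch simply lets `CycleError` propagate.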
Additional Context
No response