Prototype Lineage Analysis Tooling
Is your feature request related to a problem? Please describe. Currently, when given a Hamilton DAG, we don't expose ways to ask questions about it.
E.g. for GDPR, data provenance, etc.
Example questions:
- What if I remove this input, what function(s) will I impact?
- What uses some PII data and what is the surface area?
- If someone requests to be forgotten, what data do I need to delete?
- Who should I talk to when I want to make a change that impacts these functions? (e.g. use git blame to surface the function owner?)
- What has changed about the DAG between these two commits?
- Are there any cycles?
- Are there clusters of disjoint nodes? If so, what are they, and can I delete them?
- etc
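To make a few of these questions concrete, here is a minimal sketch in plain Python. It is not Hamilton's API: the `DEPENDENCIES` dict, the node names, and the helpers `downstream`, `has_cycle`, and `clusters` are all hypothetical, and the graph is assumed to be a mapping from each node to the nodes it depends on, with every node present as a key.

```python
from collections import deque

# Hypothetical toy DAG: node -> list of upstream nodes it depends on.
# Assumption: every node appears as a key, even if it has no dependencies.
DEPENDENCIES = {
    "raw_email": [],
    "raw_spend": [],
    "email_domain": ["raw_email"],
    "spend_zscore": ["raw_spend"],
    "features": ["email_domain", "spend_zscore"],
    "orphan_a": [],
    "orphan_b": ["orphan_a"],
}


def invert(deps):
    """Map each node to the nodes that directly consume it."""
    consumers = {node: [] for node in deps}
    for node, parents in deps.items():
        for parent in parents:
            consumers[parent].append(node)
    return consumers


def downstream(deps, start):
    """'If I remove this input, what will I impact?' (breadth-first walk)."""
    consumers = invert(deps)
    seen, queue = set(), deque([start])
    while queue:
        for child in consumers[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def has_cycle(deps):
    """'Are there any cycles?' via Kahn's algorithm: a cycle exists
    if some nodes can never be placed in a topological order."""
    consumers = invert(deps)
    remaining = {node: len(parents) for node, parents in deps.items()}
    ready = deque(node for node, count in remaining.items() if count == 0)
    placed = 0
    while ready:
        node = ready.popleft()
        placed += 1
        for child in consumers[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return placed != len(deps)


def clusters(deps):
    """'Are there clusters of disjoint nodes?' Returns the weakly
    connected components (edge direction is ignored)."""
    consumers = invert(deps)
    unvisited, result = set(deps), []
    while unvisited:
        stack, component = [unvisited.pop()], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in deps[node] + consumers[node]:
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    stack.append(neighbor)
        result.append(component)
    return result
```

With the toy graph, `downstream(DEPENDENCIES, "raw_email")` gives the nodes touched by removing that input, and `clusters(DEPENDENCIES)` separates the `orphan_*` nodes from the main pipeline. A real implementation would presumably read the graph from the driver rather than a hand-written dict.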
Describe the solution you'd like This could be a specific "driver class", or something added to the base driver.
Without an end user workflow in mind, it's a bit hard to specify the API.
Also, perhaps this would work well with #4 -- e.g. tagging what is PII, and what isn't?
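As a rough illustration of how tagging could combine with lineage, the sketch below computes the "PII surface area" of a graph. Everything in it is an assumption for discussion: the `TAGS` mapping, the `"pii"` tag name, and the `tagged_surface` helper are hypothetical and do not reflect any existing Hamilton API or the design in #4.

```python
# Hypothetical toy DAG: node -> list of upstream nodes it depends on.
DEPENDENCIES = {
    "raw_email": [],
    "raw_spend": [],
    "email_domain": ["raw_email"],
    "spend_zscore": ["raw_spend"],
    "features": ["email_domain", "spend_zscore"],
}

# Hypothetical tags attached to nodes; only source nodes are tagged by hand.
TAGS = {"raw_email": {"pii"}}


def tagged_surface(deps, tags, tag="pii"):
    """Return every node carrying `tag` directly, plus every node that is
    derived (directly or transitively) from a tagged node."""
    tainted = {node for node in deps if tag in tags.get(node, set())}
    changed = True
    # Fixed-point iteration: keep propagating until nothing new is tainted.
    while changed:
        changed = False
        for node, parents in deps.items():
            if node not in tainted and any(p in tainted for p in parents):
                tainted.add(node)
                changed = True
    return tainted
```

Here `tagged_surface(DEPENDENCIES, TAGS)` would include `raw_email` and everything computed from it, which is roughly the set to inspect for a "right to be forgotten" request. Whether derived nodes should inherit a tag automatically is a design choice this sketch simply assumes.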
Describe alternatives you've considered N/A
Additional context There are a lot of startups and organizations trying to get a handle on their data and where it is used. Hamilton can help provide a way to get at this easily...
OpenLineage looks exciting: https://openlineage.io/.
Talked with the folks from selectstar last night -- might be an interesting potential integration: https://www.selectstar.com/
We are moving repositories! Please see the new version of this issue at https://github.com/DAGWorks-Inc/hamilton/issues/15. Also, please give us a star/update any of your internal links.