bigquery-view-analyzer
bigquery-view-analyzer copied to clipboard
Providing General Purpose Dependency Analyzer
Hi team, I'm trying to analyze table dependencies in a set of BigQuery queries, but I haven't found any lightweight and robust solution written in Python yet:
- Google released their parser framework but seems that it's almost not runnable on common laptop
- Didn't find any decent parser that works on all my queries (all verified as valid syntax in BigQuery console) 😿
Currently I just only need to pick out all the tables that appears in the query with no BigQuery view involved, which I think the approach proposed in this package (regex-matching potential candidates) looks promising for solving this type of question. Right now this package requires BigQuery access to retrieve query defined in the dataset with no offline support. Any possibility to provide module that extracts potential source dependency (table and view names), given a string that represents some BigQuery SELECT statement?
I actually looked into Google’s zetasql library for this project and came to a similar conclusion you did, it’s very heavyweight if all you want is to extract entities from some SQL.
Obviously the regex we’re using is coupled to the inner workings of this project, but there’s no reason it couldn’t work as a standalone module. If I get some time over the next few days I’ll see if I can extract it into a separate package that’ll work how you want it. I’ve been wanting to revisit the regex anyway since I don’t think it covers all possible permutations for formatting a table/view.
Sounds great! Right at this moment I'm simply applying bqva.analyzer.find_query_objects
on query strings to gather table information, then review dependencies between these queries manually.
Would be nice if it's possible to generate dependency graph automatically given a group of BQ queries, while users can opt in to verify views/tables under their project, or simply running the analysis offline.
As for my case tables were specified in format `[dataset].[table]`
in queries (though `[dataset]`.`[table]`
seems to be more preferred), so the current regex pattern is supposed to catch all the occurrences correctly.
@pykenny, FYI I've recently had to divulge myself of my involvement with Servian's open source projects, but @TWinsnes or @polleyg will happily take your request on board and prioritise accordingly.