bigquery-view-analyzer icon indicating copy to clipboard operation
bigquery-view-analyzer copied to clipboard

Providing General Purpose Dependency Analyzer

Open pykenny opened this issue 3 years ago • 4 comments

Hi team, I'm trying to analyze table dependencies in a set of BigQuery queries, but I haven't found any lightweight and robust solution written in Python yet:

  • Google released their parser framework but seems that it's almost not runnable on common laptop
  • Didn't find any decent parser that works on all my queries (all verified as valid syntax in BigQuery console) 😿

Currently I just only need to pick out all the tables that appears in the query with no BigQuery view involved, which I think the approach proposed in this package (regex-matching potential candidates) looks promising for solving this type of question. Right now this package requires BigQuery access to retrieve query defined in the dataset with no offline support. Any possibility to provide module that extracts potential source dependency (table and view names), given a string that represents some BigQuery SELECT statement?

pykenny avatar Oct 19 '21 09:10 pykenny

I actually looked into Google’s zetasql library for this project and came to a similar conclusion you did, it’s very heavyweight if all you want is to extract entities from some SQL.

Obviously the regex we’re using is coupled to the inner workings of this project, but there’s no reason it couldn’t work as a standalone module. If I get some time over the next few days I’ll see if I can extract it into a separate package that’ll work how you want it. I’ve been wanting to revisit the regex anyway since I don’t think it covers all possible permutations for formatting a table/view.

christippett avatar Oct 28 '21 20:10 christippett

Sounds great! Right at this moment I'm simply applying bqva.analyzer.find_query_objects on query strings to gather table information, then review dependencies between these queries manually.

Would be nice if it's possible to generate dependency graph automatically given a group of BQ queries, while users can opt in to verify views/tables under their project, or simply running the analysis offline.

pykenny avatar Oct 29 '21 03:10 pykenny

As for my case tables were specified in format `[dataset].[table]` in queries (though `[dataset]`.`[table]` seems to be more preferred), so the current regex pattern is supposed to catch all the occurrences correctly.

pykenny avatar Oct 29 '21 03:10 pykenny

@pykenny, FYI I've recently had to divulge myself of my involvement with Servian's open source projects, but @TWinsnes or @polleyg will happily take your request on board and prioritise accordingly.

christippett avatar Nov 20 '21 01:11 christippett