spectacles icon indicating copy to clipboard operation
spectacles copied to clipboard

SQL validator (or other) should test the uniqueness constraint of all views

Open DylanBaker opened this issue 5 years ago • 5 comments
trafficstars

When primary keys are defined on a view, it is important that the key is unique, otherwise asymmetric aggregates get calculated wrong. We should be able to test if this are unique.

DylanBaker avatar Mar 22 '20 16:03 DylanBaker

What would it take to implement this? Here are a few questions to get us started:

  1. Is this a separate validator or something that goes in the SQL validator? While it is definitely SQL validation, if it needs a bunch of extra configuration it may be easier to drop it in a separate subcommand like spectacles primary or spectacles unique or even spectacles keys.

  2. My naive approach would be building a query in Looker with the primary key dimension and a count measure, filtering where the count measure is greater than one and checking zero rows are returned. Is there a better approach? What do we do if the user hasn't defined a count measure?

  3. What performance implications might this have? This is different than our current SQL validation, in that these queries will have to scan data and can't be run with LIMIT 0. What do we need to do to make sure this is used responsibly so people don't try to validate uniqueness where tables are much too large?

  4. What does the output look like? In my experience debugging primary key duplication, I want to know a) the extent of duplication (i.e. 3 duplicate keys or 3,000) and b) a few examples of duplicate keys (rank ordered in descending order by how duplicated each key is) so I can start debugging quickly. I think we would want to save the SQL for these queries as well so people can use them for debugging.

joshtemple avatar Jun 09 '20 13:06 joshtemple

  1. For me, this is a different validator. It's 'running' SQL but I think does a different job from the SQL validator. I have a slight preference for spectacles primary, but really would be happy with any of the three you've suggested.

  2. I think this approach could also fail if they've defined their joins poorly. I imagined this being done via the SQL runner. I also think we will need to parse the project, because I don't believe we can get the primary keys via the API. I was imagining something like: Parse project to get every PK for each view, generate a SQL runner query that references the view, return the number of non-unique results. (More on that last point in 4.)

  3. This is a really good question. Not sure what the answer is. I can't think of a way that it wouldn't do at least a full scan on the primary key column.

  4. As an MVP, my expectation would be the number of duplicated rows and a query to debug them. My worry about showing examples is that we would be doing another scan.

In terms of the SQL we would run, I believe we can do something like this via the SQL runner:

select count(*) from (
  select ${dim_customers.customer_id}
  from ${dim_customers.SQL_TABLE_NAME}
  group by 1
  having count(*) > 1
)

The only caveat to this is that we would need to know a model that has access to a view. This is because it needs to know what connection to use. That view doesn't actually have to be used in an explore in that model. It just needs to be one of the views included by the include: lines of the model file.

DylanBaker avatar Jun 09 '20 14:06 DylanBaker

Can you elaborate on this?

I think this approach could also fail if they've defined their joins poorly.

I have worries about using the SQL runner for a few reasons. First, it requires granting full query access to the user's warehouse (vs. queries in Looker which is a more limited set of options). A user may want to scope their Spectacles API key to only have query permissions and not have SQL runner permissions.

Second, using the SQL runner might require us to know the SQL syntax of the user's data warehouse. The query you're proposing is pretty basic, standard SQL, but I would prefer to leverage's Looker's ability to write the SQL for us if possible.

I also think we will need to parse the project, because I don't believe we can get the primary keys via the API.

This is a great point that I had forgotten about. So we will need to access the actual LookML, which means the user will need to provide a GitHub access token so we can retrieve the code and parse it. This is has the side effect of restricting this feature to Git hosts that we can support (i.e. we will need to do this for GitHub, GHE, GitLab, Bitbucket, etc.)

joshtemple avatar Jun 09 '20 15:06 joshtemple

I think this approach could also fail if they've defined their joins poorly.

I think I may be wrong about this. But, my thinking was that if it's a one-to-many join, I'm not sure the count on the view would necessarily return what we would get if we were just querying the view itself, because of what Looker does to manage symmetric aggregates, given that it assumes the column IS unique.

DylanBaker avatar Jun 09 '20 15:06 DylanBaker

I think your points are totally reasonable. I would need to think about it more, but my gut instinct is that it would be tough to test all views with explores unless we had them create a hidden explore for each view, which sucks as UX.

With respect to the git hosts, I think we could just take a git URL and an SSH key, couldn't we? (This is what dbt Cloud does for git services it doesn't fully integrate with.) That way, they don't get PR functionality, but we would be able to clone their repo to parse it.

DylanBaker avatar Jun 09 '20 15:06 DylanBaker