pointblank
pointblank copied to clipboard
Checking relational data without joining manually before / reflect data structure in data dictionary
This is the follow-up from https://twitter.com/riannone/status/1305399926617145344.
I have data which is structured like a relational database using several distinct tables. Thus, many values need to be checked across two or more tables of the data structure.
One solution would be to join the tables in question before creating an agent. However, it would be nice if it was possible to somehow reference the other tables needed without joining explicitly.
I think this would also help in describing the checks in a YAML file / data dictionary. But this would need to be a slightly different kind of YAML than the one described in #149. One solution might be to include a columns
section for each table of the full relational structure in the YAML, like:
tab1:
columns:
id:
label: A short label.
type: character
info: An ID column. Can be used to merge tab1 with tab2.
column_1:
label: A short label.
type: numeric
info: Extended information about the column.
...
tab2:
columns:
id:
label: A short label.
type: character
info: An ID column. Can be used to merge tab2 with tab1.
column_2:
label: A short label.
type: numeric
info: Extended information about the column.
...
Then, one could reference variables from other table with tab$var or tab[[var]] for example. But one would have to describe the relations of the different tables as well, otherwise joining the tables automatically in the backgroud would not work. Here's a rough example of what I mean:
relations:
tab_1: tab1
tab_2: tab2
type: 1:n
key_1: id
key_2: id
This sounds like a lot of work but would greatly enhance the purpose of the YAML serving as a data dictionary.
Hi Rich,
yesterday, I discussed this topic with some colleages. We had one key insight:
Relations between tables need to be named
- This makes the formulation of quality checks across tables easier, since we can refer to a named relation, rather than several single tables.
- Also, this enables us to create relations between a relation (say
tabA
andtabB
) and another table (tabC
). I don't see a different solution to this right now. - Thus, it helps even more to describe the data like a data dictionary.
Here is an updated example:
Tables
tabA:
columns:
idA:
label: A short label.
type: character
info: An ID column. Can be used to merge tabA with tabB.
column_1:
label: A short label.
type: numeric
info: Extended information about the column.
...
tabB:
columns:
id1:
label: A short label.
type: character
info: An ID column. Can be used to merge tabB with tabA.
id2:
label: A short label.
type: character
info: An ID column. Can be used to merge tabB with tabC.
column_2:
label: A short label.
type: numeric
info: Extended information about the column.
...
tabC:
columns:
idC:
label: A short label.
type: character
info: An ID column. Can be used to merge tabC with tabB.
column_2:
label: A short label.
type: numeric
info: Extended information about the column.
...
Relations
relations:
tabA_tabB:
tab_1: tabA
tab_2: tabB
type: 1:n
key_1: idA
key_2: id1
all_tabs:
tab_1: tabA_tabB
tab_2: tabC
type: n:1
key_1: id2
key_2: idC