pointblank icon indicating copy to clipboard operation
pointblank copied to clipboard

Checking relational data without joining manually before / reflect data structure in data dictionary

Open matthiasgomolka opened this issue 3 years ago • 1 comments

This is the follow-up from https://twitter.com/riannone/status/1305399926617145344.


I have data which is structured like a relational database using several distinct tables. Thus, many values need to be checked across two or more tables of the data structure.

One solution would be to join the tables in question before creating an agent. However, it would be nice if it was possible to somehow reference the other tables needed without joining explicitly.

I think this would also help in describing the checks in a YAML file / data dictionary. But this would need to be a slightly different kind of YAML than the one described in #149. One solution might be to include a columns section for each table of the full relational structure in the YAML, like:

tab1:
  columns:
    id:
      label: A short label.
      type: character
      info: An ID column. Can be used to merge tab1 with tab2.
    column_1:
      label: A short label.
      type: numeric
      info: Extended information about the column.
    ...

tab2:
  columns:
    id:
      label: A short label.
      type: character
      info: An ID column. Can be used to merge tab2 with tab1.
    column_2:
      label: A short label.
      type: numeric
      info: Extended information about the column.
    ...

Then, one could reference variables from other table with tab$var or tab[[var]] for example. But one would have to describe the relations of the different tables as well, otherwise joining the tables automatically in the backgroud would not work. Here's a rough example of what I mean:

relations:
  tab_1: tab1
  tab_2: tab2
  type: 1:n
  key_1: id
  key_2: id

This sounds like a lot of work but would greatly enhance the purpose of the YAML serving as a data dictionary.

matthiasgomolka avatar Sep 14 '20 07:09 matthiasgomolka

Hi Rich,

yesterday, I discussed this topic with some colleages. We had one key insight:

Relations between tables need to be named

  • This makes the formulation of quality checks across tables easier, since we can refer to a named relation, rather than several single tables.
  • Also, this enables us to create relations between a relation (say tabA and tabB) and another table (tabC). I don't see a different solution to this right now.
  • Thus, it helps even more to describe the data like a data dictionary.

Here is an updated example:

Tables

tabA:
  columns:
    idA:
      label: A short label.
      type: character
      info: An ID column. Can be used to merge tabA with tabB.
    column_1:
      label: A short label.
      type: numeric
      info: Extended information about the column.
    ...

tabB:
  columns:
    id1:
      label: A short label.
      type: character
      info: An ID column. Can be used to merge tabB with tabA.
    id2:
      label: A short label.
      type: character
      info: An ID column. Can be used to merge tabB with tabC.
    column_2:
      label: A short label.
      type: numeric
      info: Extended information about the column.
    ...
tabC:
  columns:
    idC:
      label: A short label.
      type: character
      info: An ID column. Can be used to merge tabC with tabB.
    column_2:
      label: A short label.
      type: numeric
      info: Extended information about the column.
    ...

Relations

relations:
  tabA_tabB:
    tab_1: tabA
    tab_2: tabB
    type: 1:n
    key_1: idA
    key_2: id1
  all_tabs:
    tab_1: tabA_tabB
    tab_2: tabC
    type: n:1
    key_1: id2
    key_2: idC

matthiasgomolka avatar Sep 30 '20 06:09 matthiasgomolka