soda-sql icon indicating copy to clipboard operation
soda-sql copied to clipboard

Add support for `include` and `read` functions

Open dirkgroenen opened this issue 4 years ago • 7 comments

As discussed earlier last weekend's code restructure should allow for more freedom when defining a soda-sql project directory structure. To accomplish this we no longer make 'directory' assumptions and instead allow developers to inline their sql_metrics in the scan-yml file.

In order to prevent scan-yml files from getting too large we should also support 'macro functions' which allow developers to include multiple files or read the contents of a file. As proposed we should consider adding support for either one, or both, of the following functions:

  • include(./path/*.yml) This include() function can be used in combination with sql_metrics: to include the contents of a list of files. The first argument can be a path containing wildcards or a directory which will then read and parse all yml files.

  • read(./path/query.sql) The read() function allows to read the contents of a given file and use it as the properties value. This can for example be used in combination with the sql_metrics[].sql property when referring to a metric.sql query file.

An example of a scan-YML file which uses read() could then look like:

// ./tables/orders.yml
table_name: CUSTOMERS
metrics:
  - row_count
sql_metrics:
  # Use inline custom metrics
  - sql: |
      SELECT
        COUNT(CASE WHEN O_COUNTRY IS "NL" THEN 1 END) as dutch_customers,
      FROM CUSTOMERS
    tests:
      no_dutch_customers: dutch_customers == 0
  # Use inline custom metrics where the SQL is defined in a separate file
  - sql: read(./sql/very_complex_sql.sql)
    tests:
      order_size: order_size > 1000
tests:
  dataset_size: row_count == 0

An example using include() would look like:

// ./tables/orders.yml
table_name: ORDERS
metrics:
  - row_count
  - sum
columns:
  O_ORDERSTATUS:
    valid_values:
      - O
      - F
sql_metrics: include(./metrics/orders/*.yml)
tests:
  dataset_size: row_count == 0
// ./metrics/orders/my_metric.yml
sql: |
  SELECT
    COUNT(CASE WHEN O_ORDERSTATUS IS "P" THEN 1 END) as order_p,
    COUNT(CASE WHEN O_ORDERSTATUS IS "O" THEN 1 END) as order_o,
  FROM ORDERS
tests:
  open_orders: order_o == 0
  processing_orders: order_p > 0

Up for discussion:

  • Do we go with both read() and include() or do we want to merge them into one?

My two cents: merging into one might create more automagical behavior and can therefore be less preferred.

dirkgroenen avatar Feb 08 '21 10:02 dirkgroenen

Reflecting on this, my current thought is that we should go for a strategy to strictly separate the interpretation of the keys. I think that will lead to better error messages and simpler documentation.

sql_metrics -> always expect a list of dicts sql_metrics_includes -> expects a list of file paths (with optional wild cards)

Similarly sql_metrics.sql_file -> file path

This would imply we may have to revisit the named tests as we currently also allow for 2 flavours on 1 key tests. We could change that to tests -> list of unnamed test expressions named_tests -> dictionary of tests

Other thoughts?

tombaeyens avatar Feb 19 '21 12:02 tombaeyens

I remember your suggested flavor to have been discussed in the engineering meeting leading up to the proposal as in this issue's OP, but there were some motivations from the team against it which were mostly related to verbosity, maintenance and reduced flexibility.

Some arguments for using include/read functions which I can remember:

  • It's seen more often in YAML parsers
  • It fits the picture of "being flexible as a developer to compose the project structure the way you like". Since you can use include and require for pretty much every key of you like, which thus' allows you to separate for example sql_metrics, or tests, or just the sql in a sql_metric.

I can't remember all arguments from the discussion, so don't pin me on it.

My personal vote goes for the "function" approach given it's a universal solution which only has to be documented and explained once, where as the verbose xxx_list, xxx_file, xxx_include requires (imo) more documentation.

dirkgroenen avatar Feb 19 '21 12:02 dirkgroenen

So for sql_metrics you reproposing to go for offering different options like this?

sql_metrics: include(./path/*.yml)
sql_metrics: 
    - include(./path/*.yml)
    - read(./anotherpath/someother.yml)

and

sql_metrics: 
    - sql: ...
      tests: ...
    - sql: ...
      tests: ...

Where the last 2 could actually be combined.

tombaeyens avatar Feb 19 '21 13:02 tombaeyens

And also

sql_metrics: 
    - sql: ...
      tests: ...
    - sql: read(./anotherpath/someother.sql)
      tests: ...

tombaeyens avatar Feb 19 '21 13:02 tombaeyens

I personally do like that flavor, yes

dirkgroenen avatar Feb 19 '21 13:02 dirkgroenen

Note how read could also be used to resolve https://github.com/sodadata/soda-sql/issues/166.

dirkgroenen avatar Mar 10 '21 19:03 dirkgroenen

We should make sure sodadata/soda-sql#267 this is fixed as we implement this feature ([bug] sql_file is recongnized as valid sql_metric key sodadata/soda-sql#267)

tombaeyens avatar Mar 23 '21 08:03 tombaeyens