soda-sql
soda-sql copied to clipboard
Add support for `include` and `read` functions
As discussed earlier last weekend's code restructure should allow for more freedom when defining a soda-sql project directory structure. To accomplish this we no longer make 'directory' assumptions and instead allow developers to inline their sql_metrics in the scan-yml file.
In order to prevent scan-yml files from getting too large we should also support 'macro functions' which allow developers to include
multiple files or read
the contents of a file. As proposed we should consider adding support for either one, or both, of the following functions:
-
include(./path/*.yml)
Thisinclude()
function can be used in combination withsql_metrics:
to include the contents of a list of files. The first argument can be a path containing wildcards or a directory which will then read and parse allyml
files. -
read(./path/query.sql)
Theread()
function allows to read the contents of a given file and use it as the properties value. This can for example be used in combination with thesql_metrics[].sql
property when referring to ametric.sql
query file.
An example of a scan-YML file which uses read()
could then look like:
// ./tables/orders.yml
table_name: CUSTOMERS
metrics:
- row_count
sql_metrics:
# Use inline custom metrics
- sql: |
SELECT
COUNT(CASE WHEN O_COUNTRY IS "NL" THEN 1 END) as dutch_customers,
FROM CUSTOMERS
tests:
no_dutch_customers: dutch_customers == 0
# Use inline custom metrics where the SQL is defined in a separate file
- sql: read(./sql/very_complex_sql.sql)
tests:
order_size: order_size > 1000
tests:
dataset_size: row_count == 0
An example using include()
would look like:
// ./tables/orders.yml
table_name: ORDERS
metrics:
- row_count
- sum
columns:
O_ORDERSTATUS:
valid_values:
- O
- F
sql_metrics: include(./metrics/orders/*.yml)
tests:
dataset_size: row_count == 0
// ./metrics/orders/my_metric.yml
sql: |
SELECT
COUNT(CASE WHEN O_ORDERSTATUS IS "P" THEN 1 END) as order_p,
COUNT(CASE WHEN O_ORDERSTATUS IS "O" THEN 1 END) as order_o,
FROM ORDERS
tests:
open_orders: order_o == 0
processing_orders: order_p > 0
Up for discussion:
- Do we go with both
read()
andinclude()
or do we want to merge them into one?
My two cents: merging into one might create more automagical behavior
and can therefore be less preferred.
Reflecting on this, my current thought is that we should go for a strategy to strictly separate the interpretation of the keys. I think that will lead to better error messages and simpler documentation.
sql_metrics
-> always expect a list of dicts
sql_metrics_includes
-> expects a list of file paths (with optional wild cards)
Similarly
sql_metrics.sql_file
-> file path
This would imply we may have to revisit the named tests as we currently also allow for 2 flavours on 1 key tests
. We could change that to
tests
-> list of unnamed test expressions
named_tests
-> dictionary of tests
Other thoughts?
I remember your suggested flavor to have been discussed in the engineering meeting leading up to the proposal as in this issue's OP, but there were some motivations from the team against it which were mostly related to verbosity, maintenance and reduced flexibility.
Some arguments for using include/read
functions which I can remember:
- It's seen more often in YAML parsers
- It fits the picture of "being flexible as a developer to compose the project structure the way you like". Since you can use
include
andrequire
for pretty much every key of you like, which thus' allows you to separate for example sql_metrics, or tests, or just thesql
in a sql_metric.
I can't remember all arguments from the discussion, so don't pin me on it.
My personal vote goes for the "function" approach given it's a universal solution which only has to be documented and explained once, where as the verbose xxx_list
, xxx_file
, xxx_include
requires (imo) more documentation.
So for sql_metrics you reproposing to go for offering different options like this?
sql_metrics: include(./path/*.yml)
sql_metrics:
- include(./path/*.yml)
- read(./anotherpath/someother.yml)
and
sql_metrics:
- sql: ...
tests: ...
- sql: ...
tests: ...
Where the last 2 could actually be combined.
And also
sql_metrics:
- sql: ...
tests: ...
- sql: read(./anotherpath/someother.sql)
tests: ...
I personally do like that flavor, yes
Note how read
could also be used to resolve https://github.com/sodadata/soda-sql/issues/166.
We should make sure sodadata/soda-sql#267 this is fixed as we implement this feature ([bug] sql_file is recongnized as valid sql_metric key sodadata/soda-sql#267)