Changing how we create subsets
In the current version of triage, subsets are treated as subsets of entities, independent of the cohorts we create (unless we add the cohort query to the subset query). As a result, a subset table tends to repeat the same entity_id for many as-of-dates, even when that entity is not part of the cohort on those dates, which makes the table very large.
The problem is more acute when the universe of entities is large. To counter it, we could either include the cohort query in every subset query we write in the experiment config (e.g., as a CTE), or change triage under the hood so that a subset only includes entities from the respective cohort (treating a subset as a subset of a cohort rather than a subset of all entities). This PR does the latter.
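To make the size difference concrete, here is a minimal sketch using an in-memory SQLite database. The `events` and `cohort` tables, the `SUBSET_QUERY` string, and the `subset_rows` helper are all made up for illustration; they are not triage's schema or code, and triage itself runs against PostgreSQL.

```python
import sqlite3

# Toy data (hypothetical): three flagged entities, but the cohort only
# contains entity 1 on 2020-01-01 and entities 1 and 2 on 2021-01-01.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (entity_id INTEGER, event_date TEXT, flagged INTEGER);
    CREATE TABLE cohort (entity_id INTEGER, as_of_date TEXT);
    INSERT INTO events VALUES
        (1, '2019-06-01', 1), (2, '2019-06-01', 1), (3, '2019-06-01', 1);
    INSERT INTO cohort VALUES
        (1, '2020-01-01'), (1, '2021-01-01'), (2, '2021-01-01');
""")

AS_OF_DATES = ["2020-01-01", "2021-01-01"]

# A user-written subset query, parameterized by the as-of-date.
SUBSET_QUERY = (
    "SELECT DISTINCT entity_id FROM events "
    "WHERE flagged = 1 AND event_date < ?"
)


def subset_rows(restrict_to_cohort):
    """Build (entity_id, as_of_date) rows for every as-of-date."""
    rows = []
    for as_of_date in AS_OF_DATES:
        if restrict_to_cohort:
            # Proposed behaviour: inner join with the cohort for that date.
            sql = f"""
                SELECT s.entity_id, c.as_of_date
                FROM ({SUBSET_QUERY}) AS s
                INNER JOIN cohort AS c ON c.entity_id = s.entity_id
                WHERE c.as_of_date = ?
            """
        else:
            # Current behaviour: every matching entity, for every date.
            sql = f"SELECT s.entity_id, ? FROM ({SUBSET_QUERY}) AS s"
        # Both statements take the as-of-date twice, in textual order.
        rows.extend(conn.execute(sql, (as_of_date, as_of_date)).fetchall())
    return sorted(rows)


print(len(subset_rows(restrict_to_cohort=False)))  # rows without the join
print(subset_rows(restrict_to_cohort=True))        # rows with the join
```

In this toy setup the unrestricted subset holds a row for every flagged entity on every as-of-date, while the joined version keeps only the entity-date pairs that exist in the cohort.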
Merging this PR will:
- Create an extended class of the `EntityDateTableGenerator`, called `SubsetEntityDateTableGenerator`, to handle subset tables. The extended class adds an automatic inner join with the cohort table so that the subset table only includes entities that belong to the cohort for the respective date.
- Add the `cohort_table_name` to the `subset_config`, so that we can track the cohort table that the subset belongs to and the `subset_hash` reflects the cohort table.
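The effect of the second change on the hash can be sketched as follows. This is only an illustration: `config_hash` is a hypothetical stand-in (MD5 over sorted JSON), not triage's actual hashing function, and the config keys other than `cohort_table_name` as well as the table names are invented.

```python
import hashlib
import json


def config_hash(config):
    """Hypothetical hash: equal dicts always map to the same digest."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


# Same subset definition (invented for this example) ...
base_subset = {
    "name": "flagged_entities",
    "query": "select distinct entity_id from events "
             "where event_date < '{as_of_date}'",
}

# ... attached to two different (made-up) cohort tables.
subset_a = dict(base_subset, cohort_table_name="cohort_abc123")
subset_b = dict(base_subset, cohort_table_name="cohort_def456")

print(config_hash(subset_a) == config_hash(subset_b))        # different cohorts
print(config_hash(subset_a) == config_hash(dict(subset_a)))  # same config
```

Under this assumption, the same subset query run against two different cohorts gets two different hashes, so the resulting subset tables are not confused with one another.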