
Changing how we create subsets

Open kasunamare opened this issue 8 months ago • 0 comments

In the current version of triage, subsets are treated as subsets of entities, independent of the cohorts we create (unless we add the cohort query to the subset query). As a result, subset tables tend to repeat the same entity_id for many as-of dates, even when that entity is not part of the cohort for those dates, and the tables tend to become very large.

This is more acute when we have a large universe of entities. To counter this, we could either include the cohort query in every subset query we write in the experiment config (e.g., as a CTE), or change triage under the hood so that a subset only includes entities from the respective cohort (treating a subset as a subset of a cohort rather than a subset of all entities). This PR attempts the latter.
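A minimal sketch of the difference, using an in-memory SQLite database rather than triage itself. The `events` and `cohort` tables, the subset query, and the `subset_rows` helper are hypothetical and only illustrate the idea of joining the subset against the cohort for each as-of date:

```python
import sqlite3

# Hypothetical toy data: three flagged entities, but the cohort only
# contains entities 1 and 2 on the first date and entity 1 on the second.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (entity_id INTEGER, flagged INTEGER);
    CREATE TABLE cohort (entity_id INTEGER, as_of_date TEXT);
    INSERT INTO events VALUES (1, 1), (2, 1), (3, 1);
    INSERT INTO cohort VALUES
        (1, '2020-01-01'), (2, '2020-01-01'), (1, '2021-01-01');
""")

# Hypothetical subset query, written without any reference to the cohort.
SUBSET_QUERY = "SELECT DISTINCT entity_id FROM events WHERE flagged = 1"
AS_OF_DATES = ["2020-01-01", "2021-01-01"]


def subset_rows(restrict_to_cohort):
    """Build (entity_id, as_of_date) rows for every as-of date."""
    rows = []
    for as_of_date in AS_OF_DATES:
        if restrict_to_cohort:
            # Inner join keeps only entities in the cohort for this date.
            sql = f"""
                SELECT s.entity_id, c.as_of_date
                FROM ({SUBSET_QUERY}) AS s
                INNER JOIN cohort AS c ON c.entity_id = s.entity_id
                WHERE c.as_of_date = ?
                ORDER BY s.entity_id
            """
        else:
            # Every matching entity is repeated for every as-of date.
            sql = f"""
                SELECT s.entity_id, ? AS as_of_date
                FROM ({SUBSET_QUERY}) AS s
                ORDER BY s.entity_id
            """
        rows.extend(conn.execute(sql, (as_of_date,)).fetchall())
    return rows


unrestricted = subset_rows(restrict_to_cohort=False)
restricted = subset_rows(restrict_to_cohort=True)
print(len(unrestricted), len(restricted))
```

In this toy setup the unrestricted subset repeats all three entities on both dates, while the joined version keeps only the entity/date pairs that appear in the cohort.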

Merging this PR will:

  • Create SubsetEntityDateTableGenerator, an extended class of EntityDateTableGenerator that handles subset tables. The extended class adds an automatic inner join with the cohort table, so the subset table only includes entities that belong to the cohort for the respective date.
  • Add the cohort_table_name to the subset_config, so that we can track the cohort table that the subset belongs to and the subset_hash reflects the cohort table.
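To illustrate the second point, here is a rough sketch of how adding the cohort table name to a subset config changes a hash derived from that config. The `subset_hash` function below is a stand-in written for this example (MD5 over sorted JSON); it is an assumption and not necessarily how triage computes its hashes, and the config values are made up:

```python
import hashlib
import json


def subset_hash(subset_config):
    """Hypothetical stand-in: hash a config dict deterministically."""
    payload = json.dumps(subset_config, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


# Made-up subset config; only the cohort_table_name key comes from the PR.
base_config = {
    "name": "flagged",
    "query": "select distinct entity_id from events where flagged = 1",
}

config_a = {**base_config, "cohort_table_name": "cohort_abc"}
config_b = {**base_config, "cohort_table_name": "cohort_xyz"}

hash_a = subset_hash(config_a)
hash_b = subset_hash(config_b)
print(hash_a != hash_b)
```

Because the cohort table name is part of the hashed config, the same subset query applied to two different cohorts produces two different hashes, so their tables would not collide.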

kasunamare · May 30 '24 17:05