model-res-avm
Create a feature selection/evaluation template
The current CCAO feature evaluation process for new model features is very ad-hoc. We typically look at the change in model performance metrics before and after the addition of a new feature, as well as the absolute SHAP values associated with that feature.
In order to make this ad-hoc process slightly more repeatable and rigorous, we should create a template Quarto document we can use to evaluate new features. This document should contain both standard, repeatable sections (e.g. model performance stats by township) and a series of questions that will likely require additional ad-hoc analysis. For example, given a question like "Where is the new model feature most impactful?" and a feature that adds distance to stadium, one might add maps of PIN-level SHAP values surrounding each stadium.
Goal
The goal here is to remove (or exclude in the first place) features that have no predictive power in any geography (i.e. they are merely noise). The goal is not to remove features that are redundant, only mildly predictive, or predictive only in certain geographic areas; the model, which is regularized and performs other forms of de facto feature selection, already handles those cases more or less automatically.
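As a rough illustration of the "no predictive power in any geography" criterion, here is a minimal sketch. It is not an existing CCAO function: the function names, the idea of grouping by township, and the threshold are all hypothetical, and it assumes per-PIN SHAP values for the new feature have already been computed.

```python
from collections import defaultdict
from statistics import mean


def mean_abs_shap_by_geography(rows):
    """Return {geography: mean |SHAP|} for a single feature.

    `rows` is an iterable of (geography, shap_value) pairs, one per PIN.
    Both the pairing and the geography labels are assumptions made for
    this sketch.
    """
    grouped = defaultdict(list)
    for geography, shap_value in rows:
        grouped[geography].append(abs(shap_value))
    if not grouped:
        raise ValueError("rows is empty; nothing to evaluate")
    return {geo: mean(vals) for geo, vals in grouped.items()}


def is_noise_everywhere(rows, threshold):
    """True only if mean |SHAP| is below `threshold` in every geography.

    A feature that matters in even one geography is kept, matching the
    goal of removing only features that are noise everywhere.
    `threshold` is a hypothetical cutoff that would need to be chosen
    per feature.
    """
    by_geo = mean_abs_shap_by_geography(rows)
    return all(value < threshold for value in by_geo.values())


# Made-up values purely for illustration; not real model output.
example_rows = [
    ("Township A", 0.0),
    ("Township A", 0.5),
    ("Township B", 1500.0),
    ("Township B", -2500.0),
]
```

With these made-up rows, the feature is negligible in "Township A" but not in "Township B", so it would not be flagged as noise. A real evaluation would still pair a check like this with the before/after performance metrics described above.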
Task
Create a Quarto document at analyses/new-feature-template.qmd that can be used to evaluate whether or not new features are merely noise. The document should:
- Be buildable from existing data:
  - It should reference a new model output run containing the new feature you added, and should load the modeling results directly from S3
  - It should use the output metadata to load the input data from DVC, using the DVC S3 cache
  - Basically, anyone should be able to click render on the document and have it build, assuming they have credential access to S3
- Exist as a one-off. Unlike the documents in reports/ (which are rendered on every run), documents in analyses/ are only run to answer a specific, one-time question.
- Contain three types of content:
  - Templated content which does not need to be changed per feature, i.e. model performance statistics, aggregate SHAP plots, etc.
  - Ad-hoc content specific to that feature (see stadium example above)
  - Text that explains the plots/tables and indicates what decision was reached
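For the "load the input data from DVC, using the DVC S3 cache" requirement, the sketch below shows one way to turn an MD5 hash into the location of the cached file on a DVC remote. It assumes the run metadata exposes the DVC MD5 of the input data and that the output is a single file rather than a directory; the function name and the bucket in the usage example are placeholders, not the real CCAO values. The path layout itself (`files/md5/<first 2 chars>/<remaining chars>` for DVC 3.x, without the `files/md5/` prefix for DVC 2.x) is DVC's standard content-addressed cache structure.

```python
def dvc_cache_uri(remote_url, md5, dvc3_layout=True):
    """Build the URI of a single-file DVC cache object from its MD5 hash.

    remote_url:  base URL of the DVC remote (placeholder in this sketch).
    md5:         32-character hex digest recorded by DVC for the file.
    dvc3_layout: True for the DVC 3.x layout (files/md5/ prefix),
                 False for the older DVC 2.x layout.
    """
    md5 = md5.strip().lower()
    # Directory outputs use a ".dir" suffix and are not handled here.
    if len(md5) != 32 or any(c not in "0123456789abcdef" for c in md5):
        raise ValueError(f"not a 32-character hex MD5 digest: {md5!r}")
    prefix = "files/md5/" if dvc3_layout else ""
    # DVC shards the cache by the first two hex characters of the hash.
    return f"{remote_url.rstrip('/')}/{prefix}{md5[:2]}/{md5[2:]}"


# Usage with a placeholder bucket and the MD5 of an empty file.
example_uri = dvc_cache_uri(
    "s3://example-bucket", "d41d8cd98f00b204e9800998ecf8427e"
)
```

The resulting URI could then be passed to whatever reader the template uses, so the document builds for anyone with S3 credentials without a local `dvc pull`.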
This document can be copied and then renamed for each new feature added, similar to the workflow for enterprise intelligence.
@ccao-jardine