posteriordb
Proposal: Add `data_used` to posterior `.json` files where relevant
Proposal:
Modify the posterior .json files to specify what data from the dataframe is actually used as an input to the model.
Rationale:
Some models only use a subset of their data. For example, earn_height uses the earnings data:
N => 1192
earn => [50000, 60000, 30000,...
height => [74, 66, 64, 63, 63, 64,...
male => [1, 0, 0, 0, 0, 0, 0,...
Of this data, earn_height only uses a subset: N, earn, height. This is fine for Stan, which will automatically discard data that doesn't match variables defined in the data block.
Unfortunately, this is frustrating when trying to port PosteriorDB models to other PPLs. Many PPLs — notably Turing, but I think also PyMC, NumPyro, Gen, and so on — use some sort of overloaded function definition to define a probabilistic program, e.g.:
# generic-ppl-pseudocode:
@make_model function model_name(data_1, data_2, data_3) {
    prior ~ dist()
    data_1 ~ dist(prior, smth ...)
}
In this setup, the data arguments need to exactly match the columns of the dataframe, and so the dataframe must be filtered beforehand to extract the relevant columns. To make this easier, it would be helpful to have a field in the posterior .json file, data_used, listing the columns the model actually takes as input.
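To illustrate the filtering step, here is a minimal Python sketch. It assumes the data has already been loaded as a plain dict keyed by variable name. The helper `select_used` and the three-row `earnings` dict are hypothetical stand-ins, not part of posteriordb.

```python
def select_used(data, data_used):
    """Return only the entries of `data` named in `data_used`, in that order."""
    return {name: data[name] for name in data_used}


# Hypothetical, truncated stand-in for the earnings data (3 rows instead of 1192).
earnings = {
    "N": 3,
    "earn": [50000, 60000, 30000],
    "height": [74, 66, 64],
    "male": [1, 0, 0],
}

# Keep only what the model takes as input; "male" is dropped.
model_inputs = select_used(earnings, ["N", "earn", "height"])
print(list(model_inputs))
```

The filtered dict can then be splatted into a model function whose arguments are exactly N, earn, and height.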
Example addition:
{
"name": "earnings-earn_height",
"keywords": ["arm book", "stan examples"],
"urls": "https://github.com/stan-dev/example-models/tree/master/ARM/Ch.4",
"model_name": "earn_height",
"data_name": "earnings",
"reference_posterior_name": "earnings-earn_height",
"references": "gelman2006data",
"dimensions": {
"beta": 2,
"sigma": 1
},
"added_date": "2020-01-17",
"added_by": "Oliver Järnefelt"
}
would become:
{
"name": "earnings-earn_height",
"keywords": ["arm book", "stan examples"],
"urls": "https://github.com/stan-dev/example-models/tree/master/ARM/Ch.4",
"model_name": "earn_height",
"data_name": "earnings",
"data_used": ["N", "earn", "height"],  # <--------------- the change is here
"reference_posterior_name": "earnings-earn_height",
"references": "gelman2006data",
"dimensions": {
"beta": 2,
"sigma": 1
},
"added_date": "2020-01-17",
"added_by": "Oliver Järnefelt"
}
This change would only need to occur for models where the provided dataframe is a superset of the data the model actually uses.
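Since the field would be optional, a consumer would need to handle posteriors that lack it. A rough Python sketch of one way to do that follows. The function name `inputs_for` and the fallback rule (use every column when the field is missing) are assumptions for illustration, not existing posteriordb behaviour.

```python
import json


def inputs_for(posterior_info, data):
    """Pick model inputs; fall back to all of `data` when `data_used` is absent."""
    names = posterior_info.get("data_used")
    if names is None:
        # Assumed convention: no field means the model uses every column.
        return dict(data)
    missing = [name for name in names if name not in data]
    if missing:
        raise KeyError(f"data_used names not found in data: {missing}")
    return {name: data[name] for name in names}


# Hypothetical, abbreviated posterior entries, with and without the new field.
with_field = json.loads('{"data_name": "earnings", "data_used": ["N", "earn", "height"]}')
without_field = json.loads('{"data_name": "earnings"}')

# Hypothetical one-row stand-in for the earnings data.
data = {"N": 1, "earn": [50000], "height": [74], "male": [1]}

subset = inputs_for(with_field, data)
everything = inputs_for(without_field, data)
print(sorted(subset), sorted(everything))
```

Raising on unknown names is a design choice in this sketch: it surfaces typos in data_used early rather than silently passing fewer arguments to the model.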