QUIPP-pipeline
2011 census microdata play
As part of the Synthetic Data and Privacy Preservation - Turing/ONS partnership project 3, we're trying out the QUIPP pipeline on this dataset.
Note: may or may not need to ever merge this - just putting up so @ots22 can easily pull the branch
@ots22 I've attempted to modify the existing examples to run the different synth-method choices with stock parameters, changing only the parts that refer to column names. Example 4, the SGF one, ran without any errors (I've set this one to enabled: true). If you pull the branch and switch any of the others from enabled: false to enabled: true, you should hopefully get the errors I got for those.
On the SGF one, it seems to have generated a synthetic dataset! The only problem is that there are no values for the 2nd column (possibly I wrongly chose the categorical type for that column in the dataset JSON here, not sure).
Also, I created issue #67 for the error I got on the CTGAN one, as I noticed the same error when I tried to run the existing CTGAN example from run-inputs.
From our discussion in-person just now:
- we're planning to drop CTGAN for now
- we fixed a few errors in the synthpop parameters, and now a 'bootstrap' synthesis works
- the classifiers run for a long time (to investigate)
I think the classifiers run for a long time when no specific classifier with specific hyperparameters is passed in the run-inputs file. In that case, a number of classifiers are tested, each with many combinations of hyperparameters. I recommend using something like this to reduce the run time. It uses only logistic regression with defined params.
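To make the cost difference concrete, here is a rough sketch of how the number of model fits grows when several classifiers are each searched over a hyperparameter grid, compared with pinning a single logistic regression. This is not QUIPP code or the linked run-inputs file: the grids, the names `DEFAULT_SEARCH` and `PINNED`, and the `expand` helper are all made up for illustration, and the real defaults in the pipeline may differ.

```python
from itertools import product

# Hypothetical search space, NOT QUIPP's actual defaults.
DEFAULT_SEARCH = {
    "LogisticRegression": {"C": [0.01, 0.1, 1.0, 10.0], "max_iter": [100, 1000]},
    "RandomForest": {"n_estimators": [10, 100, 500], "max_depth": [3, 10, None]},
    "SVC": {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"], "gamma": ["scale", "auto"]},
}

# Hypothetical pinned choice: one classifier, one value per hyperparameter.
PINNED = {"LogisticRegression": {"C": [1.0], "max_iter": [1000]}}


def expand(search):
    """Yield every (classifier name, params dict) pair that would be fitted."""
    for name, grid in search.items():
        keys = list(grid)
        # Cartesian product of all candidate values for this classifier.
        for values in product(*(grid[k] for k in keys)):
            yield name, dict(zip(keys, values))


default_fits = list(expand(DEFAULT_SEARCH))
pinned_fits = list(expand(PINNED))

print(f"searching everything: {len(default_fits)} fits")
print(f"pinned logistic regression: {len(pinned_fits)} fit")
```

Each fit is also typically repeated per cross-validation fold and per synthetic dataset, so the gap in wall-clock time can be larger than the raw counts suggest.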