QUIPP-pipeline

2011 census microdata play

Open edwardchalstrey1 opened this issue 4 years ago • 3 comments

As part of the Synthetic Data and Privacy Preservation - Turing/ONS partnership project 3, we're trying out the QUIPP pipeline on this dataset.

Note: this may never need to be merged - I'm just putting it up so @ots22 can easily pull the branch

@ots22 I've modified the existing examples to run each of the synth-method choices with stock parameters, changing only the parts that refer to column names. Example 4, the SGF one, ran without any errors (I've set this one to enabled: true). If you pull the branch and set enabled: true for any of the others, you should get the errors I got for those.

The SGF one seems to have generated a synthetic dataset! However, there are no values in the 2nd column (possibly because I wrongly chose the categorical type for that column in the dataset json here, not sure).
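One quick way to confirm which columns of a synthetic output came out empty is sketched below. This is only an illustration using the Python standard library: the helper name `empty_columns`, the column names, and the sample data are all made up, not taken from the census dataset or from QUIPP.

```python
import csv
import io


def empty_columns(csv_text):
    """Return the names of columns whose every value is empty or whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Track, per column, whether at least one non-empty value has been seen.
    seen_value = {name: False for name in reader.fieldnames or []}
    for row in reader:
        for name in seen_value:
            if (row.get(name) or "").strip():
                seen_value[name] = True
    return [name for name, seen in seen_value.items() if not seen]


# Made-up sample standing in for the synthetic output (hypothetical columns).
SAMPLE = "col_a,col_b,col_c\n1,,x\n2,,y\n3,,z\n"
print(empty_columns(SAMPLE))
```

For a real file, the text could be read with `open(path, newline="").read()` and passed to the same function.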

Also, I created issue #67 for the error I got on the CTGAN one, as I noticed the same error when I tried to run the existing CTGAN example from run-inputs.

edwardchalstrey1 Jul 14 '21 15:07

From our discussion in-person just now:

  • we're planning to drop CTGAN for now
  • we fixed a few errors in the synthpop parameters, and now a 'bootstrap' synthesis works
  • the classifiers run for a long time (to investigate)

ots22 Jul 15 '21 10:07

I think the classifiers run for a long time when no specific classifier with specific hyperparameters is passed in the run-inputs file. In that case, a number of classifiers are tested, each with many combinations of hyperparameters. I recommend using something like this to reduce the run time: it uses only logistic regression with defined params.
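To see why the default search is slow, here is a rough sketch that counts model fits for a multi-classifier grid versus a single classifier with fixed parameters. The classifier names and hyperparameter values are hypothetical and are not the grids QUIPP actually uses; the point is only that the number of fits grows with the product of the grid sizes.

```python
from itertools import product

# Hypothetical search space: names and values are illustrative only,
# not the grids QUIPP actually uses.
FULL_SEARCH = {
    "LogisticRegression": {"C": [0.01, 0.1, 1.0, 10.0], "max_iter": [100, 1000]},
    "RandomForestClassifier": {"n_estimators": [10, 100, 500], "max_depth": [5, 10, None]},
    "SVC": {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"], "gamma": ["scale", "auto"]},
}

# A single classifier with fixed parameters: one value per hyperparameter.
SINGLE = {"LogisticRegression": {"C": [1.0], "max_iter": [1000]}}


def expand(space):
    """Yield (classifier_name, params) for every combination in the space."""
    for name, grid in space.items():
        keys = list(grid)
        for values in product(*(grid[k] for k in keys)):
            yield name, dict(zip(keys, values))


def count_fits(space, cv_folds=1):
    """Number of model fits needed: combinations times cross-validation folds."""
    return sum(1 for _ in expand(space)) * cv_folds


full_fits = count_fits(FULL_SEARCH, cv_folds=5)
single_fits = count_fits(SINGLE, cv_folds=5)
print(full_fits, single_fits)
```

With these made-up grids and 5-fold cross-validation, the full search needs 145 fits against 5 for the single fixed classifier.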

gmingas Jul 15 '21 11:07
