DataLens
[CCS 2021] "DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation" by Boxin Wang*, Fan Wu*, Yunhui Long*, Luka Rimanic, Ce Zhang, Bo Li
This is the official code base for our ACM CCS 2021 paper:
"DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation".
Boxin Wang*, Fan Wu*, Yunhui Long*, Luka Rimanic, Ce Zhang, Bo Li
title={DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation},
author={Wang, Boxin and Wu, Fan and Long, Yunhui and Rimanic, Luka and Zhang, Ce and Li, Bo},
journal={ACM Conference on Computer and Communications Security (CCS)},
Prepare your environment
The project is tested on Python 3.6, but a higher version of Python should also work. Install the required packages:
pip install -r requirements.txt
Prepare your data
Please store the training data in $data_dir. By default, $data_dir is set to ../../data.
We provide a script to download the MNIST and Fashion-MNIST datasets.
python [dataset_name]
For MNIST, you can run
python mnist
For Fashion-MNIST, you can run
python fashion_mnist
For CelebA and Places365 datasets, please refer to their official websites for downloading.
python --checkpoint_dir [checkpoint_dir] --dataset [dataset_name] --train --stochastic --signsgd --topk [topk]
For example, to train DataLens on Fashion-MNIST with eps=1 and delta=1e-5:
python --checkpoint_dir fmnist_z_dim_50_topk_200_teacher_4000_sigma_5000_thresh_0.7_pt_30_d_step_2_stochastic_1e-5/ \
--topk 200 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset fashion_mnist --train --max_eps 1 --train --thresh 0.7 --sigma 5000 --nopretrain \
--z_dim 50 --nosave_epoch --epoch 300 --save_vote --d_step 2 --pretrain_teacher 10 --stochastic --max_grad 1e-5
By default, after training reaches the max epsilon, it generates 10 batches of 10,000 DP samples as {i}.pkl (i=0,...,9) in checkpoint_dir.
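To illustrate what "reaching the max epsilon" means, the sketch below tracks a privacy budget with Rényi differential privacy (RDP) accounting over repeated Gaussian-noise releases. It is not the repository's accountant. The function names, the `sensitivity` parameter, and the default range of orders are assumptions made for illustration. It relies only on the standard RDP bound for the Gaussian mechanism and the standard conversion from RDP to (eps, delta).

```python
import math


def gaussian_rdp(alpha, sensitivity, sigma):
    """RDP of order alpha for one Gaussian-mechanism release."""
    return alpha * sensitivity ** 2 / (2 * sigma ** 2)


def rdp_to_eps(rdp_by_order, delta):
    """Convert accumulated RDP values to eps for a given delta (best order wins)."""
    return min(
        rdp + math.log(1 / delta) / (alpha - 1)
        for alpha, rdp in rdp_by_order.items()
    )


def steps_until_budget(max_eps, delta, sensitivity, sigma, orders=range(2, 200)):
    """Count how many noisy releases fit before eps would exceed max_eps.

    The default `orders` range is a hypothetical choice, not taken from the code base.
    """
    total = {a: 0.0 for a in orders}
    steps = 0
    while True:
        # RDP composes additively across releases.
        trial = {a: total[a] + gaussian_rdp(a, sensitivity, sigma) for a in orders}
        if rdp_to_eps(trial, delta) > max_eps:
            return steps
        total, steps = trial, steps + 1
```

Under this accounting, a larger noise scale (as set by --sigma) lets more aggregation steps run before the budget given by --max_eps and --delta is spent.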
More example commands (eps=1):
python --checkpoint_dir [checkpoint-dir] \
--topk 200 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset mnist --train --max_eps 1 --train --thresh 0.7 --sigma 5000 --nopretrain \
--z_dim 50 --save_epoch --epoch 300 --save_vote --d_step 2 --pretrain_teacher 10 --stochastic --max_grad 1e-5
python --checkpoint_dir [checkpoint-dir] \
--topk 200 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset fashion_mnist --train --max_eps 1 --train --thresh 0.9 --sigma 5000 --nopretrain \
--z_dim 50 --nosave_epoch --epoch 300 --save_vote --d_step 2 --pretrain_teacher 10 --stochastic --max_grad 1e-5
python --checkpoint_dir [checkpoint-dir] \
--topk 700 --signsgd --norandom_proj --shuffle --teachers_batch 100 --batch_teachers 60 \
--dataset celebA-gender-train --train --max_eps 1 --train --thresh 0.85 --sigma 9000 --nopretrain \
--z_dim 100 --nosave_epoch --epoch 300 --save_vote --d_step 2 --pretrain_teacher 30 --stochastic --max_grad 1e-5
python --checkpoint_dir [checkpoint-dir] \
--topk 700 --signsgd --norandom_proj --shuffle --teachers_batch 100 --batch_teachers 80 \
--dataset celebA-hair-trn --train --max_eps 1 --train --thresh 0.9 --sigma 9000 --nopretrain \
--z_dim 100 --save_epoch --epoch 300 --save_vote --d_step 2 --pretrain_teacher 30 --stochastic --max_grad 1e-5
More example commands (eps=10):
python --checkpoint_dir [checkpoint-dir]/ \
--topk 300 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset mnist --train --max_eps 10 --train --thresh 0.2 --sigma 800 --nopretrain \
--z_dim 50 --nosave_epoch --epoch 300 --save_vote --pretrain_teacher 10 --d_step 2 --stochastic --max_grad 1e-5
python --checkpoint_dir [checkpoint-dir]/ \
--topk 350 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset fashion_mnist --train --max_eps 10 --train --thresh 0.27 --sigma 1000 --nopretrain \
--z_dim 64 --nosave_epoch --epoch 300 --save_vote --pretrain_teacher 10 --d_step 2 --stochastic --max_grad 1e-5
python --checkpoint_dir [checkpoint-dir]/ \
--topk 350 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset fashion_mnist --train --max_eps 10 --train --thresh 0.27 --sigma 1000 --nopretrain \
--z_dim 64 --nosave_epoch --epoch 300 --save_vote --pretrain_teacher 10 --d_step 2
python --checkpoint_dir [checkpoint-dir]/ \
--topk 500 --signsgd --norandom_proj --shuffle --teachers_batch 100 --batch_teachers 60 \
--dataset celebA-gender-train --train --max_eps 10 --train --thresh 0.12 --sigma 700 --nopretrain \
--z_dim 100 --nosave_epoch --epoch 300 --save_vote --pretrain_teacher 30 --d_step 2 --stochastic
python --checkpoint_dir [checkpoint-dir]/ \
--topk 500 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset celebA-hair-trn --train --max_eps 10 --train --thresh 0.25 --sigma 700 --nopretrain \
--z_dim 100 --nosave_epoch --epoch 300 --save_vote --pretrain_teacher 30 --d_step 2 --stochastic
Training Args
--ae: AE model name
(default: '')
--batch_size: The size of batch images [64]
(default: '30')
(an integer)
--batch_teachers: Number of teacher models in one batch
(default: '1')
(an integer)
--beta1: Momentum term of adam [0.5]
(default: '0.5')
(a number)
--checkpoint_dir: Directory name to save the checkpoints [checkpoint]
(default: 'checkpoint')
--checkpoint_name: checkpoint model name [checkpoint]
(default: 'checkpoint')
--[no]crop: True for cropping
(default: 'false')
--d_step: steps of the discriminator
(default: '1')
(an integer)
--data_dir: Root directory of dataset [data]
(default: '../../data')
--dataset: The name of dataset [cinic, celebA, mnist, lsun, fire-small]
(default: 'slt')
--delta: delta for differential privacy
(default: '1e-05')
(a number)
--epoch: Epoch for training teacher models
(default: '1000')
(an integer)
--[no]finetune_ae: Finetune ae
(default: 'false')
--g_epoch: Epoch for training the student models
(default: '500')
(an integer)
--g_step: steps of the generator
(default: '1')
(an integer)
--generator_dir: Directory name to save the generator
(default: 'generator')
--hid_dim: Dimension of the hidden layer
(default: '512')
(an integer)
--[no]increasing_dim: Increase the projection dimension for each epoch
(default: 'false')
--input_height: The size of image to use (will be center cropped).
(default: '32')
(an integer)
--input_width: The size of image to use (will be center cropped). If None, same value as input_height [None]
(default: '32')
(an integer)
--klevel: Levels of gradient quantization
(default: '4')
(an integer)
--[no]klevelsgd: Apply klevel sgd for gradient aggregation
(default: 'false')
--learning_rate: Learning rate for adam
(default: '0.001')
(a number)
--[no]load_d: True for loading the pretrained models w/ discriminator, False for not load [True]
(default: 'true')
--loss: AE reconstruction loss
(default: 'l1')
--max_eps: maximum epsilon
(default: '1.0')
(a number)
--max_grad: maximum gradient for signsgd aggregation
(default: '0.0')
(a number)
--[no]mean_kernel: Apply Mean Kernel for gradient aggregation
(default: 'false')
--[no]non_private: Do not apply differential privacy
(default: 'false')
--orders: rdp orders
(default: '200')
(an integer)
--output_height: The size of the output images to produce [64]
(default: '32')
(an integer)
--output_width: The size of the output images to produce. If None, same value as output_height [None]
(default: '32')
(an integer)
--[no]pca: Apply pca for gradient aggregation
(default: 'false')
--pca_dim: principal dimensions for pca
(a number)
--[no]pretrain: True for loading the pretrained models, False for not load [True]
(default: 'true')
--pretrain_teacher: Pretrain teacher for epochs
(default: '0')
(an integer)
--proj_mat: #/ projection mat
(default: '1')
(an integer)
--[no]random_label: random labels for training data, only used when pretraining some models
(default: 'false')
--[no]random_proj: Apply random projection for gradient aggregation
(default: 'true')
--sample_dir: Directory name to save the image samples [samples]
(default: 'samples')
--sample_step: Number of teacher models in one batch
(default: '10')
(an integer)
--[no]save_epoch: Save each epoch per 0.1 eps
(default: 'false')
--[no]save_vote: Save voting results
(default: 'false')
--[no]shuffle: Evenly distribute dataset
(default: 'true')
--sigma: Scale of gaussian noise for gradient aggregation
(default: '2000.0')
(a number)
--sigma_thresh: Scale of gaussian noise for thresh gnmax
(default: '4500.0')
(a number)
--[no]signsgd: Apply sign sgd for gradient aggregation
(default: 'false')
--[no]signsgd_dept: Apply sign sgd for gradient aggregation with data dependent bound
(default: 'false')
--[no]signsgd_nothresh: Apply sign sgd for gradient aggregation
(default: 'false')
--[no]simple_gan: Use fc to build GAN
(default: 'false')
--[no]sketchsgd: Apply sketch sgd for gradient aggregation
(default: 'false')
--[no]small: Use a smaller discriminator
(default: 'false')
--step_size: Step size for gradient aggregation
(default: '0.0001')
(a number)
--[no]stochastic: Apply stochastic sign sgd for gradient aggregation
(default: 'false')
--[no]tanh: Use tanh as activation func
(default: 'false')
--teacher_dir: Directory name to save the teacher [teacher]
(default: 'teacher')
--teachers_batch: Number of batch
(default: '1')
(an integer)
--thresh: threshold for thresh gnmax
(default: '0.5')
(a number)
--topk: Number of top k gradients
(default: '50')
(an integer)
--[no]train: True for training, False for testing [False]
(default: 'false')
--[no]train_ae: Train ae
(default: 'false')
--train_size: The size of train images [np.inf]
(default: 'inf')
(a number)
--[no]wgan: Train wgan
(default: 'false')
--y_dim: #/ y dim
(default: '10')
(an integer)
--z_dim: #/ z dim
(default: '100')
(an integer)
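As a rough illustration of how --topk, --signsgd, --sigma, and --thresh might interact, the sketch below compresses each teacher's gradient to the signs of its k largest-magnitude entries, sums the votes, adds Gaussian noise, and thresholds the result. This is an assumption-laden sketch of the general idea, not the code in this repository: the function names are hypothetical, `thresh` is assumed to be a fraction of the number of teachers, and the --stochastic and --max_grad options are not modeled.

```python
import random


def topk_sign(grad, k):
    """Keep the sign of the k largest-magnitude entries of grad; zero the rest."""
    top = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    out = [0] * len(grad)
    for i in top:
        out[i] = (grad[i] > 0) - (grad[i] < 0)  # sign as -1, 0, or 1
    return out


def noisy_vote(teacher_grads, k, sigma, thresh, rng=random):
    """Sum the teachers' top-k sign votes, add Gaussian noise, then threshold.

    Assumes `thresh` is a fraction of the teacher count (hypothetical reading).
    """
    dim = len(teacher_grads[0])
    compressed = [topk_sign(g, k) for g in teacher_grads]
    cutoff = thresh * len(teacher_grads)
    result = []
    for j in range(dim):
        noisy = sum(c[j] for c in compressed) + rng.gauss(0.0, sigma)
        result.append(1 if noisy >= cutoff else -1 if noisy <= -cutoff else 0)
    return result


# Three teachers that agree on the two dominant coordinates.
teachers = [[5.0, -4.0, 0.1, 0.2]] * 3
print(noisy_vote(teachers, k=2, sigma=0.0, thresh=0.7))
```

With the noise switched off (sigma=0.0), only coordinates on which enough teachers agree survive the threshold; raising sigma trades accuracy of the vote for privacy.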
Generating synthetic samples
python --checkpoint_dir [checkpoint_dir] --dataset [dataset_name]
Evaluate the synthetic records
We train a classifier on synthetic samples and test it on real samples. The evaluation scripts are in the evaluation directory.
For MNIST,
python evaluation/ --data [DP_data_dir]
For Fashion-MNIST,
python evaluation/ --data [DP_data_dir]
For CelebA-Gender,
python evaluation/ --data [DP_data_dir]
For CelebA-Hair,
python evaluation/ --data [DP_data_dir]
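The evaluation protocol (fit on synthetic data, score on real data) can be sketched with a toy classifier. The nearest-centroid model and the tiny two-dimensional data below are stand-ins chosen to keep the example self-contained; they are assumptions for illustration and say nothing about the classifiers the evaluation scripts actually train.

```python
def fit_centroids(samples, labels):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


def accuracy(centroids, samples, labels):
    """Fraction of samples whose predicted class matches the label."""
    hits = sum(predict(centroids, x) == y for x, y in zip(samples, labels))
    return hits / len(labels)


# Toy data: "synthetic" samples for training, "real" samples for testing.
synthetic_x = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]]
synthetic_y = [0, 0, 1, 1]
real_x = [[0.1, 0.1], [0.9, 0.9]]
real_y = [0, 1]

model = fit_centroids(synthetic_x, synthetic_y)
real_accuracy = accuracy(model, real_x, real_y)
```

The accuracy on real data measures how much useful signal the synthetic samples carry.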
The [DP_data_dir] is where your generated DP samples are located. In the Fashion-MNIST example above, we generated 10 batches of DP samples in $checkpoint_dir/{i}.pkl (i=0,...,9). During evaluation, pass the prefix of the data directory; the program will concatenate all of the generated DP samples and use them as the training data.
python evaluation/ --data $checkpoint_dir/
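The prefix-based loading described above can be sketched as follows. This is a minimal sketch under stated assumptions: `load_dp_batches` is a hypothetical helper, each batch file is assumed to contain a pickled sequence of samples, and files are assumed to be named by appending {i}.pkl to the prefix. The actual file contents and naming in the repository may differ.

```python
import os
import pickle
import tempfile


def load_dp_batches(prefix, num_batches=10):
    """Load {prefix}{i}.pkl for i in range(num_batches) and concatenate them."""
    samples = []
    for i in range(num_batches):
        with open(f"{prefix}{i}.pkl", "rb") as f:
            samples.extend(pickle.load(f))
    return samples


# Toy demonstration with two small batches in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "")  # directory path with a trailing separator
    for i, batch in enumerate([[1, 2], [3]]):
        with open(f"{prefix}{i}.pkl", "wb") as f:
            pickle.dump(batch, f)
    demo = load_dp_batches(prefix, num_batches=2)
```

Because the file name is built by string concatenation, the prefix must end with a path separator when it refers to a directory, which is why the command above passes $checkpoint_dir/ with a trailing slash.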