Awesome Privacy Engineering 
A curated list of resources related to privacy engineering
Content
- Courses
- Books
- Data Deletion and Data Subject Access Requests
- Privacy Tech Series
- Privacy Threat Modeling
- Machine Learning and Algorithmic Bias
- Facial Recognition
- De-Identification and Anonymization
- Homomorphic Encryption
- Tokenization
- Secure Multi-Party Computation
- Synthetic Data
- Differential Privacy and Federated Learning
- Designing for Trust with Users
- Dark Patterns in Digital Services
- Tagging Personally Identifiable Information
- Regulatory Resources
- Conferences
- Miscellaneous
- Other Awesome Privacy Curations
- Related Github Topics
Courses
- OpenMined Courses
  - Our Privacy Opportunity (Beginner) (7.7 hours)
  - Introduction to Remote Data Science (Intermediate) (8 hours)
  - Foundations of Private Computation (Intermediate) (60 hours)
  - Federated Learning on Mobile (Intermediate) (40 hours)
- Data Privacy and Anonymization in R - Datacamp course that covers publicly releasing data sets with a differential privacy guarantee.
- Data Privacy and Anonymization in Python - Datacamp course on learning to process sensitive information with privacy-preserving techniques.
- Secure and Private AI (Udacity) - Udacity course that covers how to extend PyTorch with the tools necessary to train AI models that preserve user privacy.
- Practical Data Ethics - This class was originally taught in-person at the University of San Francisco Data Institute in January-February 2020.
- Privacy-Conscious Computer Systems - This class at Brown University (CSCI 2390) focuses on how to design computer systems that protect users' privacy.
- Privacy by Design: Data Classification - LinkedIn Learning course by Nishant Bhajaria.
- Privacy by Design: Data Sharing - LinkedIn Learning course by Nishant Bhajaria.
- Implementing a Privacy, Risk, and Assurance Program - LinkedIn Learning course by Nishant Bhajaria.
- Data Protocol - Courses to teach developers and technical professionals how to build products responsibly and partner with platforms effectively.
- Carnegie Mellon University - Privacy Engineering Certificate - Four-week certificate program that revolves around a combination of mini-tutorials, class discussions, and hands-on exercises designed to ensure that students develop practical knowledge of all key privacy engineering areas.
Books
- The Privacy Engineer's Manifesto: Getting from Policy to Code to QA to Value (Michelle Dennedy, Jonathan Fox, Tom Finneran)
- Information Privacy Engineering and Privacy by Design: Understanding Privacy Threats, Technology, and Regulations Based on Standards and Best Practices (William Stallings)
- The Algorithmic Foundation of Differential Privacy (Cynthia Dwork, Aaron Roth)
- Building an Anonymization Pipeline: Creating Safe Data (Luk Arbuckle, Khaled El Emam)
- Strategic Privacy by Design (R. Jason Cronk)
- The Architecture of Privacy: On Engineering Technologies that Can Deliver Trustworthy Safeguards (Courtney Bowman, Ari Gesher, John K. Grant, Daniel Slate, Elissa Lerner)
- Data Privacy: A Runbook for Engineers (Nishant Bhajaria)
- Privacy Design Strategies (The Little Blue Book) (Jaap-Henk Hoepman)
- Data Privacy: What Enterprises Need to Know? (Deepak Gupta)
Data Deletion and Data Subject Access Requests
- Deleting Data Distributed Throughout Your Microservices Architecture - Microservices architectures tend to distribute responsibility for data throughout an organization. This poses challenges to ensuring that data is deleted.
- Handling Data Erasure Requests in Your Data Lake with Amazon S3 Find and Forget - Amazon S3 Find and Forget enables you to find and delete records automatically in data lakes on Amazon S3.
- How to Delete User Data in an AWS Data Lake - This post walks through a framework that helps you purge individual user data within your organization’s AWS hosted data lake, and an analytics solution that uses different AWS storage layers, along with sample code targeting Amazon S3.
- Best Practices: GDPR and CCPA Compliance Using Delta Lake - Article that describes how to use Delta Lake on Databricks to manage General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) compliance for a data lake.
- Klaro! - Klaro is a simple consent management platform (CMP) and privacy tool that helps you to be transparent about the third-party applications on your website.
- OpenDSR - A common framework enabling companies to work together to protect consumers' privacy and data rights (formerly known as OpenGDPR).
- PrivacyBot - PrivacyBot is a simple automated service to initiate CCPA deletion requests with data brokers.
- Fides - An open-source tool that allows you to easily declare your systems' privacy characteristics, track privacy related changes to systems and data in version control, and enforce policies in both your source code and your runtime infrastructure.
- Cookie Consent - An open-source, lightweight JavaScript plugin for alerting users about the use of cookies on a website. It is designed to help sites quickly comply with the European Union Cookie Law, CCPA, GDPR, and other privacy laws.
Privacy Tech Series by Lea Kissner
- Interface Design: The Who/What/Where Rule
- Vulnerability versus Incident
- Deidentification versus Anonymization
- Aggregating Over Anonymized Data
- Thinking Through ACL-Aware Data Processing
- Settings and Surfaces
- Comprehensible Access Control Lists
- Data Retention in a Distributed System
- Setting Data Retention Timelines
- Handling Human Names
Privacy Threat Modeling
- LINDDUN - The LINDDUN privacy engineering framework provides systematic support for the elicitation and mitigation of privacy threats in software systems.
- LINDDUN GO - LINDDUN GO is designed to give you a quick start to privacy threat modeling.
Machine Learning and Algorithmic Bias
- Pribot and Polisis - Polisis is a unique way of visualizing privacy policies. Using deep learning, it allows you to know what the company is collecting about you, what it is sharing, etc.
- Ethical Machine Learning - Spotting and Preventing Proxy Bias - Jupyter Notebook from rOpenSciLabs that explores several ways of detecting unintentional bias and removing it from a predictive model.
- Aequitas - An open-source bias audit toolkit developed by the Center for Data Science and Public Policy at the University of Chicago. It can be used to audit the predictions of machine-learning-based risk assessment tools, understand different types of bias, and make informed decisions about developing and deploying such systems.
- Fairness in Machine Learning Engineering - Google's Machine Learning Crash Course includes a 70-minute section on fairness.
- How to Incorporate Ethics and Risk into Your Machine Learning Development Process - To help highlight ethics and risk in machine learning, this article looks at the six steps involved in developing an ML system, what happens in each step, and the risk and ethics questions that arise.
- DrivenData: Deon - A command line tool to easily add an ethics checklist to your data science projects.
- People + AI Guidebook - A friendly, practical guide that lays out some best practices for creating useful, responsible AI applications.
- Why Some Models Leak Data - Machine learning models use large amounts of data, some of which can be sensitive. If they're not trained correctly, sometimes that data is inadvertently revealed.
- Datasets Have Worldviews - Every dataset communicates a different perspective. When you shift your perspective, your conclusions can shift, too.
- Measuring Fairness - How do you make sure a model works equally well for different groups of people?
- How Randomized Response Can Help Collect Sensitive Information Responsibly - Giant datasets are revealing new patterns in cancer, income inequality and other important areas. However, the widespread availability of fast computers that can cross reference public data is making it harder to collect private information without inadvertently violating people's privacy. Modern randomization techniques can help preserve anonymity.
- Can a Model Be Differentially Private and Fair? - Training with differential privacy limits the information that can be extracted about any one data point, but in some cases there is an unexpected side effect: reduced accuracy, with underrepresented subgroups disparately impacted.
- Hidden Bias - Models trained on real-world data can encode real-world bias. Hiding information about protected classes doesn't always fix things — sometimes it can even hurt.
- Fairlearn - A Python package to assess and improve fairness of machine learning models.
- InterpretML - A toolkit to help understand models and enable responsible machine learning.
- ML Privacy Meter - A tool to quantify the privacy risks of machine learning models with respect to inference attacks, notably membership inference attacks.
- Failure Modes in Machine Learning - Privacy concerns can include model inversion, membership inference attack, etc.
- Privacy Considerations in Large Language Models - The potential for models to leak details from the data on which they’re trained may be a concern for all large language models, and additional issues may arise if a model trained on private data were to be made publicly available.
- Explaining Decisions Made with AI - Guidance by the UK's Information Commissioner's Office (ICO) and The Alan Turing Institute aims to give organisations practical advice to help explain the processes, services and decisions delivered or assisted by AI, to the individuals affected by them.
- Considerations for Sensitive Data within Machine Learning Datasets - This Google Cloud article aims to highlight some strategies for identifying and protecting sensitive information, and processes to help address security concerns you might have with your machine learning data.
- Responsible AI Toolbox - Responsible AI Toolbox is a suite of tools from Microsoft that provides a collection of model and data exploration and assessment user interfaces that enable a better understanding of AI systems. The Toolbox consists of four dashboards: an Error Analysis dashboard, an Interpretability dashboard, a Fairness dashboard, and a Responsible AI dashboard.
- Of Oaths and Checklists - A checklist for people who are working on data projects, authored by DJ Patil, Hilary Mason, and Mike Loukides.
- Intro to AI Ethics - A Kaggle Learn course to explore practical tools to guide the moral design of AI systems.
- Failure Modes in Machine Learning - Documentation compiled by Microsoft regarding the different ways that machine learning can fail, both intentionally (through adversarial attack) and unintentionally (formally correct but completely unsafe outcome).
- Apple Privacy-Preserving Machine Learning Workshop 2022 - In June 2022, Apple hosted the Workshop on Privacy-Preserving Machine Learning (PPML), which brought Apple and members of the academic research communities together to discuss the state of the art in the field of privacy-preserving machine learning through a series of talks and discussions. This post includes highlights from workshop discussions and recordings of select workshop talks.
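The randomized response technique named in the list above can be illustrated with a minimal sketch. It assumes the classic two-coin variant (answer truthfully on the first heads, otherwise report a second coin flip); the function names and the simulated 30% population are hypothetical and are not taken from the linked article.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Return the true answer half the time, otherwise a fresh coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports) -> float:
    """Undo the noise: P(report yes) = 0.5 * p + 0.25, so p = 2 * observed - 0.5."""
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5


# Simulated survey (assumption): 30% of 10,000 respondents truly answer "yes".
rng = random.Random(0)
true_answers = [i < 3000 for i in range(10000)]
reports = [randomized_response(t, rng) for t in true_answers]
estimate = estimate_true_rate(reports)
print(f"estimated yes-rate: {estimate:.3f}")
```

No individual report reveals a respondent's true answer with certainty, yet the aggregate rate can still be estimated from the noisy reports.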
Facial Recognition
- Understanding Facial Detection, Characterization and Recognition Technologies (Future of Privacy Forum (FPF) Infographic)
- NIST Biometric Research Dataset - Stripped of identifying information and created expressly for research purposes, the data is designed primarily for testing systems that verify a person’s identity before granting access.
- Fawkes - A privacy-preserving tool against facial recognition systems, developed by researchers at SANDLab, University of Chicago.
De-Identification and Anonymization
- A Visual Guide to Practical Data De-Identification (FPF Infographic)
- NIST Privacy Engineering Program - De-Identification Tools
- Presidio - Context aware, pluggable and customizable PII anonymization service for text and images, developed by Microsoft.
- Redacting Sensitive Information with User-Defined Functions in Amazon Athena - Amazon Athena supports user-defined functions, a feature that enables you to write custom scalar functions and invoke them in SQL queries.
- AWS AI-Powered Health Data Masking - The AI-Powered Health Data Masking solution in the AWS Solutions Library helps healthcare organizations identify and mask health data in images or text.
- Anonymize Your Data Using Amazon S3 Object Lambda - Leverage AWS S3 Object Lambdas in order to anonymize data.
- Static Data Masking for Azure SQL Database and SQL Server - Microsoft's Static Data Masking is a data protection feature that helps users sanitize sensitive data in a copy of their SQL databases. It is compatible with SQL Server (SQL Server 2012 and newer), Azure SQL Database (DTU and vCore-based hosting options, excluding Hyperscale), and SQL Server on Azure Virtual Machines.
- Google Cloud Data Loss Prevention - Google Cloud's fully managed service designed to help you discover, classify, and protect sensitive data.
- ARX Data Anonymization Tool - ARX is a comprehensive open source software for anonymizing sensitive personal data.
- UTD Anonymization ToolBox - UT Dallas Data Security and Privacy Lab compiled various anonymization methods into a toolbox for public use by researchers.
- Kodex - An open-source toolkit for privacy and security engineering. It helps you to automate data security and data protection measures in your data engineering workflows.
- Data Anonymizer Extension for PostgreSQL - A set of SQL functions that remove personally identifiable values from a PostgreSQL table and replace them with random-but-plausible values.
- Anonimatron - Free, extendable, open source data anonymization tool.
- Anonymizer MySQL - A simple tool that lets you make an anonymized clone of your database.
- MySQL Data Anonymizer - MySQL Data Anonymizer is a PHP library that anonymizes your data in the database.
- Anonymizer - Anonymizer is a universal tool to create anonymized DBs for projects.
- Singapore Guide to Anonymization - The Singapore Personal Data Protection Commission (PDPC) has published the Guide on Basic Anonymization to provide more practical guidance for businesses on how to appropriately perform basic anonymization and de-identification of various datasets.
- Transforming Data in Google Cloud Platform - This reference covers the available de-identification techniques, or transformations, that can be applied in Google Cloud's Data Loss Prevention (i.e., redaction, replacement, masking, crypto-based tokenization, bucketing, date shifting, and time extraction).
- Measuring Re-Identification Risk / Privacy Definitions - A series of helpful blog posts by Damien Desfontaines on privacy definitions that attempt to quantify the level of risk associated with a dataset.
- Technical Privacy Metrics: a Systematic Survey - Paper by Isabel Wagner and David Eckhoff that discusses over 80 privacy metrics and introduces categorizations based on the aspect of privacy they measure, their required inputs, and the type of data that needs protection. They also present a method on how to choose privacy metrics based on nine questions that help identify the right privacy metrics for a given scenario.
- Data Anonymization Tool - The Singapore PDPC has launched a free Data Anonymization tool to help organizations transform simple datasets by applying basic anonymization techniques.
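Several resources above deal with measuring re-identification risk. As one illustration, the sketch below computes the k-anonymity level of a small table: the size of the smallest group of records that share the same quasi-identifier values. The helper name, column names, and records are hypothetical, and k-anonymity is only one of the privacy definitions those resources discuss.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    if not rows:
        raise ValueError("cannot compute k-anonymity of an empty table")
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical, already-generalized records (ZIP codes masked, ages bucketed).
records = [
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "diabetes"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
]

k = k_anonymity(records, ["zip", "age"])
print(f"table is {k}-anonymous over (zip, age)")
```

A low k means some individuals are nearly unique on their quasi-identifiers, which suggests that further generalization or suppression may be needed before release.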
Homomorphic Encryption
- Building Safe A.I.: A Tutorial for Encrypted Deep Learning - Blog post on how to train a neural network that is fully encrypted during training.
- Microsoft SEAL - Microsoft SEAL is an easy-to-use open-source (MIT licensed) homomorphic encryption library developed by the Cryptography and Privacy Research group at Microsoft.
- nGraph-HE: A Graph Compiler for Deep Learning on Homomorphically Encrypted Data - Intel Research proposes an extension to its deep learning compiler to operate on homomorphically encrypted data.
- Google Fully-Homomorphic-Encryption - This repository created by Google contains open-source libraries and tools to perform fully homomorphic encryption operations on an encrypted data set.
- Palisade Homomorphic Encryption Software Library - An open-source project that provides efficient implementations of lattice cryptography building blocks and homomorphic encryption schemes.
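To make the idea behind these libraries concrete, here is a toy sketch of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of their plaintexts. This is an illustration only. It is not the scheme or API of any library listed above (those implement lattice-based schemes), the function names are hypothetical, and the primes are far too small to be secure.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Build a toy Paillier key pair from two primes (insecure demo sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; requires Python 3.8+
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt m as (n + 1)^m * r^n mod n^2 with a random r coprime to n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n: int, private_key, c: int) -> int:
    """Recover m as L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiply ciphertexts to add the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)


n, private_key = keygen(1009, 1013)  # small demo primes, an assumption
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
decrypted_sum = decrypt(n, private_key, c_sum)
print(decrypted_sum)
```

The party performing `add_encrypted` never sees 20, 22, or their sum; only the holder of the private key can decrypt the result.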
Tokenization
- AWS Serverless Tokenization - Learn how to use Lambda Layers to develop a serverless tokenization solution in AWS.
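As a rough illustration of vault-based tokenization in general, the sketch below replaces a sensitive value with a random token and keeps the mapping in a lookup table. It is not the AWS solution linked above: the `TokenVault` class is hypothetical, and the in-memory dictionaries stand in for what would normally be an encrypted, access-controlled datastore.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to sensitive values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return a random token for the value, reusing it on repeat calls."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; raises KeyError for unknown tokens."""
        return self._token_to_value[token]


vault = TokenVault()
card_number = "4111111111111111"  # well-known test card number, not real data
token = vault.tokenize(card_number)
print(token)
```

Because the token is random rather than derived from the value, downstream systems that store only the token hold nothing that can be reversed without access to the vault.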
Secure Multi-Party Computation
- Private Join and Compute - Google's implementation of the "Private Join and Compute" functionality. This functionality allows two users, each holding an input file, to privately compute the sum of associated values for records that have common identifiers.
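For readers new to secure multi-party computation, the sketch below shows additive secret sharing, one of its simplest building blocks: several parties learn the sum of their inputs without any party seeing another's input. This is not the protocol used by Private Join and Compute; the modulus, function names, and input values are assumptions chosen for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus, large enough for the demo values


def share(secret: int, n_parties: int):
    """Split a secret into n random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares) -> int:
    """Recombine shares by summing them modulo MODULUS."""
    return sum(shares) % MODULUS


# Three parties each hold one private input (hypothetical salaries).
inputs = [52000, 61000, 58000]
all_shares = [share(x, 3) for x in inputs]

# Party j receives share j of every input and adds them up locally.
partial_sums = [sum(s[j] for s in all_shares) % MODULUS for j in range(3)]

# Publishing only the partial sums reveals the total, not the inputs.
total = reconstruct(partial_sums)
print(total)
```

Any single share is a uniformly random number, so a party that sees fewer than all shares of an input learns nothing about it.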
Synthetic Data
- Data Synthesizer - DataSynthesizer generates synthetic data that simulates a given dataset.
- Faker - Faker is a Python package that generates fake data for you.
- Pynonymizer - Pynonymizer is a universal tool for translating sensitive production database dumps into anonymized copies.
- Synthetic Data Vault - The Synthetic Data Vault (SDV) enables end users to easily generate Synthetic Data for different data modalities, including single table, multi-table and time series data.
- Synthetic Data Generation: Quality, Privacy, Bias (Workshop at ICLR 2021) - Workshop on the intersection of challenges regarding quality, privacy and bias in synthetic data generation.
- synthpop - R package for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis.
- Synthea - An open-source, synthetic patient generator that models the medical history of synthetic patients.
- Presidio Evaluator - Data Generator - This data generator takes a text file with templates (e.g. my name is [x]) and creates a list of InputSamples which contain fake PII entities instead of placeholders.
- Mimesis - Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.
- plaitpy - plait.py is a program for generating fake data from composable yaml templates.
Differential Privacy and Federated Learning
- A Friendly, Non-Technical Introduction to Differential Privacy - Blog post that provides simple explanations for the core concepts behind differential privacy.
- Differential Privacy at the U.S. Census Bureau - Video on how differential privacy is being implemented in the U.S. Census.
- Privacy-Preserving AI - Video on Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series
- PySyft - PySyft is a Python library for secure and private Deep Learning.
- CrypTen - CrypTen is a framework for Privacy Preserving Machine Learning built on PyTorch.
- Opacus - A library that enables training PyTorch models with differential privacy.
- Uber SQL Differential Privacy - This repository contains a query analysis and rewriting framework to enforce differential privacy for general-purpose SQL queries. (deprecated)
- Google Differential Privacy Library - This repository contains libraries to generate ε- and (ε, δ)-differentially private statistics over datasets.
- IBM's Differential Privacy Library - Diffprivlib is a general-purpose library for experimenting with, investigating and developing applications in, differential privacy.
- Microsoft's SmartNoise - This toolkit uses state-of-the-art differential privacy techniques to inject noise into data, to prevent disclosure of sensitive information and manage exposure risk.
- NIST Differential Privacy Blog Series - This series is designed to help business process owners and privacy program personnel understand basic concepts about differential privacy and applicable use cases and to help privacy engineers and IT professionals implement the tools.
- FedML - A federated learning and distributed training library backed by FedML, Inc. It supports large-scale geo-distributed training, cross-device federated learning on smartphones and IoT devices, cross-silo federated learning on data silos, and research simulation. Best Paper Award at NeurIPS 2020.
- FedJAX - Google's JAX-based open source library for federated learning simulations that emphasizes ease-of-use in research.
- diffpriv: Easy Differential Privacy - R package that is an implementation of major general-purpose mechanisms for privatizing statistics, models, and machine learners, within the framework of differential privacy of Dwork et al. (2006).
- sdcMicro: Statistical Disclosure Control Methods for Anonymization of Microdata and Risk Estimation - R package that can be used for the generation of anonymized (micro)data, i.e. for the creation of public- and scientific-use files.
- PPRL: Privacy Preserving Record Linkage - R package that is a toolbox for deterministic, probabilistic and privacy-preserving record linkage techniques.
- PipelineDP - Write fast, flexible pipelines that use modern techniques to aggregate user data in a privacy-preserving manner.
- Practical Differential Privacy w/ Apache Beam - Blog post showing how to use Privacy on Beam from Google's differential privacy library.
- Computing Private Statistics with Privacy on Beam - This Google Developer Codelab walks through the use of Privacy on Beam to perform differentially private analysis.
- FLUTE - Created by Microsoft Research, Federated Learning Utilities and Tools for Experimentation (FLUTE) is a framework for running large-scale offline federated learning simulations.
- TensorFlow
  - TensorFlow Privacy - Python library that includes implementations of TensorFlow optimizers for training machine learning models with differential privacy.
  - TensorFlow Federated - TensorFlow Federated (TFF) is an open-source framework for machine learning and other computations on decentralized data.
  - TensorFlow Encrypted - TF Encrypted is a framework for encrypted machine learning in TensorFlow.
  - CrypTFlow - CrypTFlow is a system for end-to-end secure inference of deep neural networks written in TensorFlow.
- Four-Episode Podcast on Differential Privacy by This Week in Machine Learning and AI
- Episode of This Week in Machine Learning and AI Podcast:
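As a concrete reference point for the ε-differentially private statistics mentioned in this section, here is a minimal sketch of the Laplace mechanism applied to a counting query. It does not use the API of any library listed above; the function names and sample data are hypothetical. Naive floating-point noise like this is known to be vulnerable to precision-based attacks, so production systems should rely on a vetted library instead.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count satisfying epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 67, 38, 44, 31, 58]  # hypothetical dataset
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
print(f"noisy count of people aged 40+: {noisy:.2f}")
```

Smaller values of `epsilon` add more noise and give stronger privacy; larger values give more accurate answers with weaker guarantees.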
Designing for Trust with Users
- Data Permissions Catalogue - Catalogue created by the data consultancy IF to help teams make decisions about how, when, and why to collect and use data about people.
- Privacy Patterns - UC Berkeley collection of design patterns attempting to standardize language for privacy-preserving technologies, document common solutions to privacy problems, and help designers identify and address privacy concerns.
- How to Protect Your Users with the Privacy by Design Framework - Developers can help to defend their users’ personal privacy by adopting the Privacy by Design (PbD) framework.
- The UX Guide to Getting Consent - Short guide by the International Association of Privacy Professionals (IAPP) about obtaining consent under the EU's GDPR.
- Creepiness-Convenience Tradeoff - As people consider whether to use the new "creepy" technologies, they do a type of cost-benefit analysis weighing the loss of privacy against the benefits they will receive in return.
- Building a Privacy Policy Users Actually Want to Read - Creation of a user-friendly privacy notice through privacy journeying and using a layered notice approach.
- Contract Design Pattern Library - Library of guidelines, explanations, and examples to inspire and support you in exploring user-friendly approaches to contract simplification and visualization.
- Privacy UX Series in Smashing Magazine:
- Lean Privacy Review - Carnegie Mellon University researchers developed a fast, easy method to catch privacy issues early in a system’s development process by gathering feedback from users.
Dark Patterns in Digital Services
- Dark Patterns - Dark patterns are tricks used in websites and apps that make you do things that you didn't mean to do.
- The Dark Side of UX Design - Practitioner-identified examples of stakeholder values superseding user values.
Tagging Personally Identifiable Information
- Managing Tags in AWS Resource Groups - Tags are words or phrases that act as metadata that you can use to identify and organize your AWS resources. A resource can have up to 50 user-applied tags.
- Categorizing Your AWS S3 Storage Using Tags - In addition to data classification, tagging offers benefits such as fine-grained access control of permissions and object lifecycle management.
- Detecting PII Using Amazon Comprehend - To detect entities that contain personally identifiable information (PII) in a document, use the Amazon Comprehend DetectPiiEntities operation.
- Quickstart for Tagging Tables in Google Cloud - Tutorial shows how to create a BigQuery dataset, copy data to a new table in your dataset, create a tag template, and attach the tag to your table.
- Using Policy Tags in Google Cloud's BigQuery - Use policy tags to define access to your data, for example, when you use BigQuery column-level security.
- Adding a Tag-Based PII Policy in Cloudera - How to add a PII tag-based policy. In this example, the author creates a tag-based policy for objects tagged "PII" in Atlas.
Regulatory Resources
- Global Comprehensive Privacy Law Mapping Chart - The IAPP's Westin Research Center has created this chart mapping several comprehensive data protection laws.
- US State Privacy Legislation Tracker - The IAPP Westin Research Center actively tracks the proposed and enacted comprehensive privacy bills from across the United States.
- Privacy in M&A transactions: The Playbook - The playbook is directed to mergers and acquisitions (M&A) and privacy teams to help identify potential privacy-related issues.
- European Data Protection Supervisor Website Evidence Collector - The Website Evidence Collector tool automates the collection of evidence of personal data processing, such as cookies, or requests to third parties.
- GDPR Developer Guide - In order to assist web and application developers in making their work GDPR-compliant, France's Data Protection Agency, the CNIL, has drawn up a guide of best practices.
- Data Protection/Privacy Mapping Project - Microsoft's Data Protection/Privacy Mapping Project facilitates consistent global comprehension and implementation of data protection with an open source mapping between ISO/IEC 27701 and global data protection and/or privacy laws and regulations.
- European Data Protection Board Guidelines 4/2019 on Article 25, Data Protection by Design and by Default - This document gives general guidance on the obligation of Data Protection by Design and by Default set forth in Article 25 in the GDPR.
- A Guide to Privacy by Design - This document by Spain's Data Protection Agency, AEPD, provides guidance on implementation of Privacy by Design into systems and applications.
Conferences
Miscellaneous
- The World of Geolocation Data (FPF Infographic)
- Data and the Connected Car (FPF Infographic)
- Microphones and the Internet of Things (FPF Infographic)
- GDPR – A Practical Guide For Developers
- W3C Self-Review Questionnaire: Security and Privacy
- Privacy is an Afterthought in the Software Lifecycle. That Needs to Change
- How Uber is Approaching Data Privacy Architecture
- Microsoft - Code with Engineering Playbook: Privacy Fundamentals
- Private AI - Privacy Enhancing Technologies (PETs) Decision Tree