5 Privacy Enhancing Technologies in AI systems
After completing this chapter, you will:
- Acquire basic knowledge of privacy-preserving machine learning methods such as differential privacy, federated learning, and secure computation techniques such as homomorphic encryption and secure multiparty computation.
- Understand how synthetic data can help you create anonymous datasets and augment the size of training data.
- Evaluate the differences between PETs and understand what to consider when choosing the right approach.
Privacy enhancing technologies (PETs) are a fundamental component of any digital system that processes personal data. Although the title of this chapter is PETs in AI systems, we will not cover basic techniques, such as k-anonymization and basic differential privacy, that can be used in the preprocessing stage (see Appendix for a quick overview). Instead, our focus will be on PETs specifically designed for machine learning applications, emphasizing privacy-preserving techniques in the implementation of model training across various advanced methods. The main reference throughout this chapter is Chang et al. (2023).
5.1 The landscape of PETs
Covering the landscape of PETs extensively would require a book of its own. However, following the taxonomy proposed by Garrido et al. (2022), PETs can be organised along different layers that involve secure computation, secure storage, communication, and, more broadly, governance and policies.
The layers defined in Garrido et al. (2022) are visualised in Figure 5.1. Secure storage and communication solutions are not covered in this book; instead, we focus on approaches that modify the data along the MLOps pipeline, which fall under the general term of Anonymisation. Furthermore, these approaches can be combined with Secured and outsourced computing, where the actual computations or the computing environments are further secured.
5.2 Privacy-Preserving Machine Learning Techniques
Privacy-preserving machine learning (PPML) encompasses a set of methods and technologies designed to enable the training and deployment of machine learning models on sensitive data while maintaining stringent privacy standards. PPML is essential when working with data that contains personally identifiable information or other sensitive details, ensuring that the data’s integrity and confidentiality are preserved throughout the AI lifecycle.
A simple approach to working with personal data during the training stage of an AI model is to implement machine learning tasks on fully anonymized datasets. By definition, truly anonymous data is not personal data, and standard AI system development and deployment techniques can be used when there is no personal data. However, ensuring full anonymisation is a challenging task, if at all possible: while it is possible to minimise some of the personal data, full anonymisation can destroy important features and reduce the value of the original data. In such cases, privacy-preserving machine learning techniques are employed to enable secure training on sensitive data without sacrificing performance or privacy.
Our focus in this chapter is on techniques that ensure privacy when training on non-anonymized data. These methods account for privacy concerns throughout the entire training and inference process.
5.2.1 Differential Privacy
The most important PPML technique is differential privacy (DP). DP is a privacy-preserving approach that provides mathematical guarantees that individual data points in a dataset cannot be re-identified. By introducing carefully designed noise to the data or model parameters, DP ensures that the inclusion or exclusion of any single individual's data does not significantly affect the result of the analysis or the behaviour of the AI model, preserving privacy even in sensitive datasets. While we introduced perturbation in the previous chapter as a privacy enhancing technique, differential privacy follows the same logic; however, it can be applied at each stage of the ML pipeline.
Some possible examples where DP can be applied:
- Training Data: Noise is added directly to the dataset before training begins, limiting the model's capacity to memorize exact details of individual records. This is what we called perturbation in the previous chapter.
- Features: In some cases, DP can be applied to specific feature sets within the training data, particularly when sensitive information is embedded in particular features. For example, when features are extracted from images of faces, DP can be applied to further minimise the data.
- Model Parameters: Noise can also be injected into the model's weights and gradients during training. This approach, known as differentially private stochastic gradient descent (DP-SGD), limits how much each individual data point can influence the model, thus protecting sensitive information throughout the learning process. Weights can also be modified after training. A minimal sketch of this idea follows the list.
- Model Input/Output: While this chapter focuses on privacy preservation in the training and development stages, DP can also apply noise to the final model output or even to the input submitted by the user of the AI system.
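To make the DP-SGD idea concrete, below is a minimal, hypothetical sketch of a differentially private gradient update for a linear model using only NumPy: per-example gradients are clipped to a fixed L2 norm and Gaussian noise is added before averaging. The clipping bound, noise multiplier, learning rate, and toy dataset are illustrative assumptions, not tuned values, and no formal privacy accounting is performed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 100 examples, 5 features (illustrative values only)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

w = np.zeros(5)          # model weights
clip_norm = 1.0          # per-example gradient clipping bound C (assumption)
noise_multiplier = 1.1   # sigma, controls the privacy/utility trade-off (assumption)
lr = 0.1                 # learning rate (assumption)

for step in range(200):
    # Per-example gradients of the squared-error loss
    residuals = X @ w - y                       # shape (100,)
    per_example_grads = residuals[:, None] * X  # shape (100, 5)

    # Clip each example's gradient to L2 norm <= clip_norm
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale

    # Add Gaussian noise calibrated to the clipping bound, then average
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=w.shape)
    noisy_grad = (clipped.sum(axis=0) + noise) / len(X)

    w -= lr * noisy_grad

print("Learned (noisy) weights:", np.round(w, 2))
```

In practice, libraries such as Opacus (PyTorch) or TensorFlow Privacy implement DP-SGD together with a privacy accountant that tracks the cumulative privacy budget spent during training.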
More complex approaches are constantly being developed. For example, Wu et al. (2023) propose an approach that focuses on private predictions rather than private training; their approach is useful when the model cannot be retrained (e.g. an LLM such as OpenAI's GPT-4), as it queries the model with different subsets of the sensitive data and aggregates the results locally.
However, it is important to remember that while differential privacy provides robust privacy protections, it involves at least two trade-offs to consider:
Computational overhead: The process of adding noise, particularly when applied to model parameters, increases the computational demands of model training. This added complexity can impact training speed and may require additional computational resources.
Risk of Reduced Model Accuracy: Differential privacy, especially when noise is added to gradients or weights, can impact model accuracy. This effect is more pronounced in deep learning models, where subtle variations in weights can lead to significant performance changes. The level of noise added must be carefully balanced.
5.2.2 Federated Learning
Federated learning (FL) is a PPML technique that enables training across decentralized devices or servers without transferring raw data to a central location. Instead of collecting data in one place, FL allows local devices to collaboratively train a global model by sharing only model updates, such as gradients or weights, while the actual data remains stored on each individual device. This approach ensures that sensitive information is not exposed outside its origin, significantly enhancing data privacy and security.
The federated process involves:
- Local Training: Each device or server/computing node trains the model on its own dataset.
- Parameter Sharing: Instead of data, only the local model parameters are sent to the central server.
- Aggregation: The server aggregates these parameters to update the global model.
- Iterative Improvement: This cycle is repeated multiple times until the model reaches a satisfactory level of accuracy.
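To illustrate this cycle, the following sketch simulates federated averaging (FedAvg) for a simple linear model across three clients using NumPy. The client datasets, number of rounds, local epochs, and learning rate are hypothetical; a real deployment would run on networked devices with secure channels and, ideally, secure aggregation.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0, 0.5])

# Three clients, each with its own local (never shared) dataset
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

global_w = np.zeros(3)

def local_training(w, X, y, lr=0.05, epochs=5):
    """Plain gradient descent on the client's local data."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(X)
        w -= lr * grad
    return w

for round_ in range(20):
    # 1) Each client trains locally, starting from the current global model
    local_ws = [local_training(global_w, X, y) for X, y in clients]
    # 2) Only parameters are shared; 3) the server aggregates them
    sizes = np.array([len(X) for X, _ in clients])
    global_w = np.average(local_ws, axis=0, weights=sizes)
    # 4) The updated global model is sent back for the next round

print("Global model after federated averaging:", np.round(global_w, 2))
```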
Federated learning is widely used in sectors requiring stringent privacy standards, including:
- Healthcare: Medical institutions can collaboratively train predictive models across patient records without transferring sensitive health data.
- Finance: Banks and financial institutions can employ FL to share insights on fraud detection without exposing client information.
- Mobile Applications: FL allows companies like Google and Apple to improve smartphone AI features, such as predictive text and personalized recommendations, without uploading user data to a central cloud.
Despite its advantages, federated learning faces several challenges:
Communication Costs: As devices frequently communicate model updates with the central server, network bandwidth can become a bottleneck, especially in resource-limited environments.
Model Performance Degradation: The decentralized nature of federated learning can introduce data heterogeneity, where variations in data distributions across devices may impact model performance.
Security and Privacy Risks: While FL enhances privacy by keeping data local, vulnerabilities remain. Model updates may still reveal information about the training data, posing a risk for attacks like membership inference or model inversion. Techniques such as differential privacy and secure aggregation can be combined with FL to mitigate these risks.
5.2.3 Synthetic Data Generation
Synthetic data generation is a privacy-preserving approach that involves creating artificial datasets that mimic the statistical properties of real data without directly using or exposing any actual data points. This technique can serve as a powerful privacy-enhancing tool in machine learning, particularly when sensitive data cannot be shared or directly used for model training.
Synthetic data is typically generated using advanced machine learning models that learn and replicate the patterns and structures inherent in original datasets:
- Generative Adversarial Networks (GANs): GANs consist of two neural networks, a generator and a discriminator, that work together to produce synthetic data. The generator creates new data points, while the discriminator evaluates how similar the generated data is to real data, iteratively improving the quality of synthetic data. See Tanaka and Aranha (2019).
- Variational Autoencoders (VAEs): VAEs are another generative model that learns to encode data into a latent representation and then decodes it, generating new synthetic data points. VAEs are especially useful when generating data with complex structures or high-dimensional features.
- Differentially private synthesis models: By incorporating differential privacy mechanisms, these models add controlled noise during the data generation process, ensuring that the synthetic data remains statistically accurate while providing strong privacy guarantees.
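As a toy illustration of the core idea, the sketch below fits a very simple parametric model (a multivariate Gaussian) to a small "real" dataset and samples artificial records that preserve its means and correlations. This is far simpler than a GAN or VAE, and the dataset and values are invented purely for illustration; production-grade synthesizers would replace the Gaussian fit with a learned generative model.

```python
import numpy as np

rng = np.random.default_rng(2)

# "Real" dataset: 1,000 records with two correlated numeric attributes
# (e.g. age and income); the values here are purely illustrative.
age = rng.normal(45, 12, size=1000)
income = 1000 * age + rng.normal(0, 8000, size=1000)
real = np.column_stack([age, income])

# Learn summary statistics of the real data ...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ... and sample new, artificial records from the fitted distribution.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print("Real means:     ", np.round(real.mean(axis=0), 1))
print("Synthetic means:", np.round(synthetic.mean(axis=0), 1))
print("Real corr:      ", np.round(np.corrcoef(real, rowvar=False)[0, 1], 2))
print("Synthetic corr: ", np.round(np.corrcoef(synthetic, rowvar=False)[0, 1], 2))
```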
Synthetic data generation is widely used across fields where sensitive data is involved and privacy concerns are high: for example, in medical research synthetic data can be analysed and openly shared without compromising patient confidentiality. In retail and e-commerce, customer behaviour can be modelled with synthetic data. Synthetic data can also be used to enhance data diversity and mitigate biases present in the original data. Finally, in many scenarios lacking sufficient real-world data, synthetic data can be used to train and test AI systems by simulating rare or high-risk situations not easily captured through traditional data collection.
Synthetic data generation has considerable advantages, but it also involves trade-offs:
- Data utility vs. privacy: While synthetic data retains essential statistical properties of the original dataset, it may not fully capture all nuances of the real data, potentially limiting model accuracy and reliability.
- Risk of re-identification: Synthetic samples could inadvertently resemble original data too closely, posing re-identification risks. Differential privacy and strict validation processes help mitigate these risks, but depending on the case it could still be challenging to classify synthetic data as truly anonymous data.
- Computational overhead: High-quality synthetic data generation, particularly with GANs or VAEs, requires substantial computational resources, adding complexity to the data preparation and model training process.
The “Habsburg AI problem” refers to a challenge in synthetic data generation where the synthetic dataset unintentionally replicates distinctive, identifying features of the original dataset. This is named after the Habsburg jaw, a hereditary trait of the Habsburg royal family in Europe, which could be used to identify members within a population. In the context of privacy-preserving machine learning, if synthetic data too closely resembles specific individuals from the original data—akin to reproducing the “Habsburg jaw”—it risks re-identification, undermining privacy guarantees.
To address the Habsburg AI problem, synthetic data generation processes need to ensure that generated data reflects broader patterns without capturing rare or unique traits that could lead to identification of individuals. Techniques such as controlling similarity metrics or applying differential privacy to synthetic data models can mitigate this risk and improve the privacy-resilience of synthetic data.
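One simple way to "control similarity metrics", as mentioned above, is to compare each synthetic record to its nearest real record and flag near-copies for removal or regeneration. The sketch below is a hypothetical distance-based check; the standardisation choice and the threshold value are assumptions that would need to be tuned for the data at hand.

```python
import numpy as np

def flag_too_similar(real, synthetic, threshold=0.05):
    """Flag synthetic records whose nearest real record lies closer than
    `threshold` in Euclidean distance on standardised features.
    The threshold is an illustrative assumption."""
    # Standardise both sets using the real data's statistics
    mu, sigma = real.mean(axis=0), real.std(axis=0) + 1e-12
    r = (real - mu) / sigma
    s = (synthetic - mu) / sigma
    # Pairwise distances between synthetic and real records: (n_syn, n_real)
    dists = np.linalg.norm(s[:, None, :] - r[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    return nearest < threshold

# Usage (e.g. with the `real` and `synthetic` arrays from the earlier sketch):
# suspicious = flag_too_similar(real, synthetic)
# print(f"{suspicious.sum()} synthetic records are near-copies of real records")
```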
Recently this issue has gained attention in the ML community in a paper by Shumailov et al. (2024). The paper highlights the phenomenon of “model collapse,” where generative AI models trained recursively on data generated by previous models progressively lose fidelity to the original data distribution. Over successive generations, the models forget low-probability (tail) events and converge to narrower, less diverse outputs, leading to irreversible degradation of performance. The findings stress that maintaining access to original, human-generated data is critical for avoiding these issues. While synthetic data seems very promising, developers should avoid relying solely on model-generated data during training or at least should establish robust mechanisms for verifying and maintaining the quality and diversity of synthetic training datasets.
5.2.4 Methods based on secure computations: Homomorphic Encryption & Secure Multiparty Computation
The methods described above broadly belong to the category of "Anonymisation" in the taxonomy by Garrido et al. (2022). Those techniques can also be combined with methods related to the security of the computations, rather than to the privacy introduced by manipulating the data.
Homomorphic encryption (HE) is a cryptographic technique that allows computations to be performed directly on encrypted data without requiring decryption. This capability is especially valuable in privacy-preserving machine learning, as it enables data processing in cloud environments and other external systems without exposing the underlying data. Homomorphic encryption is based on the principle that mathematical operations performed on encrypted data produce an encrypted result that, when decrypted, matches the result of the same operations performed on the plaintext data. Despite its advantages, homomorphic encryption has a few limitations, mostly due to its performance overhead (slower compute times) and increased complexity.
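To give a feel for the underlying principle, the sketch below implements a toy version of the Paillier cryptosystem, a well-known additively homomorphic scheme: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The tiny primes are for illustration only; real systems use keys of a thousand or more bits and vetted libraries (e.g. Microsoft SEAL or OpenFHE for fully homomorphic schemes) rather than hand-rolled code.

```python
import math
import random

# Toy Paillier cryptosystem (additively homomorphic).
p, q = 61, 53                                   # tiny illustrative primes (assumption)
n = p * q
n_sq = n * n
g = n + 1                                       # standard simplification for the generator
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # Carmichael function of n
mu = pow(lam, -1, n)                            # modular inverse of lambda mod n

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    # L(x) = (x - 1) // n
    return ((pow(c, lam, n_sq) - 1) // n) * mu % n

c1, c2 = encrypt(42), encrypt(100)
# Multiplying ciphertexts corresponds to adding the underlying plaintexts
c_sum = (c1 * c2) % n_sq
print(decrypt(c_sum))  # 142, computed without ever decrypting c1 or c2 individually
```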
Somewhat related to federated learning, secure multiparty computation (SMPC) is another cryptographic approach that allows multiple parties to jointly compute a function over their combined inputs without revealing those inputs to each other. SMPC however differs from federated learning in several important ways:
- Data Distribution and Control: In federated learning, the model is trained locally on each participant’s device, and only the model parameters are shared with a central server. In contrast, SMPC does not involve local model training; instead, each participant’s data is split and distributed across multiple parties in a way that prevents any single party from reconstructing the original data.
- Joint Computation Model: While federated learning focuses on distributed model training across local datasets, SMPC is used for joint computations where all parties can contribute data without central aggregation. This enables SMPC to be highly secure for computations that require collaboration without data centralization.
- Privacy Guarantees: Both SMPC and federated learning aim to preserve privacy, but SMPC offers cryptographic guarantees that no individual party can view another’s data, even indirectly. In federated learning, privacy is maintained by restricting data access and centralizing only model updates.
Similarly to HE, SMPC can also suffer from increased computational complexity (more resources needed and longer computing times) and from additional communication costs: securely sharing data fragments and conducting joint computations involves substantial communication overhead, which can affect performance, especially in real-time or large-scale applications.
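The building block behind many SMPC protocols is secret sharing. The sketch below shows a toy additive secret-sharing scheme in which three parties compute the sum of their private values without revealing them to each other; the modulus and the "hospital counts" scenario are purely illustrative, and real protocols add authenticated channels and protection against malicious parties.

```python
import random

PRIME = 2_147_483_647  # large prime modulus (illustrative choice)

def share(secret, n_parties=3):
    """Split a secret into additive shares; any single share reveals nothing."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Three hospitals each hold a private count they do not want to reveal
private_counts = [120, 75, 230]

# Each party splits its value into shares and sends one share to every party
all_shares = [share(v) for v in private_counts]

# Party j locally sums the shares it received (column j); only the combined
# total is ever reconstructed, never any individual input
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]
total = sum(partial_sums) % PRIME
print(total)  # 425, computed without any party seeing another's raw count
```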
5.3 Comparison of Privacy-Preserving Techniques
Each privacy-preserving machine learning (PPML) technique discussed offers unique strengths and trade-offs. The following table compares differential privacy, federated learning, synthetic data generation, homomorphic encryption, and secure multiparty computation based on privacy guarantees, computational cost, use cases, and limitations.
| Technique | Privacy Guarantees | Computational Cost | Use Cases | Limitations |
|---|---|---|---|---|
| Differential Privacy | High - Adds noise to data, features, or model outputs to ensure individual data cannot be re-identified | Moderate to High | Government data releases, customer data analysis | Privacy-utility trade-off; may reduce model accuracy due to added noise |
| Federated Learning | Medium - Only model parameters are shared, data remains local | Moderate | Mobile apps, finance, healthcare | Vulnerable to inference attacks, requires frequent communication between devices |
| Synthetic Data Generation | High - Protects privacy by generating data that mirrors real data patterns without actual information | Moderate | Medical research, finance, regulatory reporting | Risk of re-identification if synthetic data closely resembles real data, potential reduction in data utility |
| Homomorphic Encryption | Very High - Allows computation on encrypted data without decryption | Very High | Secure cloud computation, medical imaging | Computationally intensive, high latency, challenging for real-time applications |
| Secure Multiparty Computation | Very High - Ensures data confidentiality by distributing data fragments for joint computation | High | Cross-institutional healthcare research, fraud detection | High communication overhead, complex implementation, requires trusted setup |
Each technique contributes uniquely to a comprehensive PPML system. Federated learning prioritizes data locality, differential privacy focuses on noise injection for individual privacy, homomorphic encryption allows secure computation in untrusted environments, secure multiparty computation enables collaborative computation without central data aggregation, and synthetic data generation provides artificial datasets that mirror real data without exposing actual records.
5.4 Privacy-preserving techniques in the AI development lifecycle
Embedding PETs throughout the AI development lifecycle is essential to ensure data protection at every stage. Different PETs offer unique advantages and might be useful only in specific phases of AI development. Choosing the appropriate PETs depends heavily on the specific context in which an AI system is deployed. Combining multiple PETs can often achieve a balance between privacy and usability. If you are unsure which PET to test first, differential privacy is the safest bet.
Consider the processing blocks of Figure 3.3: which of the presented PETs can be added to each processing block?
Note for the instructor: this can be a lengthy exercise that can also be assigned as homework. There is no single solution, since the answers also depend on the type of data / AI system that one is considering. The instructor can also assign different systems to different groups of students based on the properties of the data (e.g. one group considers an AI system that is processing tabular data, another group could work with text data, another with images, and so on).
5.5 Summary
In this chapter we covered the most important PETs that can be used when training AI models with sensitive data. Some of these approaches can also be adopted during the production phase of an AI system: even if you are not training an AI model from scratch, techniques such as differential privacy can ensure that input data (e.g. prompts to an LLM-based AI system) is transformed before querying the AI model. This chapter also concludes module 2; at this stage the training data is minimised, and sometimes even fully anonymised. In the next module we will focus on the development and deployment stages of the AI model and AI system.
| Question | Options |
|---|---|
| 1. What is the main goal of privacy-preserving machine learning (PPML)? | 1) To improve model accuracy. 2) To train and deploy AI models on sensitive data while maintaining privacy. 3) To centralize data for model training. 4) To reduce computational costs of training. |
| 2. Which of the following describes federated learning? | 1) A method to anonymize training data before model development. 2) A decentralized approach where raw data stays local and only model updates are shared. 3) A technique for adding noise to model parameters. 4) A cryptographic method for secure computations on encrypted data. |
| 3. What is a key challenge associated with federated learning? | 1) Ensuring data quality. 2) High communication costs due to frequent model updates. 3) Centralized data storage. 4) Lack of privacy protection for model parameters. |
| 4. Differential privacy protects individual data by: | 1) Encrypting all data during processing. 2) Adding carefully calibrated noise to data or model parameters. 3) Splitting data into shares distributed across multiple parties. 4) Generating synthetic data. |
| 5. What is the primary limitation of homomorphic encryption in machine learning? | 1) It cannot handle computations on encrypted data. 2) It is computationally expensive and introduces latency. 3) It requires centralized data storage. 4) It is incompatible with differential privacy. |
| 6. Secure multiparty computation (SMPC) differs from federated learning by: | 1) Training models locally on participant devices. 2) Using cryptographic protocols for joint computation without central aggregation. 3) Adding noise to protect data privacy. 4) Generating synthetic datasets for privacy. |
| 7. Which PPML technique generates artificial datasets mimicking real data? | 1) Differential privacy 2) Homomorphic encryption 3) Synthetic data generation 4) Secure multiparty computation |
| 8. A major risk of synthetic data generation is: | 1) High computational cost. 2) Data re-identification if synthetic data closely resembles real data. 3) Inability to capture statistical patterns of the original dataset. 4) Lack of support for high-dimensional data. |
| 9. Which PPML technique is most suitable for distributed data across multiple hospitals? | 1) Homomorphic encryption 2) Federated learning 3) Synthetic data generation 4) Secure multiparty computation |
| 10. What does the “Habsburg AI problem” refer to? | 1) Challenges in training models with encrypted data. 2) Synthetic data replicating unique traits from the original data, risking re-identification. 3) Communication bottlenecks in federated learning. 4) Loss of statistical utility in synthetic data. |
Solutions
Answer: 2) To train and deploy AI models on sensitive data while maintaining privacy.
Explanation: PPML ensures privacy throughout the AI lifecycle while enabling the use of sensitive data.
Answer: 2) A decentralized approach where raw data stays local and only model updates are shared.
Explanation: Federated learning trains models locally and shares updates instead of data, enhancing privacy.
Answer: 2) High communication costs due to frequent model updates.
Explanation: Frequent parameter sharing between devices and the server can increase network overhead in federated learning.
Answer: 2) Adding carefully calibrated noise to data or model parameters.
Explanation: Differential privacy uses noise to protect individual data while maintaining overall utility.
Answer: 2) It is computationally expensive and introduces latency.
Explanation: Homomorphic encryption is resource-intensive, limiting its use in real-time applications.
Answer: 2) Using cryptographic protocols for joint computation without central aggregation.
Explanation: SMPC ensures privacy by distributing data fragments and performing secure joint computations.
Answer: 3) Synthetic data generation
Explanation: Synthetic data generation creates artificial datasets that mimic real data for privacy-preserving purposes.
Answer: 2) Data re-identification if synthetic data closely resembles real data.
Explanation: Synthetic data can inadvertently replicate real data patterns, risking privacy breaches.
Answer: 2) Federated learning
Explanation: Federated learning allows distributed training across hospitals without sharing sensitive patient data.
Answer: 2) Synthetic data replicating unique traits from the original data, risking re-identification.
Explanation: The Habsburg AI problem highlights privacy risks when synthetic data too closely mirrors original datasets.