4 Personal data management in AI systems in practice
After completing this chapter, you will:
- Understand the key components of data management and data governance when dealing with personal data and how they impact AI systems, from data collection to data preparation.
- Recognize the importance of data preprocessing for minimising personal data, not only in the data used to train the AI model but in all other personal data processed by the AI system.
- Learn about the challenges and techniques in data versioning, transparency, and managing personal data throughout the AI lifecycle.
In this chapter, we will explore the various aspects of (personal) data management and data preparation in AI systems. This corresponds to the “data engineering” stage in the typical MLOps workflow. The figure below is extracted from Figure 3.3.
Let’s now expand on all the elements of Figure 4.1.
4.1 Forming the “Raw Data”
Raw data is the primary source of data and is often collected from diverse sources. This data can come from internal sources within a company, such as customer databases or transaction records. It might also include registry data from external organizations like health institutions, which can provide structured and reliable datasets. Additionally, data can be sourced from the web through techniques like web scraping, gathering vast amounts of information from publicly available sources. This collected information forms what is known as “raw data,” a foundational element used to train AI models. Let’s consider the various ways of gathering the raw data.
4.1.1 Data acquisition
The initial phase of data flows within an MLOps workflow is data acquisition, which involves gathering raw data for subsequent processing and use in machine learning systems. According to ISO/IEC 22989:2022, data acquisition can be categorized into three main types: 1) first-party data collection, 2) third-party data collection, and 3) data querying.
4.1.1.1 First-party personal data collection
First-party data collection refers to the process where personal data is collected directly by the organization. This typically involves data for which the organization acts as the data controller, maintaining a direct relationship with the data subject. For instance, this could include customer data collected through an organization’s own platforms or services. While the direct control over data provides clarity regarding its origin, it is crucial to note that reusing such data for machine learning purposes still requires a legal basis under the General Data Protection Regulation (GDPR). Organizations may rely on explicit consent from the data subject or invoke other legal bases, such as legitimate interests. However, the appropriate legal basis for reusing personal data for training machine learning systems remains a subject of ongoing debate within the European Union, with additional guidelines anticipated to provide further clarity.
4.1.1.2 Third-party personal data collection
Third-party data acquisition involves obtaining data from external sources, such as vendors, aggregators, or partnerships. Organizations using this data to train machine learning models must ensure transparency and legal compliance. This includes verifying that the third-party provider has obtained valid consent or adheres to other lawful bases for processing personal data under GDPR.
4.1.1.3 Data querying
The third approach to data acquisition, as outlined in ISO/IEC 22989, is data querying. This involves retrieving data by performing queries and combining datasets, which may include both first-party and third-party data sources. Data querying is a powerful technique that enables the integration of diverse datasets to create richer, more comprehensive data for machine learning purposes.
However, in the context of personal data, data querying introduces unique challenges and risks under the GDPR. Specifically, combining datasets containing personal data can inadvertently lead to the exposure of additional personal information. This occurs when the combination of datasets reveals insights or details that were not part of the original purpose for which the data was collected. For example, linking datasets from different sources could allow the identification of individuals or the inference of sensitive information that was not initially intended to be processed.
The GDPR emphasizes the importance of minimizing risks related to such activities. While the regulation does not explicitly prohibit data querying, it requires organizations to carefully evaluate and mitigate risks through measures such as data protection impact assessments (DPIAs), purpose limitation, and ensuring the compatibility of new processing activities with the original legal basis. Organizations must remain vigilant to prevent unintended data disclosures and ensure that the merged data remains compliant with privacy principles.
Data querying, while valuable for improving machine learning workflows, comes with the need for robust data governance practices to balance innovation with the protection of fundamental rights.
4.1.2 Data from other stages of the AI system/model development
In the MLOps workflow, other personal data obtained from other stages of the workflow can be integrated into the raw data. One common scenario involves generated data from users of the AI system that is being developed. For instance, during the use of an AI system, user inputs and outputs may be monitored and subsequently incorporated into raw data for future training or fine-tuning. While this practice can improve model accuracy and relevance, organizations must ensure transparency and obtain valid consent or establish a lawful basis for this processing.
Another significant source of personal data comes from annotators in workflows such as reinforcement learning from human feedback (RLHF). Annotators play a critical role in refining AI models, particularly after the pre-training stage of large language models. During the annotation process, their feedback—often textual or descriptive—may inadvertently include personal data. For example, annotations could reveal identifying details about the annotators themselves or the content they describe. If this data is integrated into the raw dataset, organizations must carefully assess whether it complies with GDPR requirements and ensure that annotators’ rights are respected.
A third source of personal data is external fine-tuning datasets. Fine-tuning involves adapting pre-trained models to specific use cases using additional datasets. These datasets, often acquired from external sources, may contain personal data not present in the original raw data.
Consent management plays a critical role in ensuring that the processing of personal data in MLOps workflows complies with data protection regulations, particularly the GDPR. ISO 27560 provides a comprehensive framework for consent management, detailing how organizations can manage consent dynamically, especially in machine-actionable formats. These tools enable tracking, updating, and maintaining consent for all data subjects whose personal data contributes to the raw datasets used in machine learning systems. While obtaining consent might not be feasible for all the data subjects involved in the AI system development, it is important to understand the different groups of data subjects involved:
- Data subjects from first-party data collection: These are individuals whose data is collected directly by the organization, where the organization acts as the data controller and has direct interactions with the data subjects.
- Data subjects from third-party data collection: These include individuals whose data is obtained from external sources, such as data vendors or partnerships. Here, the organization must ensure that third-party data providers have processed the data lawfully.
- Users of the AI system: When monitoring an AI system’s usage, the inputs and outputs of its users may be aggregated into the raw data for retraining or fine-tuning. These users are also data subjects whose consent must be appropriately managed.
- Annotators: Annotators involved in processes like reinforcement learning from human feedback contribute data that might contain personal information. As data subjects, their consent for the inclusion of their annotations must also be managed.
- Data subjects from fine-tuning datasets: Additional datasets used for fine-tuning AI models may involve yet another group of data subjects, requiring organizations to evaluate and manage consent for this specific context.
- Optional: in certain AI systems, additional external data is combined during system use. For example, with generative AI based on large language models, so-called RAG pipelines might include yet another dataset containing personal data from individuals, and its use also requires an appropriate legal basis.
Given the diversity of data subjects and the dynamic nature of consent, organizations face significant challenges in implementing effective consent management. Some organizations attempt to simplify this process by relying on legal bases other than explicit consent, such as legitimate interest. However, this approach is contentious, particularly when consent is implied or when organizations include data subjects by default without explicitly informing them.
To address these challenges, aligning with ISO 27560 could help organisations properly manage the consent of the various data subjects involved. Consent-management tools built on this framework enable transparency, ensure data subjects can easily opt out or withdraw consent, and support compliance with the GDPR’s principles of accountability and data minimization. Furthermore, adopting a default opt-out mechanism, where data subjects must explicitly decide to participate, is often seen as a more ethical and transparent approach to consent management. Finally, with complex machine learning models such as deep neural networks, withdrawing consent does not guarantee that the AI model no longer stores some data about the data subject who has withdrawn. This challenge is described in the advanced cases section “Machine Unlearning”.
Considering the categorisation of the various types of data that can form the initial raw data in an MLOps pipeline, it is difficult to say whether web scraping constitutes third-party data collection or data querying. Web scraping refers to the automated extraction of data from websites, often without notice or consent from the individuals whose personal data is involved. This practice has become foundational for the digital economy, especially for AI development, enabling large-scale data collection at low cost. In the article “The Great Scrape: The Clash Between Scraping and Privacy” (Solove and Hartzog 2025), the authors argue that scraping often conflicts with fundamental privacy principles and laws, particularly when personal data is involved.
Several challenges arise when using web-scraped data for AI training. First, the practice typically violates key principles of privacy law, such as fairness, transparency, consent, and data minimization, as emphasized by frameworks like the GDPR. Scrapers frequently collect data without informing individuals, specifying its intended use, or offering the option to opt out. This undermines individual rights and introduces privacy risks. Second, web scraping of personal data often lacks a proper legal basis under the GDPR. Personal data made publicly available online does not equate to consent for its reuse. The indiscriminate collection of such data, especially for high-risk applications like facial recognition, can lead to privacy violations, increased surveillance, and misuse.
4.2 Pre-processed data
The next type of data encountered in the MLOps workflow is pre-processed data, which is derived from raw data after undergoing a data pre-processing stage. When dealing with personal data, this stage is not simply about cleaning, formatting, or rearranging the data. It often requires the application of privacy-enhancing technologies (PETs) and techniques to ensure compliance with data protection regulations, by mitigating risks such as data re-identification, unauthorized access, and potential misuse.
In this section, we will briefly explore some of the key privacy-enhancing techniques described in chapter 7 of the book “Privacy Preserving Machine Learning” by Chang et al. (2023).
4.2.1 Data Sanitisation in the Pre-Processing Stage
When handling personal data, the data pre-processing stage often includes a step referred to as data sanitization. The goal of data sanitization is to reduce the risk of re-identification by removing or transforming both direct identifiers (e.g., names, social security numbers) and quasi-identifiers.
Quasi-identifiers are pieces of information that, while not uniquely identifying on their own, can be combined with other datasets to identify individuals. Examples include attributes like date of birth, gender, or ZIP code, which, when cross-referenced with other data, could reveal the identity of a data subject.
The methods presented here for data sanitisation are: generalisation, suppression, perturbation, and anatomisation.
4.2.1.1 Generalisation
Generalization involves replacing specific values in a dataset with more general attributes to reduce the granularity of the data. This technique ensures that individual records become less identifiable while retaining the utility of the data.
For example, a numerical value, such as a person’s salary of €45,000, can be replaced with a range, such as €40,000–€50,000. Similarly, categorical data can also be generalized. For example, the specific occupation “software engineer” could be generalized to “information technology professional” or further generalized to simply indicate “employed” versus “unemployed”. Generalization is particularly useful when handling datasets with quantitative values or categorical attributes that could serve as quasi-identifiers. A common framework used with generalisation is k-anonymity, which ensures that every record shares its generalised quasi-identifier values with at least k-1 other records in the dataset. When trying to k-anonymise multiple quasi-identifiers at once, it might not be possible to fulfil the desired level of k; in that case, suppression is a better technique to adopt.
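As a minimal sketch of generalisation with pandas (the column names, values, and bin ranges below are purely illustrative):

```python
import pandas as pd

# Illustrative dataset: the column names and values are hypothetical.
df = pd.DataFrame({
    "salary": [45000, 52000, 38000, 61000, 47000],
    "occupation": ["software engineer", "nurse", "teacher",
                   "data scientist", "accountant"],
})

# Generalise a numerical value into €10,000-wide ranges.
df["salary_range"] = pd.cut(
    df["salary"],
    bins=[30000, 40000, 50000, 60000, 70000],
    labels=["30k-40k", "40k-50k", "50k-60k", "60k-70k"],
)

# Generalise categorical occupations into broader groups.
occupation_map = {
    "software engineer": "IT professional",
    "data scientist": "IT professional",
    "nurse": "healthcare professional",
    "teacher": "education professional",
    "accountant": "finance professional",
}
df["occupation_general"] = df["occupation"].map(occupation_map)

# k-anonymity check: every generalised combination should appear
# at least k times; combinations below k need coarser generalisation.
print(df.groupby(["salary_range", "occupation_general"], observed=True).size())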
4.2.1.2 Suppression
The second technique for personal data sanitization is suppression, which focuses on completely removing specific items or attributes from a dataset. While generalization replaces detailed data with broader categories, suppression eliminates data elements entirely, making them unavailable for future stages of the MLOps workflow.
Suppression is often applied to direct identifiers, such as names, social security numbers, or other sensitive information that could immediately identify an individual. For instance, in a dataset of hospital medical records, identifiers like names or patient IDs may be suppressed by removing the column containing this data entirely. If such identifiers are embedded within text, techniques like Named Entity Recognition (NER) can be used to detect and mask these identifiers within unstructured text fields.
The implementation of suppression varies depending on the data type, as direct identifiers take different forms across datasets. Here are a few examples:
- Tabular Data: Direct identifiers such as names or IDs can be dropped as columns or replaced with null values.
- Text Data: Using NER techniques, names, surnames, or other identifiers can be identified and redacted, replacing them with placeholders or removing them altogether.
- Images or Videos: For visual media containing people, direct identifiers like faces can be masked by applying techniques such as pixelation, blurring, or covering faces with black squares. However, it’s important to recognize that suppression in this case may not fully anonymise the data. For example, an individual might still be identifiable by their gait (walking style), clothing, unique tattoos, or other distinguishing features.
- Medical Imaging: In datasets such as MRI scans, some direct identifiers are present in the metadata of the files (e.g. patient ID) and the images themselves might include facial features. A common suppression technique here is de-facing, which involves obscuring facial structures to prevent identification while retaining the medically relevant parts of the scan.
While suppression reduces the risk of direct re-identification, it does not guarantee full anonymity.
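Here is a minimal sketch of suppression on tabular data with pandas; the column names are hypothetical, and for unstructured text an NER pipeline would take the place of the column operations shown here:

```python
import pandas as pd

# Illustrative medical records with direct identifiers.
df = pd.DataFrame({
    "name": ["Alice Rossi", "Bob Meier"],
    "patient_id": ["P-001", "P-002"],
    "age": [34, 51],
    "diagnosis": ["asthma", "diabetes"],
})

# Suppression of direct identifiers: drop the columns entirely...
suppressed = df.drop(columns=["name", "patient_id"])

# ...or keep the schema intact but replace the values with nulls.
nulled = df.assign(name=pd.NA, patient_id=pd.NA)
```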
4.2.1.3 Perturbation
Perturbation is another key data sanitization technique, designed to transform individual records while preserving the overall statistical properties of the dataset. By introducing randomness or noise, perturbation minimizes the risk of re-identification while retaining the dataset’s utility.
Perturbation techniques replace original data values with altered or generated ones, ensuring that individual records cannot be easily linked back to specific data subjects. This transformation is often achieved through:
- Noise Addition: Modifying data by introducing random noise, either additively or multiplicatively, to distort the original values. For instance:
- Additive noise: Adding random values to numerical data, such as increasing or decreasing ages or salaries by a small random amount.
- Multiplicative noise: Scaling values by a random factor, such as slightly adjusting percentages or measurements.
- Synthetic Data Generation: Building a statistical model based on the original data and generating a synthetic dataset that mirrors the statistical properties of the original. This approach creates “fake” records that cannot be traced back to real individuals but still reflect patterns in the original dataset.
- Data Swapping: Exchanging attributes between records within the dataset. For example, in tabular data, attributes like age or gender can be shuffled between records to unlink specific identifiers from their original context. This adds uncertainty while preserving the dataset’s aggregate patterns.
One application of perturbation is differential privacy, a framework for introducing noise in a controlled manner to provide strong guarantees of privacy. Differential privacy ensures that the inclusion or exclusion of any individual in the dataset does not significantly impact the results of data analysis. As we will see in later chapters, differential privacy can be applied at all stages of the MLOps workflow, making it a powerful technique for protecting all data and computations happening in the AI system lifecycle.
Perturbation, like all other data sanitisation techniques, comes with the limitation of carefully balancing between privacy and data utility. Overly aggressive noise addition or data transformation can render the data less useful for analysis or modeling, while insufficient perturbation may fail to provide adequate privacy protection.
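As a minimal sketch, additive noise can be implemented with NumPy; the attribute and noise scale below are illustrative, and the Laplace distribution is the same noise mechanism that underpins differential privacy:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical numerical attribute: ages of the data subjects.
ages = np.array([23, 45, 31, 67, 52, 38], dtype=float)

# Additive noise drawn from a Laplace distribution.
scale = 2.0  # larger scale = stronger privacy, lower utility
noisy_ages = ages + rng.laplace(loc=0.0, scale=scale, size=ages.shape)

# Individual values change, but aggregate statistics stay close.
print(ages.mean(), noisy_ages.mean())
```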
4.2.1.4 Anatomisation
A further sanitization method is anatomisation. With anatomisation the goal is to divide sensitive attributes and quasi-identifiers into two separate datasets, making it more difficult to link individual records together and re-identify the individual subject, while the original values remain the same. For example, a tabular dataset with columns “age, post code, gender, diagnosis” could be split into two unlinked tabular datasets (one with only “age and post code” and the other with “gender and diagnosis”). Similar limitations as discussed for the other sanitisation techniques also apply in this case.
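A minimal sketch of anatomisation with pandas, using the illustrative columns from the example above:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, 29],
    "post_code": ["1010", "8000", "3011"],
    "gender": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "asthma"],
})

# Split quasi-identifiers and sensitive attributes into two tables.
# Shuffling one table and dropping the index removes the row-order link.
quasi = (df[["age", "post_code"]]
         .sample(frac=1, random_state=1)
         .reset_index(drop=True))
sensitive = df[["gender", "diagnosis"]].reset_index(drop=True)
```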
Let’s reuse the open dataset available with the fantastic open book “Programming Differential Privacy” (Near and Abuah 2021).
The data consists of 1000 subjects from the US Adult census, available as a CSV file (url: https://programming-dp.com/ch1.html#preliminary). Your task is to apply some of the techniques listed in this section. You can use a spreadsheet program, or the Python programming language with a dataframe library like Pandas or Polars.
Consider the same dataset as in Exercise 4.1 and remove the columns that can be used as a unique identifier (e.g. Name, DOB, SSN, Zip) and keep those that describe the data subjects with quasi identifiers (e.g. Workclass, Education, Marital status, Occupation, Race, Country).
Which combinations of values form a fingerprint for one of the 1000 subjects in the dataset? For example, if you filter with Race == Amer-Indian and Country == Mexico, there is only one person with these characteristics in the dataset. Can you find more combinations?
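A possible starting point in Python, assuming the CSV has been downloaded locally (the file name and column names may need adjusting to the actual dataset):

```python
import pandas as pd

# Assumes the census CSV from the book has been saved locally.
df = pd.read_csv("adult_with_pii.csv")

quasi_identifiers = ["Workclass", "Education", "Marital Status",
                     "Occupation", "Race", "Country"]

# Count how many subjects share each combination of quasi-identifiers.
counts = df.groupby(quasi_identifiers).size().reset_index(name="count")

# Combinations with count == 1 single out exactly one subject.
fingerprints = counts[counts["count"] == 1]
print(len(fingerprints), "fingerprint combinations found")
```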
4.2.2 Other data pre-processing techniques: filtering, normalisation, imputation
In addition to privacy-enhancing technologies, other data pre-processing techniques are essential when preparing data for AI models. Here we will cover those highlighted in ISO/IEC 22989.
4.2.2.1 Filtering
Filtering involves selecting or excluding specific data points based on predefined criteria. This process helps refine the dataset to ensure it aligns with the goals of the machine learning model being developed. When working with personal data, filtering serves several purposes:
- Removing Outliers: Outliers are extreme data points that deviate significantly from the rest of the dataset. In machine learning, such data points can distort the model’s training process, reducing its ability to generalize to the broader population. Filtering out these outliers helps improve the model’s performance and reliability (see the sketch at the end of this section).
- Ensuring Data Quality: Filtering can also be used to exclude incomplete, inconsistent, or erroneous data records. For instance, records with missing values in critical attributes or data points that fail validation checks may be filtered out to maintain the integrity of the dataset.
- Reducing Bias: Filtering can help mitigate biases in the raw data by ensuring a balanced representation of different groups or attributes. This is especially important when training models that interact with sensitive personal data to avoid perpetuating or amplifying societal inequalities.
- Removing Unwanted Content: Sometimes the raw data might contain content that should not be processed further. For example, filtering could be used to remove any content with images of children.
- Improving the Signal-to-Noise Ratio: Finally, filtering can also mean removing noise to improve the quality of the raw data (e.g., removing audio background noise to improve speech recordings).
As with other preprocessing techniques, filtering has limitations. There is always a risk of losing information, especially if filtering is applied aggressively. Filtering can also introduce unintentional bias: removing outliers or specific groups may inadvertently skew the model’s predictions. Ethical considerations apply as well: decisions on what constitutes an “outlier” or “irrelevant data” must be transparent and justifiable, particularly when dealing with sensitive personal information.
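As a minimal sketch, outlier filtering can be implemented with the interquartile-range rule in pandas (the column and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"income": [32000, 41000, 38000, 45000, 39000, 950000]})

# Interquartile-range (IQR) rule: keep values within 1.5 * IQR of the
# first and third quartiles; the extreme value 950000 is filtered out.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
filtered = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```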
4.2.2.2 Normalization
Normalization is another key technique used in data pre-processing. It addresses issues of skewed distributions or bias in the dataset by standardizing input data. For instance:
- Numerical data can be scaled to fit a consistent range or transformed to align with a normal distribution.
- Features such as age or income can be adjusted to eliminate outliers or imbalances that may skew model training.
While normalization is not unique to personal data, it plays an essential role in ensuring fairness and reducing bias in datasets, supporting the creation of models that are more equitable and representative.
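A minimal sketch of two common normalization approaches with pandas (the column is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 45, 31, 67, 52]})

# Min-max scaling: map values into the [0, 1] range.
df["age_minmax"] = (df["age"] - df["age"].min()) / (
    df["age"].max() - df["age"].min())

# Z-score standardisation: zero mean, unit variance.
df["age_zscore"] = (df["age"] - df["age"].mean()) / df["age"].std()
```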
4.2.2.3 Imputation
The final technique in the preprocessing stage is imputation, which deals with handling missing values in the dataset. Missing data is common in real-world datasets, particularly in personal data where individuals may choose not to provide certain information or where errors occur during data collection.
Imputation involves replacing missing values with plausible estimates to maintain the dataset’s completeness. Common approaches include:
- Mean or Median Imputation: Filling missing numerical values with the mean or median of the corresponding feature.
- Predictive Imputation: Using machine learning models to predict and replace missing values based on other features in the dataset.
- Nearest Neighbor Imputation: Filling missing values based on the nearest neighbors’ data.
Imputation ensures that incomplete records are not discarded, which is critical for preserving the integrity and representativeness of the dataset, especially when dealing with small or specialized datasets involving personal data.
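A minimal sketch of mean and median imputation with pandas (columns and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 29, 51],
    "income": [42000, 38000, None, 61000],
})

# Mean imputation for the age column.
df["age"] = df["age"].fillna(df["age"].mean())

# Median imputation is more robust to outliers.
df["income"] = df["income"].fillna(df["income"].median())
```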
4.3 Data analysis stage: feature extraction, data transformation, data augmentation
While in some cases pre-processed data might be enough to form the dataset used to train and validate a ML model, the data analysis stage becomes crucial when, depending on the ML architecture, additional steps such as feature engineering, labeling, or augmentation are required.
The data analysis stage takes the pre-processed data as input and produces features and augmented data as output. These outputs are then passed to the data preparation stage, where they are split into training, test, and validation datasets. Here we give an overview of methods that improve and expand the pre-processed data before the actual training of a ML model.
Data extraction, or “feature extraction”, involves identifying and isolating relevant information from the pre-processed dataset to create meaningful features for the model. This step reduces dimensionality and focuses on the most informative parts of the data.
Data transformation modifies data into a format or structure suitable for analysis. This may involve operations such as scaling, encoding categorical variables, transforming time-series data into frequency domains, or converting speech into transcribed text.
Data labeling assigns labels or annotations to the data, a necessary step for supervised learning tasks. Labeling can be performed manually, semi-automatically, or using pre-trained models. For example a dataset of medical images could be annotated by radiologists to identify which pixels in the images are those related to the diagnosis.
Data augmentation generates additional data by applying transformations to existing data, which can increase the diversity of the dataset without collecting new data. Common techniques include flipping, rotating, or cropping images, or introducing noise into textual or numerical data. For example with a dataset of anonymized handwritten text, augmentation could involve rotating or skewing the characters slightly to mimic different handwriting styles.
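As a minimal sketch, simple image augmentations can be expressed with NumPy array operations (the random array stands in for a real image):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))  # stand-in for a grayscale image

# Common image augmentations: flips, rotations, and noise injection.
flipped = np.fliplr(image)
rotated = np.rot90(image)
noisy = image + rng.normal(scale=0.05, size=image.shape)
```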
Data synthesis creates entirely new data points using statistical models or generative techniques. This is especially useful for generating training data when the original dataset is limited or sensitive.
It is important to mention that there can still be residual privacy risks after the data analysis stage. Sometimes the combination of seemingly unlinked features in the pre-processed data might reveal new features that could lead to the re-identification of individuals or, more generally, to the disclosure of sensitive information that was not visible in the pre-processed dataset. It is therefore important to perform privacy audits on the extracted data as well, and, where necessary, re-apply the same data sanitisation techniques that were used to process the raw data.
4.4 Data preparation
The final stage of the data engineering part of the MLOps is data preparation. This stage takes the features and augmented data generated during the data analysis stage and splits them into three distinct datasets: training data, test data, and validation data.
The training data is the largest subset of the prepared data, used to train the machine learning model by enabling it to learn patterns and relationships between features and target outputs. During training, the model iteratively adjusts its parameters to minimize the error between predictions and actual values using the training data.
The test data is a separate dataset that is withheld during training and used exclusively to evaluate the model’s performance after training is complete. The test data provides an unbiased estimate of the model’s generalization ability, ensuring it performs well on unseen data.
The validation data is a dataset used during model development to fine-tune hyperparameters and prevent overfitting. This dataset is typically used in conjunction with techniques such as cross-validation or grid search to optimize model performance. While the validation data is also used during the model training stage, it should remain independent to ensure unbiased hyperparameter tuning.
4.4.1 Splitting Techniques
Splitting the data into these subsets requires careful consideration to maintain integrity, privacy, and utility:
- Random Splitting: Data is randomly divided into training, test, and validation sets. While simple, this approach works best when the data is uniformly distributed and free of inherent biases.
- Stratified Splitting: If the dataset is imbalanced (e.g., in class labels), stratified splitting ensures that the proportions of different classes are preserved across the training, test, and validation sets. This is particularly important in datasets containing sensitive personal data, where representation can affect fairness (see the sketch below).
- Temporal Splitting: For time-series data, splitting is often done based on time order to prevent information leakage from the future into the training data.
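As a minimal sketch of stratified splitting with scikit-learn (the toy data is illustrative), the split can be performed in two steps to obtain all three subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 90 samples of class 0, 10 samples of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# stratify preserves the 90/10 class ratio in every subset.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)
# Result: 60% training, 20% validation, 20% test data.
```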
4.5 Data after training and after deployment
We have covered the various types of data that ultimately shape the weights of an AI model. This data is primarily used during the pre-deployment phase to train and fine-tune the model. However, the data lifecycle in AI systems extends beyond deployment. When the AI model is in actual use, new data inputs are introduced as users interact with the system.
In this stage, personal data often plays a role. Depending on the type of AI system, users may provide personal data as input to prompt the system to perform a specific task. These inputs can range from images and text to other data types, each potentially containing sensitive information. The challenge here lies in managing these inputs responsibly. Depending on the context, user inputs might need to be minimized to protect personal data. For example, with medical images, some identifiers can be removed to protect patient privacy, but the essential diagnostic details must remain intact. In other cases, such as when using a chatbot for customer support, the AI system can employ filtering mechanisms to minimize personally identifiable information, ensuring that only necessary data is processed.
4.6 Good Data Management Practices
When data is the critical component of a system, adopting good data management practices is essential. This not only ensures the system’s reliability and reproducibility but also addresses cybersecurity concerns, especially when handling sensitive or personal data. Below, we expand on key areas of data management with examples and references to best practices.
4.6.1 Data Quality and Data Appraisal
Data quality refers to the suitability of data for its intended purpose, which includes attributes such as accuracy, completeness, consistency, and relevance. High-quality data is essential for building reliable and fair machine learning models. Whether data quality matters more than data quantity remains an open question, and no single answer fits all types of AI systems.
- Bias Audit and Fairness Testing: Regular audits should be conducted to identify and mitigate biases in the data. For example, fairness testing can be applied to ensure equitable representation of all demographic groups in the training dataset.
- Data Appraisal: Deciding what data to keep and what to reject is a crucial step. Irrelevant, redundant, or outdated data should be excluded, especially if it poses privacy risks or contributes to bias.
4.6.2 Data Versioning
Tracking the evolution of data is essential for ensuring reproducibility in machine learning systems. Reproducibility means that, given the same data and the same code that were used to train a certain AI model in the past, it is possible to re-train the model and obtain exactly the same model weights. With data versioning we are able to create snapshots of the data at a given moment in time. In the context of working with personal data this is even more important: a data subject might request to be removed from the original training dataset. With data versioning we can ensure that the new version of the data no longer contains that data subject’s data.
- Solutions for Data Versioning: Tools like git-annex or DVC (Data Version Control) provide mechanisms to version and track changes in large datasets.
- When Data Cannot Be Versioned: For datasets that cannot be directly versioned due to size or sensitivity, it is critical to version the metadata, which includes information about data sources, transformations, and audit trails (see the sketch below). For example, the large image dataset LAION is a collection of URLs and metadata of the images, rather than the actual images.
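As a minimal sketch of versioning by metadata, a content hash of the dataset can be recorded and committed to version control; the file name is hypothetical, and tools like DVC automate this pattern at scale:

```python
import hashlib
import json
from datetime import datetime, timezone

def dataset_fingerprint(path: str) -> dict:
    """Record a verifiable snapshot of a dataset file as metadata."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return {
        "path": path,
        "sha256": sha256.hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# Commit the fingerprint alongside the training code: a changed hash
# signals that the dataset (e.g. after a data-subject removal request)
# no longer matches the one used for a past training run.
print(json.dumps(dataset_fingerprint("training_data.csv"), indent=2))
```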
4.6.3 Data management of AI model weights
While so far we have described the stages of the MLOps workflow before the actual training, model weights (i.e. the AI model, the output of the training stage) also deserve similar consideration when it comes to good data management practices. This is even more important in machine learning systems trained on personal data, since – depending on the model architecture – some elements from the training data might be memorised in the model. Good practices include:
- Model Cards: Document the model’s purpose, data sources, and performance metrics, ensuring transparency and accountability. This will be described more extensively in following chapters.
- Model Versioning: Version control systems should track changes in model weights and architectures, facilitating debugging and updates. Multiple versions of the same model might also exist not only as snapshots in time, but also in the dimension of numerical precision. This refers to the concept of quantization of model weights: instead of storing the weights with full numerical precision, a smaller number of bits can be used (quantization) balancing the trade-off between model size and model performance.
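As a minimal sketch of the quantization idea with NumPy (a random array stands in for real model weights):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=1000).astype(np.float32)  # stand-in for weights

# Symmetric int8 quantization: map float32 weights onto 256 integer levels.
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 4x smaller storage
dequantized = quantized.astype(np.float32) * scale      # approximate recovery

print("max absolute error:", np.abs(weights - dequantized).max())
```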
We will later also cover how weights can be published (open weights) following practices that are common with open source software. For further reading on the topic of data quality standards in AI systems, ISO/IEC 5259-2:2023 is a good starting point.
4.6.4 Access Control and Good Cybersecurity Practices
We have already briefly touched on the issue of access control when working in systems that could potentially be shared with other users. Access control is fundamental when working with sensitive personal data, both from the data storage perspective and from the computing perspective. The reader should familiarise themselves with common good practices for access control and make sure they are implemented when personal data is used to train or query an AI model.
- Multi-Factor Authentication (MFA): All storage systems should be protected by MFA to minimize risks of unauthorised access.
- Internet Isolation: For highly sensitive data, disconnect computing and storage systems from the internet during processing.
- Cloud and HPC Security Risks: When using cloud providers or high-performance computing (HPC) centers, ensure that contractual agreements include robust data protection measures and that access to the cloud infrastructure is tightly controlled.
Encryption of the data, both in transit and at rest, is also a very good practice to mitigate the risk of data breaches. While encryption comes with a computational overhead for decryption, there are also computing techniques that can natively work with encrypted data (see next chapter). Finally, it is also important to implement regular security audits and penetration tests to identify vulnerabilities of systems that are shared with other users or that are not isolated from the internet.
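As a minimal sketch of encryption at rest, assuming the Python cryptography package (file names are illustrative):

```python
from cryptography.fernet import Fernet

# Symmetric encryption of a dataset at rest; in practice the key would
# live in a secrets manager, never next to the data.
key = Fernet.generate_key()
fernet = Fernet(key)

with open("training_data.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())

with open("training_data.csv.enc", "wb") as f:
    f.write(ciphertext)

# Decrypt only on the machine that performs the training.
plaintext = fernet.decrypt(ciphertext)
```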
4.7 Summary
In this chapter we have covered the data engineering pipeline with a focus on solutions specific to cases dealing with personal data. We also covered good data management practices that should go along with other cybersecurity practices in machine learning systems. These practices should be a core part of any MLOps workflow and are indispensable when working with personal data.
| Question | Options |
|---|---|
| 1. What is the primary goal of data sanitization in the pre-processing stage? | 1) Increase the size of the dataset. 2) Improve statistical properties. 3) Reduce the risk of re-identification. 4) Improve data visualization. |
| 2. Which of the following is NOT a data acquisition method as per ISO/IEC 22989? | 1) First-party data collection 2) Web scraping 3) Third-party data collection 4) Data querying |
| 3. What does the GDPR emphasize regarding the risks of data querying? | 1) It explicitly prohibits data querying. 2) It requires mitigating risks through DPIAs and purpose limitation. 3) It allows data querying only for non-personal data. 4) It does not address data querying. |
| 4. In the context of MLOps, what is the role of stratified splitting during data preparation? | 1) Ensures balanced class representation across subsets. 2) Randomly splits data into training and test sets. 3) Splits time-series data based on time order. 4) Ensures all data is used for training. |
| 5. What is the main purpose of perturbation in data sanitization? | 1) Completely remove identifiers. 2) Transform records while preserving statistical properties. 3) Replace data with broader categories. 4) Identify outliers in the data. |
| 6. Which of the following is an example of a quasi-identifier? | 1) Social security number 2) Name 3) Date of birth 4) Fingerprint |
| 7. What does ISO 27560 provide guidelines for? | 1) Data versioning. 2) Consent management. 3) Feature engineering. 4) Model evaluation. |
| 8. Which method in data pre-processing is used to remove or transform direct identifiers like names? | 1) Perturbation 2) Suppression 3) Filtering 4) Anatomisation |
| 9. What is a key risk when using web-scraped data for AI training? | 1) Increased cost. 2) Lack of proper legal basis under GDPR. 3) Reduced statistical accuracy. 4) Inability to preprocess the data. |
| 10. What is the role of validation data in the MLOps workflow? | 1) Train the AI model. 2) Evaluate model generalization. 3) Optimize hyperparameters. 4) Provide real-world performance feedback. |
Solutions
1. Answer: 3) Reduce the risk of re-identification.
   Explanation: Data sanitization aims to mitigate privacy risks by removing or transforming identifiers and quasi-identifiers.
2. Answer: 2) Web scraping
   Explanation: Web scraping is not explicitly categorized as a data acquisition method in ISO/IEC 22989.
3. Answer: 2) It requires mitigating risks through DPIAs and purpose limitation.
   Explanation: GDPR highlights the need for DPIAs and compatibility with the original purpose when querying data containing personal information.
4. Answer: 1) Ensures balanced class representation across subsets.
   Explanation: Stratified splitting ensures that all subsets maintain proportional representation of different classes, especially in imbalanced datasets.
5. Answer: 2) Transform records while preserving statistical properties.
   Explanation: Perturbation involves introducing randomness to maintain data utility while protecting privacy.
6. Answer: 3) Date of birth
   Explanation: Quasi-identifiers, like dates of birth, can reveal identities when combined with other datasets.
7. Answer: 2) Consent management.
   Explanation: ISO 27560 provides a framework for managing consent in data workflows.
8. Answer: 2) Suppression
   Explanation: Suppression removes direct identifiers, such as names or IDs, from the dataset.
9. Answer: 2) Lack of proper legal basis under GDPR.
   Explanation: Web scraping of personal data often lacks valid consent or legal basis under GDPR.
10. Answer: 3) Optimize hyperparameters.
    Explanation: Validation data is used during model training to fine-tune hyperparameters and prevent overfitting.