Appendix
In this appendix we cover a few additional concepts that did not fit in the main module but that learners may find important to become familiar with. Learners are encouraged to explore further topics on their own.
Taxonomies of privacy risks
There is no single comprehensive taxonomy of privacy risks. The taxonomy presented in Chapter 2 is a synthesis of three taxonomies: the one proposed in ISO 29134, the one proposed in the AEPD “Risk Management and Impact Assessment in Processing Personal Data” (Agencia Española de Protección de Datos 2021), and Solove’s “Taxonomy of Privacy” (Solove 2005), extended with risks for each of the 16 dimensions of privacy.
The table below lists the risks from the three taxonomies.
| From ISO 29134 | From AEPD (Agencia Española de Protección de Datos 2021) | From Daniel J. Solove, “A Taxonomy of Privacy” (Solove 2005) |
|---|---|---|
| unauthorized access to PII (loss of confidentiality) | Operations related to the purposes of processing: risk factors deriving from the stated purpose of the processing and other purposes related to the main purpose (e.g. contact tracing, deciding on data subjects’ control of personal data, profiling, monitoring) | Surveillance: Continuous monitoring in public spaces can lead to a loss of anonymity and chilling effects on free movement. |
| unauthorized modification of the PII (loss of integrity) | Types of data used: risk factors related to the scope of the processing that arise from data collected, processed or inferred in the processing (e.g. financial transactions, special categories of personal data) | Interrogation: Overly invasive questioning by authorities might coerce personal disclosures, violating individual autonomy. |
| loss, theft or unauthorized removal of the PII (loss of availability) | Extent and scope of processing: risk factors related to the scope of the processing, relating to the number of data subjects concerned, the diversity of data or aspects processed, the duration in time, the volume of data, the geographical extent, the exhaustiveness on the person, the frequency of collection, etc. (e.g. large number of subjects, large-scale processing) | Aggregation: Combining data across sources can reveal patterns, leading to profiling or unintended exposure of identity. |
| excessive collection of PII (loss of operational control) | Categories of data subjects: risk factors related to the scope of the processing related to the category of data subjects, such as employees, minors, elderly people, persons in a situation of vulnerability, victims, disabled people, etc. (e.g. children under 14 years old, vulnerable subjects) | Identification: Linking anonymous data to individuals can result in privacy breaches and potential harm from misuse. |
| unauthorized or inappropriate linking of PII | Technical factors of processing: risk factors that arise from the nature of the processing when implemented with certain technical characteristics or technologies (e.g. internet of things, video surveillance, automated processing) | Insecurity: Poor security of stored data increases risks of breaches and unauthorized access. |
| insufficient information concerning the purpose for processing the PII (lack of transparency) | Data collection and generation: risk factors that arise from the nature of the processing when data are specifically collected or generated (e.g. combinations of datasets) | Secondary Use: Using data for purposes other than originally intended without consent can erode trust. |
| failure to consider the rights of the PII principal (e.g. loss of the right of access) | Side effects of processing: risk factors that arise from the processing context, as consequences may occur that were not foreseen in the original intended purposes of the processing (e.g. unauthorised re-identification, identity theft, reputational damage) | Exclusion: Denying individuals the right to access or correct their information can lead to misinformation and unfair outcomes. |
| processing of PII without the knowledge or consent of the PII principal (unless such processing is provided for in the relevant legislation or regulation) | Category of controller/processor: context-related risk factors specific to the sector of activity, business model or type of entity (e.g. hospitals, financial institutions) | Breach of Confidentiality: Unauthorized disclosure of confidential information risks harm to reputation and personal relationships. |
| sharing or re-purposing PII with third parties without the consent of the PII principal | Data disclosure: risk factors that arise from the context in which data disclosures are made to third parties within the framework of the processing (e.g. regular transfers to other countries without adequate protection) | Disclosure: Public release of private information can lead to embarrassment, harassment, or discrimination. |
| unnecessarily prolonged retention of PII | Data breaches: risk factors that arise from the possible materialisation of personal data breaches | Exposure: Making sensitive details accessible to others may violate personal boundaries and cause distress. |
| | | Increased Accessibility: Easy access to personal data online can invite misuse or unauthorized surveillance by others. |
| | | Blackmail: Threatening to reveal private information can coerce individuals into unwanted actions. |
| | | Appropriation: Using personal likeness or data for commercial gain without consent undermines autonomy and rights. |
| | | Distortion: Misrepresentation of personal information can damage reputation and create misunderstandings. |
| | | Intrusion: Physical or digital invasion into private spaces disrupts privacy and can induce fear or discomfort. |
| | | Decisional Interference: Intervening in personal decision-making processes infringes on autonomy and personal agency. |
Artificial Intelligence Explainability
Another important domain from Slattery et al. (2024), strongly related to the GDPR principles of fairness and transparency, is 7.4 “Lack of transparency or interpretability”, which leads us to the concept of AI explainability. For more in-depth considerations, please refer to Leslie et al. (2024).
AI explainability refers to the degree to which a system or set of governance practices and tools support a person’s ability to:
- Explain and communicate the rationale behind the behavior of the AI system.
- Demonstrate that the processes behind its design, development, and deployment ensure sustainability, safety, fairness, and accountability across various contexts of use.
Explainability involves providing clear reasons for:
- The outcomes produced by the algorithmic model (whether used for automated decisions or as inputs for human decision-making).
- The processes used to design, develop, and deploy the AI system.
AI explainability is crucial for several reasons:
- Trust and accountability: Explainability helps build trust in AI systems by allowing users and other relevant parties to understand why certain decisions were made and to hold developers and organizations accountable.
- Ethical compliance: It ensures that AI systems comply with ethical standards by making sure the processes and outcomes can be communicated and justified transparently.
- Fairness and safety: Providing explanations allows relevant parties to verify that AI systems are fair, safe, and not biased against certain individuals or groups.
- Regulatory requirements: In many sectors, explainability is a regulatory requirement to ensure that AI systems are transparent, particularly when they involve sensitive decisions about people’s lives.
The overall goal is to ensure that AI systems can be understood and trusted by a wide range of audiences, including those affected by their decisions.
Machine Learning algorithms and their explainability
This table is taken from the Annex of Leslie et al. (2024).
| Machine Learning Method | Explainability | Notes |
|---|---|---|
| Linear Regression (LR) | High | Simple, transparent model. Easy to interpret feature importance. |
| Logistic Regression | High | Similar to LR but applied to classification tasks. Interpretability is clear. |
| Generalized Linear Model (GLM) | Moderate | Extension of LR, handles non-normal distributions but may lose some transparency. |
| Generalized Additive Model (GAM) | High | Non-linear relationships still explainable through graphical representation. |
| Decision Tree (DT) | High | Interpretable as long as tree depth is manageable. |
| Rule/Decision Lists and Sets | High | Easy to follow but large lists may reduce interpretability. |
| K-Nearest Neighbors (KNN) | Moderate | Intuitive but less explainable for larger datasets. |
| Naive Bayes | High | Assumes feature independence; effective but can oversimplify relationships. |
| Support Vector Machines (SVM) | Low | Complex, especially in high dimensions; difficult to interpret decision boundaries. |
| Random Forest | Low | Aggregate of many decision trees, making overall model difficult to explain. |
| Artificial Neural Network (ANN) | Very Low | Highly non-linear with numerous parameters, making it a black-box model. |
| Ensemble Methods | Low | Combines several models, often leading to opacity. May require external interpretability tools. |
| Case-Based Reasoning (CBR) | High | Highly interpretable due to example-based reasoning. |
| Supersparse Linear Integer Model (SLIM) | High | Sparse model, interpretable by design with simple arithmetic calculations. |
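To make the contrast in the table concrete, the sketch below (in Python, assuming scikit-learn is available; the dataset and feature names are purely illustrative) compares a shallow decision tree, whose rules can be printed and read directly, with a random forest, whose behaviour can only be summarised coarsely without external interpretability tools.

```python
# A minimal sketch (synthetic data, hypothetical feature names) contrasting an
# intrinsically interpretable model with an opaque ensemble. Assumes scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
feature_names = ["age", "income", "tenure", "num_visits"]  # illustrative names

# Shallow decision tree: its decision rules can be printed and read directly.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))

# Random forest: an aggregate of many trees; only coarse summaries such as
# feature importances are directly available, so external tools (e.g. SHAP)
# are often needed for case-by-case explanations.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, importance in zip(feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```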
Basics of privacy enhancing technologies
When introducing PETs we need some concepts related to the types of personal data that are more nuanced than the simple definition from the GDPR. In this section we provide a reference on the terminology used in other articles and regulations. Readers are encouraged to explore on their own what each technique does.
While the definition of personal data covers all the possible types of data that can relate to an individual, in practice not all types of personal data are equal. The landscape of types of personal data and their definitions is not homogeneous. In this section we try to cover some concepts and make clear distinctions. Learners who want to go deeper into this topic should consider reading Jarmul (2023).
Personally identifiable information (PII): also called “direct identifiers” or “strong identifiers”. Examples include an individual’s full name, email address, phone number, social security number, and essentially anything that can be used as a fingerprint to uniquely re-identify an individual. Biometric data (actual fingerprints, the shape of the ear, the waveform of the heart signal, iris scans, voice, gait) is a particularly strong type of PII that can be used to uniquely identify an individual. It is already clear that different types of PII provide different levels of “strength” of re-identification.
Quasi-identifiers: also called “indirect identifiers”. These are types of personal data that, in isolation, might not constitute PII able to re-identify an individual with high confidence; however, a collection of quasi-identifiers from an individual can constitute a fingerprint. A well-known paper by Rocher et al. (Rocher, Hendrickx, and De Montjoye 2019) has shown that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. The picture below is an intuitive explanation from Zaman, Obimbo, and Dara (2017) of how a large number of quasi-identifiers could allow linking different types of data to re-identify a subject’s identity.
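As a complementary, hands-on illustration, the following minimal sketch (Python with pandas; all names, columns and values are fabricated) shows how a dataset stripped of direct identifiers can still be linked back to named individuals by joining on shared quasi-identifiers.

```python
# Toy linkage attack: a "de-identified" health record is re-identified by joining
# on quasi-identifiers shared with a public, named dataset. All data is fabricated.
import pandas as pd

# Public dataset that contains names (e.g. a voter roll).
voters = pd.DataFrame({
    "name":       ["Alice Rossi", "Bob Meyer"],
    "zip_code":   ["02139", "02139"],
    "birth_date": ["1985-03-12", "1990-07-01"],
    "gender":     ["F", "M"],
})

# "De-identified" dataset: the name was removed, but quasi-identifiers remain.
health = pd.DataFrame({
    "zip_code":   ["02139"],
    "birth_date": ["1985-03-12"],
    "gender":     ["F"],
    "diagnosis":  ["hypertension"],
})

# Joining on the quasi-identifiers links the diagnosis back to a named individual.
linked = health.merge(voters, on=["zip_code", "birth_date", "gender"])
print(linked[["name", "diagnosis"]])
```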
Pseudonymous data refers to personal data that has been processed in such a way that it can no longer be attributed to a specific individual without the use of additional information. This additional information is kept separately and is subject to technical and organizational measures ensuring that the data cannot be re-attributed to an identified or identifiable person. In some cases the additional information is destroyed, but it can nonetheless be re-obtained (e.g. by collecting a new fingerprint).
Anonymous data refers to information that has been irreversibly de-identified in such a way that the data subject is no longer identifiable. Once the data is anonymised, it is no longer considered personal data under the GDPR. What is important to understand is that the GDPR adopts the absolute view on anonymisation, i.e. data is anonymous if it is impossible to re-identify any individual with current technological means. This means that anonymisation is an irreversible operation that destroys information so that no single individual can be re-identified.
An example of anonymization would be removing all identifying information, such as names, birth dates, or any other combination of quasi-identifiers, from a dataset and applying further techniques like generalization or data masking to prevent re-identification.
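The sketch below illustrates those steps on a toy dataset (Python with pandas; column names and generalization bands are hypothetical). Note that these transformations alone do not guarantee anonymity; a formal assessment of re-identification risk is still needed.

```python
# Illustrative generalization step: drop direct identifiers and coarsen
# quasi-identifiers. Column names, bands and values are hypothetical; this alone
# does not guarantee anonymity without a re-identification risk assessment.
import pandas as pd

records = pd.DataFrame({
    "name":      ["Alice Rossi", "Bob Meyer", "Carla Silva"],
    "age":       [34, 41, 29],
    "zip_code":  ["02139", "02141", "02139"],
    "diagnosis": ["hypertension", "asthma", "diabetes"],
})

anonymized = records.drop(columns=["name"])              # remove direct identifiers
anonymized["age"] = pd.cut(anonymized["age"],            # generalize age into bands
                           bins=[0, 30, 40, 50, 120],
                           labels=["<=30", "31-40", "41-50", ">50"])
anonymized["zip_code"] = anonymized["zip_code"].str[:3] + "**"  # mask ZIP codes
print(anonymized)
```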
Pseudonymization, on the other hand, involves replacing or obscuring personal identifiers with pseudonyms (such as random strings of numbers or letters) to make it more difficult to identify individuals. However, unlike anonymization, pseudonymization does not completely eliminate the risk of re-identification. The original data can often be re-linked to individuals if the pseudonyms are reversible, typically by someone with access to the “key” that can decrypt or reverse the pseudonymization. Therefore, pseudonymized data is still considered personal data under the GDPR, as it remains possible to re-identify individuals with additional information.
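As an illustration of the mechanics, one common way to pseudonymize data is to replace direct identifiers with a keyed hash, the secret key playing the role of the “additional information” kept separately. The sketch below uses only the Python standard library; the key handling is purely illustrative.

```python
# Illustrative pseudonymization with a keyed hash (HMAC-SHA256). The secret key is
# the "additional information" kept separately: whoever holds it can re-compute the
# identifier-to-pseudonym mapping. Key handling here is deliberately simplistic.
import hmac
import hashlib

SECRET_KEY = b"store-this-key-separately-and-securely"  # illustrative only

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a stable, key-dependent pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "alice@example.org", "diagnosis": "hypertension"}
pseudonymized = {"subject_id": pseudonymize(record["email"]),
                 "diagnosis": record["diagnosis"]}
print(pseudonymized)
# With the key, the same email always maps to the same subject_id, so records can
# be re-linked; without it, reversing the pseudonym is computationally infeasible,
# yet the data is still personal data under the GDPR.
```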
In practice, pseudonymization is often used to reduce the privacy risks associated with personal data while still allowing for data utility and further processing. It is particularly useful in scenarios where data needs to be shared or analyzed while limiting the exposure of directly identifying information. On the other hand, anonymization is applied when the goal is to completely eliminate the possibility of re-identifying individuals, often when sharing data publicly or in open-access scenarios.
The choice between these two approaches depends on the context and the intended use of the data. While pseudonymization provides flexibility in data processing with some privacy protection, anonymization offers a stronger form of data protection by making re-identification impossible, though it may limit the ability to link data back to specific individuals for future analyses.