1 Elements of Artificial Intelligence, Data Protection, Cybersecurity
After completing this chapter you will:
- Acquire the basic vocabulary of AI, data privacy, and cybersecurity.
- Understand the fundamental principles of AI, its applications, and how it compares to more traditional deterministic approaches.
- Understand the trade-offs between AI performance, data protection, and cybersecurity.
In this chapter we will cover the basic principles and definitions for the three domains that we are studying in this course: artificial intelligence, data protection in the EU, and cybersecurity.
1.1 What is Artificial Intelligence?
Multiple definitions of Artificial Intelligence (AI) have been proposed since the “Meetings of the Minds” workshop in 1956, which is considered to have produced the first recorded definition of AI. In general, AI is an umbrella term for a series of computational methods and systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages. AI can also be defined as the science and technology of computing systems that can autonomously solve complex inferential problems, often mimicking tasks that humans can perform but computers traditionally could not, such as image recognition or decision-making. AI can be approached through various methods, including rule-based systems where domain experts manually craft sets of rules for the computer to follow. In contrast, Machine Learning (ML), originally a sub-field of AI, focuses on developing algorithms that enable computers to automatically learn patterns and predictive models from data, making AI systems more adaptive and scalable. For example, instead of programming explicit rules to recognise handwritten digits, ML can train a model using thousands of labelled images to learn the patterns on its own. While ML is a crucial tool within AI, AI encompasses much more than ML, including areas like natural language processing (NLP), robotics, and expert systems. Meanwhile, ML can also extend beyond AI applications, being fundamental in fields like predictive analytics, where it may be used without invoking broader AI concepts. Figure 1.1 visually summarises the various sub-branches of AI, with particular focus on ML.
1.1.1 AI from data: data science versus machine learning
While not all AI is built from data, the most useful and popular AI applications are all based on learning patterns from data: whether it is our smartphone that has learned visual features from our face to unlock itself, or a Large Language Model (LLM) in an AI system like ChatGPT that has learned textual patterns from a huge number of books, transcripts, and scraped content, all distilled into its deep neural network.
How does AI work in practice? Compared to deterministic data science methods, AI uses the methodological approaches of Machine Learning (ML) to derive patterns from the training data. The training data is a dataset curated so that the machine learning algorithm can learn the patterns in the dataset and adapt its model weights to perform a certain task.
For example, let’s imagine that we have a dataset of systolic blood pressure measurements for a set of patients. For the sake of simplicity, let’s assume that those with systolic blood pressure of 140 mm Hg or higher are labelled “high risk” (of cardiovascular disease) and those with lower blood pressure are labelled “normal” (things are more complicated than this with cardiovascular diseases, but this is just a simple example). With a data science approach, we could create a deterministic rule-based system so that, in pseudocode:
IF BP >= 140 THEN
    label = "high risk"
ELSE
    label = "normal"
END IF
This deterministic system would simply process each data subject and generate the label according to the rule. The system would be robust, and the rule would be explainable and readable by a person. The ML approach to the same problem would be to choose a specific ML algorithm (in this case a classifier) and train it on existing data: given numerical values of systolic blood pressure and the corresponding labels, the ML algorithm adapts its model weights. Then, when a new dataset is used – the test dataset – the AI system should predict that a subject with blood pressure of 140 mm Hg or higher receives the label “high risk”.
Suppose that the training dataset does not contain any blood pressure measurements equal to 138, 139, or 140 mm Hg: would the ML model still find the same rule as the deterministic system?
The ML model could set 138 as the threshold separating normal from high blood pressure subjects. The bias present in the data is reflected in the potential diagnostic errors produced by this AI system.
In this toy example we can immediately see a core limitation of ML approaches: if the data in the training set does not cover all the possible cases that we want to handle, the robustness and accuracy of the ML output might not be the desired one.
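To make this concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; the data are synthetic and purely illustrative) of how a depth-1 decision tree trained on blood pressure values with a gap around 140 mm Hg can end up with a decision threshold inside that gap:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic training data: no samples in the 138-140 mm Hg interval
bp_train = np.concatenate([rng.integers(100, 138, size=50),
                           rng.integers(141, 190, size=50)])
labels = np.where(bp_train >= 140, "high risk", "normal")

# A depth-1 decision tree learns a single split point from the data
model = DecisionTreeClassifier(max_depth=1)
model.fit(bp_train.reshape(-1, 1), labels)

# The learned split can land anywhere between 137 and 141, because
# the training data says nothing about that interval
print(model.tree_.threshold[0])   # e.g. 139.0 rather than exactly 140
print(model.predict([[139]]))     # the prediction inside the unseen gap
```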
Let’s now extend our example so that, rather than a single measurement for each individual, a collection of data is used to train the ML model: for example, we might have information about weight, height, history of heart conditions, blood exam values, diet habits, sport habits, smoking, and so on. The ML algorithm could use all these data and learn how they are associated with the final label “high risk” or “normal” as in the example before. Due to the richness of the data, it would become more difficult to explain how the machine learning model operates, but it could possibly define a more precise diagnosis for the patient than a human doctor, who would need to evaluate all these data according to their knowledge and experience. To understand how AI systems learn from data, let’s explore the various machine learning methods that form the backbone of modern AI. It is important to keep in mind that AI includes approaches beyond ML prediction methods, with techniques such as rule-based systems, symbolic reasoning, and search algorithms that also try to mimic human intelligence. Some examples of these are provided in Chapter 11.
1.1.2 Machine Learning Methods
Machine learning methods provide the techniques to teach computers how to learn from data. There are three broad categories of ML methods (International Organization for Standardization 2022b):
- Supervised Learning: the AI model is trained using labelled data: each input data item is paired with the correct output, and the model learns to predict the output from the input. For example, in a dataset of images of faces labelled “happy” or “sad”, the model learns how to classify new pictures based on this labelled data. Commonly used algorithms: linear or logistic regression, decision trees, support vector machines, neural networks. Examples of applications: spam detection, house price prediction (predicting house prices based on features like size and location), medical diagnosis (classifying whether a patient has a certain disease based on symptoms).
- Unsupervised Learning: the AI model is trained on data without labelled outcomes; the model tries to find patterns and structures in the data on its own. For example, in a dataset of customer purchases from an online store, the model can group similar customers together based on their purchase patterns (see the clustering sketch after this list). Commonly used algorithms: clustering (e.g. K-means, hierarchical), principal component analysis, autoencoders (neural networks). Examples of applications: customer segmentation, anomaly detection, market basket analysis (finding associations between products, like which items are often bought together).
- Reinforcement Learning: Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent takes actions and receives feedback in the form of rewards or penalties, with the goal of maximizing its cumulative reward over time. Unlike supervised learning, where the model learns from labeled data, reinforcement learning focuses on learning from the consequences of actions. For example, a robot might learn to navigate a maze by receiving positive rewards for moving toward the exit and negative rewards for hitting walls. Commonly used algorithms: Q-learning, Deep Q-Networks (DQN), Policy Gradient Methods, Actor-Critic Methods. Examples of applications: Game AI (e.g., AlphaGo, which learned to play Go), robotics (e.g., robots learning to walk or pick up objects), autonomous vehicles (learning to navigate roads safely), recommendation systems (e.g., learning to recommend content based on user interaction).
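As an example of unsupervised learning, here is a minimal customer segmentation sketch (assuming scikit-learn and NumPy; the customer features and values are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical customers: [purchases per year, average basket value in EUR]
customers = np.vstack([
    rng.normal([5, 20], [2, 5], size=(50, 2)),    # occasional small buyers
    rng.normal([40, 80], [5, 15], size=(50, 2)),  # frequent big spenders
])

# Scale features so both contribute equally to the distance computation
X = StandardScaler().fit_transform(customers)

# K-means groups the customers into 2 clusters without ever seeing a label
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # cluster id per customer
```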
There are other categories of ML methods that blur the boundaries between the three categories above:
Semi-supervised Learning: Semi-supervised learning is a blend of supervised and unsupervised learning. In this approach, the AI model is trained using a dataset that contains a small amount of labeled data and a large amount of unlabeled data. The labeled data helps the model learn initial patterns, while the unlabeled data allows it to generalize better to new examples. For example, a model might be trained on a large dataset of images where only a few images are labeled with categories like “dog” or “cat,” and the rest are unlabeled. The model learns to identify patterns in both labeled and unlabeled images, improving its accuracy without needing a fully labeled dataset. Commonly used algorithms: Graph-based algorithms, self-training, deep neural networks (e.g., semi-supervised GANs). Examples of applications: Medical image analysis (where labeling every image is costly), web page classification, speech recognition.
Self-supervised Learning: Self-supervised learning is a type of learning where the model generates its own labels from the data, typically by predicting parts of the data from other parts. It is often used as a way to pre-train models on large, unlabeled datasets, which can then be fine-tuned on smaller labeled datasets. For instance, in natural language processing, models like GPT (Generative Pre-trained Transformer) are pre-trained using self-supervised learning by predicting the next word in a sentence. In this approach, no manually labeled data is needed, as the model creates its own task based on the data itself (e.g., predicting missing words in a sentence). Commonly used algorithms: Transformers (e.g., GPT, BERT), contrastive learning algorithms. Examples of applications: Pre-training large language models like GPT for text generation, BERT for sentence understanding, image representation learning where parts of an image are masked and the model is trained to reconstruct them.
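To see how self-supervised learning derives labels from the data itself, here is a minimal sketch (plain Python, no external libraries; the sentence is illustrative) that turns raw, unlabelled text into (context, next word) training pairs of the kind used to pre-train language models:

```python
# Raw, unlabelled text: no human annotation involved
text = "the model learns to predict the next word in a sentence".split()

# Each position yields a training pair: the words so far -> the next word
pairs = [(text[:i], text[i]) for i in range(1, len(text))]

for context, target in pairs[:3]:
    print(" ".join(context), "->", target)
# the -> model
# the model -> learns
# the model learns -> to
# A language model is then trained to predict `target` from `context`;
# the labels come from the data itself, not from manual annotation.
```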
Consider an AI system with an ML model that is able to classify emotions from pictures of faces. Do you think this AI system is truly able to infer emotions? Are there any privacy risks?
This is a very difficult example and touches on an important privacy risk of AI systems: “Phrenology / Physiognomy” (inferring personality, social, and emotional attributes about an individual from their physical attributes; Lee et al. 2024). AI systems performing biometric categorization (assigning natural persons to specific categories on the basis of their biometric data, AI Act Recital 16) have been criticised for violating the fundamental right to dignity and for being pseudoscience (Andrews, Smart, and Birhane 2024).
Notes
- For the learner: if you are interested in these topics, please start with the references linked in this section.
- For the instructor: this is a good task also for a homework based on a reading assignment with one or more of the references mentioned here.
The broad categories of ML paradigms are implemented with algorithms. What makes AI more challenging compared to other deterministic data science approaches is the fact that, due to the complexity of certain algorithms, developers and users are not able to explain how the algorithm comes to produce a certain output. Explainable AI is an important topic of research, and in the context of AI regulation and risk, algorithms have been categorised depending on the level of explainability they can provide. The less explainable the algorithm, the higher the chance that the AI system could be considered high-risk or prohibited according to the AI Act (more on this in Chapter 2).
1.1.3 AI systems and AI models
In the sections above we have often mentioned AI systems and AI models. It is important to understand that they are not the same thing: you might be developing an AI model but not deploying it into an AI system, and, vice versa, you might be building an AI system while merely reusing existing AI models (or systems) that you did not develop yourself.
Let’s have a look at the definition in Article 3 of the AI Act:
‘AI system’ means a machine-based system that is designed to operate with varying levels of autonomy, and that may exhibit adaptiveness after deployment, and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.
An AI system has three main components: input (perception), processing (reasoning/decision making), and output (actuation). An AI system can be used by a user or by another application to perform a certain task based on the AI model weights and software implementation. To grasp the difference intuitively, the AI model is like the engine of a car: it can be very powerful, but without the rest of the car (AI system) it is just a large static dataset.
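To make the engine-and-car analogy concrete, here is a schematic sketch (entirely illustrative, not drawn from the AI Act) in which a trivial stand-in “model” is wrapped by a “system” that adds input handling and output actuation:

```python
class RiskModel:
    """The 'engine': a trivial stand-in for trained model weights."""
    def __init__(self, threshold: float = 140.0):
        self.threshold = threshold  # in a real model, learned parameters

    def infer(self, systolic_bp: float) -> str:
        return "high risk" if systolic_bp >= self.threshold else "normal"


class RiskAssessmentSystem:
    """The 'car': perception, reasoning, and actuation around the model."""
    def __init__(self, model: RiskModel):
        self.model = model

    def handle_request(self, raw_input: str) -> str:
        bp = float(raw_input)                   # perception: parse the input
        label = self.model.infer(bp)            # reasoning: run the model
        return f"Patient flagged as: {label}"   # actuation: emit a decision


print(RiskAssessmentSystem(RiskModel()).handle_request("152"))
```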
The AI Act does not define AI models; however, we can reuse the definition from ISO standards:
‘AI model’: an AI (or ML) model is a mathematical construct that generates an inference or prediction, based on input data or information (see International Organization for Standardization 2022b, sec. 3.2.11)
The AI Act does, however, have a definition for General-Purpose AI models (GPAI); we will get back to this in Chapter 2.
The AI Act definitions are largely based on OECD (2024) and they are visually summarised in Figure 2.1.
1.1.4 Real use cases of AI systems
There are various types of AI systems; some can involve more privacy risks than others, especially if they are used to process certain categories of personal data. In Chapter 2 we will go deeper into the intersection between AI systems, ethics, privacy, and risk assessment. Here are some possible use cases of AI systems:
AI for Code Assistance: An AI-powered tool helps software developers by suggesting code snippets, debugging errors, and optimizing performance. It uses pre-learned programming patterns to assist coders but doesn’t require any personal information from users.
Personalized Movie Recommendations: A streaming service uses AI to analyze your viewing history and recommend new movies or TV shows based on your preferences. The system tracks your behavior over time, including which movies you watch, skip, or search for.
AI-Powered Fitness Tracker: A wearable fitness tracker uses AI to monitor your physical activity, track your heart rate, and provide personalized workout recommendations. It collects data about your daily routines, health metrics, and even your location when you’re running or cycling.
AI for Smart Homes: An AI system integrated into a smart home controls lighting, temperature, and security features. It learns from your daily habits to optimize energy usage and improve comfort. The system has access to your home environment, routines, and potentially records video and audio within the house for security.
AI-Driven Virtual Assistant: A virtual assistant, like Siri or Alexa, uses AI to handle voice commands, schedule appointments, and answer questions. It continuously listens to your conversations, analyzes your voice patterns, and has access to your calendar, emails, and personal notes to provide more tailored services.
AI in Autonomous Vehicles: Autonomous vehicles rely on AI to navigate roads, avoid obstacles, and make real-time decisions. They collect data from external sensors and cameras, but they also analyze data about passengers, such as their location, destination, and sometimes even their in-car conversations.
AI in Predictive Healthcare: AI systems used in hospitals can predict patient outcomes based on vast amounts of medical data, including diagnosis, treatment history, and genetics. These systems can help doctors make decisions but require extensive access to sensitive medical records, test results, and personal health information.
AI in Facial Recognition Surveillance: A facial recognition system powered by AI is used for security purposes in public spaces. It scans faces in real-time to identify individuals, matching them against a database of known people. These systems have access to biometric data and can track individuals’ movements, raising significant privacy and ethical concerns.
Do you see a pattern in the order in which the systems are presented?
The AI systems are shown in increasing order of risk towards individuals. The last AI system is one of the prohibited AI systems according to the AI Act.
1.1.5 Learn more about the basics of AI and ML
This section only briefly covered the elements of AI. The learner who wants to explore the topic further should consider taking specialised online courses like the MOOC Elements of AI. Sebastian Raschka has great learning materials on the topic of ML (with Python); please see the Courses page on Sebastian Raschka’s website. Next, we will introduce the basic principles of data protection in the context of the EU General Data Protection Regulation.
1.2 What is personal data?
In the previous section you familiarised yourself with the concepts of Artificial Intelligence and Machine Learning, and how they compare with more deterministic approaches. The second pillar of our course is Personal Data (PD). Let’s have a look at the definition of personal data according to Article 4 of the General Data Protection Regulation:
“‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;”
Personal data is thus any data about a living individual; however, depending on the type of data, we can introduce some more useful definitions.
1.2.1 Principles of the GDPR
The General Data Protection Regulation (GDPR) is structured around seven fundamental principles, as outlined in Articles 5–11, that govern how personal data should be processed and protected. These principles are essential for ensuring that personal data is handled in a lawful, fair, and transparent manner while also safeguarding individuals’ rights and freedoms. For cybersecurity technologists and AI system providers, particularly those dealing with high-risk AI systems as defined under the AI Act, adhering to these principles is crucial throughout the lifecycle of personal data, from collection to storage, processing, and eventual deletion or anonymization. Below is a table summarizing the key principles, along with examples of how they apply to AI system providers and deployers.
| GDPR Principle | Definition | Example for an AI system provider/deployer |
|---|---|---|
| Lawfulness, Fairness, and Transparency | Personal data should be processed legally, fairly, and in a transparent manner. | Ensure users are informed about the AI system’s data collection and processing through transparent privacy notices, and ensure that the purpose of the system is lawful. |
| Purpose Limitation | Data should be collected for specific, explicit, and legitimate purposes and not processed in a manner incompatible with those purposes. | Limit AI model access to data strictly relevant to its function, preventing unintended uses. |
| Data Minimization | Data processing should be adequate, relevant, and limited to what is necessary for the intended purposes. | Design AI models to operate on minimal personal data, reducing unnecessary collection. |
| Accuracy | Personal data should be accurate and kept up to date. Inaccurate data should be erased or rectified without delay. | Incorporate mechanisms in the AI system to flag and correct outdated or incorrect data in real-time. |
| Storage Limitation | Data should be kept in a form that permits identification of data subjects for no longer than necessary. | Set automated retention periods to delete or anonymize data used by the AI system after its purpose is fulfilled. |
| Integrity and Confidentiality (Security) | Personal data should be processed in a manner that ensures appropriate security, including protection against unauthorized access and data breaches. | Implement strong encryption and access controls within AI systems to protect against data breaches. |
| Accountability | The data controller is responsible for, and must be able to demonstrate, compliance with the GDPR principles. | Maintain thorough documentation of data processing steps within the AI system, ready for audits or compliance checks. |
1.2.2 The Rights of the Data Subject
Under the GDPR, data subjects have specific rights regarding their personal data, which must be respected by organizations processing this data, including AI system providers and deployers. These rights ensure individuals have control over their personal data and can exercise these rights at any point during the data processing lifecycle. For AI systems, particularly those classified as high-risk under the AI Act, it’s crucial to integrate mechanisms that respect these rights, as non-compliance can lead to significant legal and financial penalties. The relevant GDPR articles (Art. 15–22) outline these rights, which include access to personal data, correction of inaccuracies, and the ability to object to certain forms of data processing. The following table summarizes the key rights and provides short examples of how AI system providers can implement measures to comply.
| Data Subject Right | Description | Example for an AI system provider/deployer |
|---|---|---|
| Right to Access (Art. 15) | Individuals have the right to access their personal data and obtain information about how it is being processed. | Provide users with a secure portal to view the personal data processed by the AI model and detailed information about how it is used in decision-making. |
| Right to Rectification (Art. 16) | Individuals can request the correction of inaccurate or incomplete personal data. | Allow users to easily request corrections to personal data used by the AI model, ensuring timely updates to the data and retraining of models if necessary. |
| Right to Erasure (“Right to be Forgotten”) (Art. 17) | Individuals can request the deletion of their personal data in certain circumstances. | Implement a system-wide data deletion process that permanently removes personal data from AI models and databases upon valid user requests. |
| Right to Restriction of Processing (Art. 18) | Individuals can request the restriction of their data processing under certain conditions. | Integrate a functionality that pauses the processing of personal data within the AI system while retaining the data securely until the restriction is lifted. |
| Right to Data Portability (Art. 20) | Individuals have the right to receive their personal data in a structured, commonly used format and transmit it to another controller. | Provide an export feature allowing users to download their data used by the AI system in common formats like JSON or CSV. |
| Right to Object (Art. 21) | Individuals can object to the processing of their personal data, including for direct marketing purposes. | Offer a simple opt-out mechanism in the AI system that halts data processing when a user objects, especially for activities like profiling or targeted advertising. |
| Right not to be Subject to Automated Decision-Making (Art. 22) | Individuals have the right not to be subject to decisions based solely on automated processing, including profiling, that produce legal or similarly significant effects. | Implement a human-in-the-loop process, ensuring that users can request a manual review of any high-impact decisions made by the AI system. |
Do LLMs contain personal data?
Please discuss with the instructor or with your peers the following questions:
- Do LLMs contain personal data and how can it be proved?
- How would the rights of the data subjects be implemented with LLMs?
Note for the instructor: This can also be a homework assignment with further readings such as the “EDPB opinion on the use of personal data for the development and deployment of AI models” (European Data Protection Board (EDPB) 2024) or “Discussion Paper: Large Language Models and Personal Data” (The Hamburg Commissioner for Data Protection and Freedom of Information 2024).
1.2.3 Legal Bases of the GDPR
Under the GDPR, the processing of personal data must have a valid legal basis. These legal bases are outlined in Article 6 of the GDPR and provide the lawful grounds under which data can be collected, processed, and stored. For AI system providers and deployers, especially those handling high-risk systems, it’s crucial to understand which legal basis applies to their operations. Depending on the purpose of the AI system, different legal grounds may be used, such as user consent or legitimate interest. However, not all legal bases may be suitable, particularly for high-risk systems involving sensitive data. Below is a table that summarizes each legal basis and provides examples of how they can—or cannot—be applied by AI providers or deployers.
| Legal Basis | Description | Example for an AI system provider/deployer |
|---|---|---|
| Consent (Art. 6(1)(a)) | The data subject has given explicit consent for their personal data to be processed for specific purposes. | Obtain user consent before processing personal data for personalized recommendations in an AI-powered application. Consent must be clear, freely given, and revocable at any time. |
| Contractual Necessity (Art. 6(1)(b)) | Data processing is necessary for the performance of a contract with the data subject. | Use personal data to fulfill the terms of a contract, such as processing a user’s data in an AI-based financial service that they have signed up for. |
| Legal Obligation (Art. 6(1)(c)) | Data processing is necessary for compliance with a legal obligation. | When using AI for fraud detection in compliance with financial regulations. The AI system processes personal data to fulfill obligations under specific legal frameworks. |
| Vital Interests (Art. 6(1)(d)) | Data processing is necessary to protect someone’s life or prevent serious harm. | Rarely Applies: This may apply in limited scenarios, such as an AI system used in emergency healthcare situations to prevent immediate harm. Typically, this legal basis is not applicable for most AI systems. |
| Public Interest (Art. 6(1)(e)) | Data processing is necessary for tasks carried out in the public interest or exercise of official authority. | Rarely Applies: Generally applicable when AI systems are deployed by governmental bodies for public interest tasks, such as AI systems used in law enforcement. This would not apply to most private AI providers. |
| Legitimate Interests (Art. 6(1)(f)) | Data processing is necessary for the legitimate interests of the controller or a third party, provided it does not override the data subject’s rights. | It is currently debated under which conditions this legal basis can be lawfully used by AI providers. |
1.2.4 Special categories of personal data
While all data about a single individual is personal data, certain types of personal data require extra care. The GDPR identifies “special categories” of personal data under Article 9, which are subject to stricter protection due to their sensitive nature. These include data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data for identification purposes, health data, and data concerning a person’s sex life or sexual orientation. In the context of AI systems, special care must be taken when handling such data, as its misuse can lead to significant privacy violations and harm. For instance, AI systems used in healthcare may process genetic or health data to predict disease risks or recommend treatments, while AI systems in employment contexts may inadvertently process data related to political opinions or trade union membership when evaluating job candidates. Under the GDPR, processing these special categories of data is generally prohibited unless specific legal grounds, such as explicit consent or substantial public interest, are met.
1.2.5 Learn more about data protection regulation
There are online courses to learn more about data protection regulation, for example this free 2-credit MOOC by the University of Helsinki: Introduction to Data Protection Law.
1.3 What is Cybersecurity?
Cybersecurity is the practice of protecting data, systems, networks, and software from unauthorized access, attacks, damage, or disruption. It involves implementing a broad range of strategies and technologies to secure the digital environment, from individual software components to large interconnected infrastructures. Effective cybersecurity ensures that sensitive data remains protected, systems function reliably, and unauthorized parties are kept at bay. The ultimate goal is to safeguard digital assets, maintain operational continuity, and promote trust in digital interactions, whether for personal use, corporate functions, or government operations.
When dealing with AI systems, cybersecurity becomes even more critical due to the unique vulnerabilities these systems can introduce. AI systems process large volumes of data, often sensitive personal information, and can make high-stakes decisions in areas like healthcare, finance, and security. From securing training data to ensuring model integrity and protecting real-time AI applications from adversarial attacks, cybersecurity measures must be adapted to address the specific risks of AI systems.
1.3.1 The CIA Triad
The CIA triad is a foundational model that guides cybersecurity practices, representing the three core principles essential for securing data and systems (Breaux 2020, ch. 9). These principles are especially important in AI systems, where the confidentiality, integrity, and availability of data and models are paramount.
Confidentiality: This ensures that sensitive information is only accessible to those authorized to see it. In AI systems processing personal data, this may involve encrypting or anonymising training data, limiting access to model outputs, and ensuring that personal data used for training is protected from unauthorized parties. For example, AI models trained on sensitive healthcare data must be protected to avoid data leaks that could reveal private patient information. Implementing access controls and encryption for AI model data ensures that confidential information is safeguarded.
Integrity: This ensures the accuracy and consistency of data and AI models. Integrity in AI systems means protecting against unauthorized changes that could affect model behavior or outputs. For example, data poisoning attacks, where malicious inputs are introduced during model training, can corrupt the integrity of the AI model. Implementing checks such as hashing, model version control, and adversarial training helps ensure that the AI system behaves as intended and its outputs can be trusted.
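As a small illustration of an integrity control, here is a minimal sketch (Python standard library only; the file name and expected digest are placeholders) that refuses to load a model file whose SHA-256 digest does not match a known-good value:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 digest of a file, streaming to handle large files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder digest recorded when the model was released
EXPECTED = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

if sha256_of("model_weights.bin") != EXPECTED:
    raise RuntimeError("Model file has been modified: refusing to load")
```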
Availability: This ensures that information, systems, and models are accessible when needed. AI systems need to be resilient to disruptions, such as denial-of-service (DoS) attacks or hardware failures that could render an AI model or system unavailable. Redundancy, load balancing, and robust monitoring mechanisms can help maintain the availability of AI systems, ensuring they remain operational in critical environments like autonomous vehicles or financial systems.
1.3.2 Authentication, access control, encryption
Three other fundamental concepts in the security of digital systems are authentication, access control, and encryption. We cover their basic definitions according to Breaux (2020).
1.3.2.1 Authentication
Authentication is closely tied to identity, a concept that we have explored already when introducing the basic concepts of data protection. In digital systems that require security, it is fundamental that the individual performing an action matches the expected identity. In the scope of AI systems processing personal data, authentication is important not only to identify the user of the AI system, but also to ensure the security of data by identifying software developers, data managers, data controllers, deployers, and any other party required to develop and operate the AI system. Common techniques for authentication are passwords, devices (like RFID keys, or mobile phones with an authenticator code), locations (not only the physical location but also the location of a computer in a network, e.g. a computer inside a protected VPN might be authenticated), and biometrics. The combination of several of these techniques is called multifactor authentication, and it is becoming the de facto standard when strong authentication is required. Chapter 4 of Breaux (2020) goes deeper into these topics and provides a list of limitations for each of the listed techniques.
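As a small illustration of one such factor, here is a minimal sketch of a time-based one-time password (TOTP), the mechanism behind most authenticator codes (assuming the third-party pyotp package is installed):

```python
import pyotp

# Enrolment: generate a per-user secret and share it with the user's
# authenticator app (usually as a QR code)
secret = pyotp.random_base32()
totp = pyotp.TOTP(secret)

# Login: the user types the 6-digit code currently shown by their app,
# and the server verifies it against the shared secret and current time
code_from_user = totp.now()         # here we simulate the user's app
print(totp.verify(code_from_user))  # True within the validity window
```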
1.3.2.2 Access control
For data to be useful it needs to be accessed (Breaux 2020): access control is closely related to authentication, as it is the operation that specifies which (sensitive) data a certain individual or computer program is able to access and process. A few common models for access control are: the access control list, i.e. a list of subjects or groups of subjects that can access the data; rule-based access control, with rules that specify who can access the data; and attribute-based and policy-based access control, which authorise access according to attributes that describe the user. Chapter 9 of Breaux (2020) expands further on this topic.
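Here is a minimal sketch of the simplest of these models, an access control list, as a mapping from each resource to the set of subjects allowed to access it (resource and role names are illustrative only):

```python
# Access control list: resource -> subjects allowed to access it
acl: dict[str, set[str]] = {
    "training_data/patients.csv": {"data_scientist", "data_controller"},
    "model_weights.bin": {"data_scientist", "ml_engineer"},
}

def can_access(subject: str, resource: str) -> bool:
    """Grant access only if the subject appears on the resource's list."""
    return subject in acl.get(resource, set())

print(can_access("data_scientist", "training_data/patients.csv"))  # True
print(can_access("external_auditor", "model_weights.bin"))          # False
```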
1.3.2.3 Encryption
Finally, another important concept in cybersecurity and data privacy is encryption. Encryption is a fundamental technology used in multiple contexts, from authentication and access control to protected communication, and, as we will see later, it can also be used in secure computations. When it comes to data, encryption changes the way that data is stored so that the data is scrambled and cannot be reused without decrypting it. Without going too much into the details, it is important to understand the two main types of encryption: symmetric encryption, where the same key is used to encrypt and decrypt the data, and asymmetric encryption, where – for example – a public key encrypts the data and a private key decrypts it. For a comprehensive reference on the topic, see Chapter 3 of Breaux (2020).
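Here is a minimal sketch of symmetric encryption, where the same key both encrypts and decrypts (assuming the third-party cryptography package is installed; the plaintext is illustrative):

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()    # the shared secret: keep it safe
cipher = Fernet(key)

# Encrypt: the output is scrambled bytes, useless without the key
token = cipher.encrypt(b"systolic_bp=152, patient_id=123")
print(token)

# Decrypt: only the holder of the same key can recover the data
print(cipher.decrypt(token))   # b'systolic_bp=152, patient_id=123'
```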
1.3.3 Learn more about cybersecurity
An important open source resource on these topics is the Open Worldwide Application Security Project (OWASP) with initiatives like the OWASP AI Security and Privacy Guide. Their GitHub repositories are a great way to explore the content available on topics related to cybersecurity and AI. An excellent book that covers all the basic knowledge of privacy in technology is Breaux (2020).
Other concepts related to cybersecurity of AI systems will be covered in module 3.
1.4 Summary
In this chapter, we have provided a foundational understanding of AI and data science, data privacy, and cybersecurity. Experts in these three domains often exist within the same organization but operate in different teams, not fully engaging with one another. This lack of communication can lead to conflicts and undermine the organization’s overall goals.
For example, the data scientist aims to maximize the performance of AI models, which could require access to vast amounts of detailed personal data. Training models on data that has been anonymized or heavily minimized may result in decreased performance. This can lead to frustration, as the data scientist feels constrained by limitations that prevent them from achieving optimal results.
On the other hand, the privacy expert is responsible for ensuring that the handling of personal data complies with laws and ethical standards. Their priority is to protect individual privacy by limiting the amount and type of data used, which may conflict with the data scientist’s objectives.
Meanwhile, the security professional focuses on safeguarding the organization’s systems and data. They may be wary of the data scientist’s desire to deploy code or data on external, less secure cloud platforms, fearing potential breaches or vulnerabilities. The security expert may restrict the movement of sensitive data outside certain secure environments, which can be seen as an obstacle by the data scientist.
These differing priorities highlight the necessity for collaboration among data scientists, privacy experts, and security professionals. AI regulations have begun to bring these experts together, emphasizing the importance of a holistic approach that balances performance, privacy, and security.
Privacy and security experts need to guide data scientists through the limitations imposed by law and ethics while also enabling their work by carefully evaluating and mitigating risks. Overly strict data policies can affect innovation and make the data scientist’s job nearly impossible. At the same time, data scientists must become more aware of the potential risks associated with their work, especially regarding public deployment of AI models.
1.5 Exercises
Here is a series of exercises; the instructor can discuss them in class or use them as assignments for the students.
Can you think of an example where a system using a deterministic approach could be transformed into one that uses artificial intelligence (AI)?
Example 1: University Entrance Exams
University entrance tests typically follow a deterministic approach, where a candidate’s score is based purely on the number of correct answers in a standardized test. If converted to an AI-based system, the scoring could take into account additional factors like a candidate’s educational background, personal data, performance patterns, and even non-cognitive factors such as personality traits or predicted future success.
However, there are inherent risks with such a transformation. An AI system could unintentionally introduce biases based on the candidate’s personal data, leading to unfair advantages or disadvantages. Furthermore, the lack of transparency in how the AI arrives at the final score could reduce trust in the system.
Example 2: Loan Approval Process
Traditionally, loan approvals are determined by fixed rules like income thresholds, credit scores, and debt-to-income ratios. This is a deterministic process. If an AI-based system is used instead, it might analyze a wider range of variables, such as spending behavior, social media activity, or personal relationships, to predict loan default risk.
While this could increase the accuracy of approvals, there are significant risks. The AI might amplify biases present in the data, leading to discrimination against certain groups of people. Additionally, the opacity of AI decision-making could make it hard for individuals to understand why they were denied a loan, challenging fairness and accountability.
Example 3: AI-Based Recruitment
Usually, recruitment processes rely on a deterministic approach where candidates are evaluated based on fixed criteria such as qualifications, years of experience, and performance in interviews. If transformed into an AI-based system, the recruitment process could analyze a broader set of factors such as social media presence, communication style, and personality traits from application materials or interviews using natural language processing (NLP).
This transformation introduces significant risks, especially in the context of the AI Act, which classifies employment-related decisions as high-risk AI systems. The use of AI in recruitment could unintentionally perpetuate biases present in the training data, leading to unfair discrimination against certain candidates based on factors such as gender, ethnicity, or socioeconomic background. Additionally, if candidates are rejected solely based on AI-driven decisions without human intervention, this may violate their rights under GDPR Article 22, which protects individuals from being subject to automated decision-making without adequate safeguards.
| Question | Options |
|---|---|
| 1. What is the primary distinction between Artificial Intelligence (AI) and Machine Learning (ML)? | 1) AI is a subset of ML focusing on data patterns. 2) ML is a subset of AI focusing on data patterns. 3) AI and ML are completely separate fields with no overlap. 4) ML focuses on rule-based systems, while AI focuses on data-driven systems. |
| 2. Which of the following is NOT one of the three broad categories of Machine Learning methods? | 1) Supervised Learning 2) Unsupervised Learning 3) Reinforcement Learning 4) Deterministic Learning |
| 3. In the context of GDPR, which of the following is NOT one of the seven fundamental principles? | 1) Lawfulness, Fairness, and Transparency 2) Purpose Limitation 3) Data Monetization 4) Integrity and Confidentiality |
| 4. What is the main focus of the Integrity principle in the CIA triad of cybersecurity? | 1) Ensuring data is accessible when needed. 2) Protecting data from unauthorized access. 3) Maintaining the accuracy and consistency of data. 4) Backing up data regularly. |
| 5. According to the AI Act, an AI system is defined as: | 1) Any machine-based system designed to operate without human intervention. 2) A machine-based system that operates with varying levels of autonomy to generate outputs influencing environments. 3) A mathematical construct generating predictions based on input data. 4) A software application that replaces human intelligence entirely. |
| 6. Which GDPR principle requires that personal data be kept no longer than necessary? | 1) Data Minimization 2) Storage Limitation 3) Purpose Limitation 4) Accuracy |
| 7. What type of Machine Learning involves an agent learning by interacting with its environment through rewards and penalties? | 1) Supervised Learning 2) Unsupervised Learning 3) Reinforcement Learning 4) Semi-supervised Learning |
| 8. Under GDPR, which of the following is NOT considered a special category of personal data? | 1) Genetic data 2) Biometric data 3) Financial data 4) Data concerning a person’s sex life |
| 9. Which of the following rights allows data subjects to receive their personal data in a structured, commonly used format? | 1) Right to Erasure 2) Right to Rectification 3) Right to Data Portability 4) Right to Object |
| 10. In the context of AI and data privacy, what is the primary challenge when using Machine Learning algorithms? | 1) They are too simple to handle complex tasks. 2) They require large amounts of data, which may conflict with data minimization principles. 3) They always produce explainable outputs. 4) They eliminate the need for data scientists. |
Solutions
1. Answer: 2) ML is a subset of AI focusing on data patterns.
Explanation: Machine Learning is a subset of AI that focuses on algorithms allowing computers to learn from data.
2. Answer: 4) Deterministic Learning
Explanation: Deterministic Learning is not a standard category; the three main categories are Supervised, Unsupervised, and Reinforcement Learning.
3. Answer: 3) Data Monetization
Explanation: Data Monetization is not one of the GDPR principles; the seven principles are Lawfulness, Fairness, and Transparency; Purpose Limitation; Data Minimization; Accuracy; Storage Limitation; Integrity and Confidentiality; and Accountability.
4. Answer: 3) Maintaining the accuracy and consistency of data.
Explanation: In the CIA triad, Integrity refers to maintaining data accuracy and consistency over its lifecycle.
5. Answer: 2) A machine-based system that operates with varying levels of autonomy to generate outputs influencing environments.
Explanation: This is the definition of an AI system according to the AI Act.
6. Answer: 2) Storage Limitation
Explanation: Storage Limitation requires that personal data be kept no longer than necessary for the purposes for which it is processed.
7. Answer: 3) Reinforcement Learning
Explanation: Reinforcement Learning involves learning through interactions with the environment using rewards and penalties.
8. Answer: 3) Financial data
Explanation: Financial data is not considered a special category under GDPR; special categories include genetic data, biometric data, health data, etc. However, do remember that publicly available personal data is still personal data, and you might not have the right to process it lawfully.
9. Answer: 3) Right to Data Portability
Explanation: The Right to Data Portability allows individuals to receive their personal data in a structured, commonly used format.
10. Answer: 2) They require large amounts of data, which may conflict with data minimization principles.
Explanation: ML algorithms often need large datasets, which can conflict with GDPR’s data minimization principle requiring that only necessary data be processed.