7  Testing and Validating AI Systems

TipLearning outcomes

After completing this chapter, you will:

  • Understand key security threats in the AI lifecycle and their impact on systems.
  • Learn testing methods like red teaming, black box, and white box testing.
  • Explore principles of AI alignment to ensure safe and ethical outputs.

We continue our exploration of the AI lifecycle and focus on the “Verification and Validation” stage. In the MLOps workflow we are basically moving from the “experimentation” to the “production” stage. The verification and validation stage ensures that the AI system from the design and development stage works as expected using the test data that was left out from development. This is also the stage where it’s time to put the system under stress and test if for example personal data can be extracted

In chapter 3 we briefly introduced the types of attacks that AI systems can suffer. Here we go deeper on the explanation of each type of attack and more broadly how an AI system can malfunction. The second part of this chapter focuses on testing strategies, to ensure that the AI model that we have developed is not prone to training (personal) data leaks or wrong predictions. Finally we conclude on the topic of alignment. It is important to notice that while our focus is on protecting personal data, the security considerations on this chapter apply to any AI system trained on any sort of sensitive or proprietary data that needs to be protected.

7.1 Security threats of AI systems

When considering the security of an AI system, it is important to introduce the three types of security threats (see below), when and where they happen in the AI lifecycle. A reference for this section is an important resource that the reader is encouraged to explore often is the OWASP AI Exchange Community (2025).

There are three types of of AI security threats:

  1. Development-Time Threats: threats occurring during the development phase of AI systems (data collection and preparation, model training)
  2. Threats Through Use: threats occurring when the AI model is in operation and interacting with users who provide inputs and receive outputs. Attackers exploit the AI system’s interfaces to deceive or extract sensitive information.
  3. Runtime Application Security Threats: threats that target the AI system while it is deployed and running in a production environment. Attackers aim to manipulate, steal, or disrupt the AI model by exploiting vulnerabilities in the operational setup.

The following table summarises the types of threads that we should be aware when developing and deploying AI systems

Type of Threat Threat Description Impact Example
Development-Time Threats Data Poisoning Inserting malicious data into the training dataset. Causes the AI model to behave undesirably or make incorrect predictions. Corrupting a facial recognition dataset to misidentify individuals.
Model Poisoning Altering model parameters or architecture during development. Embeds vulnerabilities or backdoors into the AI model. Modifying the model to allow future unauthorized access.
Supply Chain Attacks Compromising third-party tools, libraries, or models used in development. Introduces vulnerabilities or malicious code into the AI system. Injecting malicious code into a popular open-source library used for training.
Threats Through Use Evasion Attacks Crafting inputs designed to deceive the AI model. Causes misclassification or incorrect outputs without modifying the model. Altering a stop sign image to fool an autonomous vehicle’s recognition system.
Model Inversion Inferring sensitive training data from the model’s outputs. Breaches privacy by revealing personal or confidential data. Reconstructing data about individuals used in training.
Membership Inference Determining if specific data was part of the training set. Reveals sensitive information about individuals included in the dataset. Identifying whether a person’s data was used in training a criminal risk prediction model.
Prompt Injection Manipulating input prompts to cause the AI to generate harmful or unintended outputs. Circumvents filters, causing the model to produce malicious or inappropriate content. Convincing a language model to disclose sensitive information or generate offensive text.
Runtime Application Threats Runtime Model Poisoning Gaining unauthorized access to alter the model during operation. Causes the model to perform unintended actions or facilitate further attacks. Modifying a deployed fraud detection model to overlook fraudulent transactions.
Model Theft (Extraction) Reconstructing the model’s functionality through repeated queries. Intellectual property theft and loss of competitive advantage. Systematically querying an AI service to rebuild the model.
Denial of Service (DoS) Overwhelming the AI system with excessive requests or computational demands. Degrades service quality or renders the AI system unavailable. Flooding an AI-powered API with resource-intensive queries to crash the service.
Insecure Output Handling Outputs contain malicious code or sensitive information. Enables cross-site scripting (XSS) attacks or leaks confidential data to users. An AI chatbot generating responses with embedded malicious scripts.

7.2 A taxonomy of attacks on AI systems

This section provides a brief overview of key attacks targeting machine learning (ML) systems, categorized by their objectives and attack phases from Vallet (2022), expanding on the OWASP AI Exchange categorisation from the previous section, with a focus on types of attacks:

  1. Manipulative attacks deceive systems during production
  2. Infection attacks corrupt systems during training
  3. exfiltration attacks steal sensitive information

7.2.1 Manipulative Attacks

Manipulative attacks aim to deceive AI systems during their production phase by providing malicious or unexpected inputs, causing the system to behave incorrectly.

  • Evasion Attacks: Evasion attacks involve crafting inputs specifically designed to fool the model. For example, adversarial examples are slightly altered inputs (e.g., images, text, or sound) that appear normal to humans but trick the model into making incorrect predictions. Goodfellow et al. (2014) demonstrated how adding imperceptible noise to an image could make a classifier misidentify it entirely. These attacks exploit weaknesses in how models generalize from training data and are particularly dangerous because they do not require altering the model itself.

    • Example: Adversarial patches placed in an image can make a model misclassify objects, like labeling a “stop” sign as a “speed limit” sign, as shown by Eykholt et al. (2018).
  • Adversarial Reprogramming: Elsayed et al. (2018) introduced a way to hijack an ML model to perform unintended tasks. For instance, a model designed for image classification could be covertly modified to solve unrelated problems, like identifying patterns useful for cryptocurrency mining. This is achieved by retraining the model with data that embeds the new task.

  • Denial-of-Service (DoS) Attacks: These attacks target the availability of the AI system, often by overwhelming its computational resources. Engstrom et al. (2019) explored scenarios where malformed inputs or computationally intensive requests degrade a system’s performance, making it unable to respond to legitimate queries.

7.2.2 Infection Attacks

Infection attacks compromise the training phase of an ML model, introducing malicious changes that affect its behavior during production.

  • Poisoning Attacks: These attacks aim to corrupt the training data, thereby sabotaging the model’s learning process. Nelson et al. (2008) highlighted how injecting misleading or malicious data into the training set could shift the model’s decision boundaries, reducing its accuracy or introducing systematic errors. For example, poisoning a spam filter by adding cleverly designed spam messages could make it misclassify similar spam as legitimate emails.

    • Impact: Poisoning attacks are particularly dangerous for systems trained on public or external data, where attackers can inject tainted samples unnoticed.
  • Backdooring Attacks: In these attacks, the attacker embeds a secret trigger into the model during training, such as a specific pattern or image. When this trigger is present in the input, the model performs a specific action defined by the attacker, regardless of its normal behavior. Liu et al. (2017) demonstrated how a backdoor in a facial recognition system could misidentify individuals wearing a specific pair of adversarial glasses.

7.2.3 Exfiltration Attacks

Exfiltration attacks focus on extracting sensitive information from AI systems, including training data or the model itself. These attacks pose significant risks to privacy and intellectual property.

  • Membership Inference Attacks: These attacks determine whether a specific data point was part of the training set. Shokri et al. (2017) showed how an attacker could infer sensitive details, such as whether an individual’s medical data was used to train a model, by analyzing how confidently the model responds to specific inputs. These attacks exploit the tendency of ML models to “memorize” training data.

    • Example: An attacker could determine whether a person was part of a study about Alzheimer’s disease based on the model’s output for that individual’s data.
  • Model Inversion Attacks: Inversion attacks attempt to reconstruct sensitive data from the model’s outputs. For instance, Fredrikson et al. (2015) demonstrated how to recover a person’s facial image from a facial recognition model by probing the model with various inputs. This type of attack can reveal personal data, such as medical or biometric information.

  • Model Extraction Attacks: These attacks replicate a model by querying it repeatedly and reconstructing its behavior. Tramer et al. (2016) illustrated how attackers could approximate a proprietary model by observing its responses to various inputs. This not only compromises intellectual property but also risks exposing sensitive patterns or biases encoded in the model.

7.3 Testing and Validation of AI Systems

To prevent such threats to AI systems, there are different practices that can be adopted. In this section we will cover Red Teaming, white box testing and black box testing.

7.3.1 Red teaming practices

Red teaming (https://www.ibm.com/think/topics/red-teaming) is a proactive approach to testing and improving the security of systems, including AI models and systems. It involves simulating real-world adversarial behaviors to identify vulnerabilities and evaluate how well a system can withstand attacks. Unlike traditional security audits or penetration testing, red teaming focuses on mimicking advanced threat actors’ techniques, tactics, and procedures (TTPs). This method helps organizations understand their security posture and anticipate potential attacks. Red teaming is especially valuable for AI systems, as it exposes weaknesses in models, data pipelines, and deployment environments that traditional methods may overlook.

7.3.1.1 Red, Blue, and Purple Teaming in Cybersecurity

To comprehensively assess and improve the security of systems, organizations adopt three types of teams—red, blue, and purple—each playing a specific role as shown in the table below. By combining red, blue, and purple teaming, organizations can holistically address the challenges of securing AI systems. Red teaming provides an overview to potential attacker strategies, blue teaming strengthen defenses, and purple teaming ensures collaboration for improving security.

Team Role Goals Relevance to AI
Red Teams Offensive security professionals who simulate real-world attacks on an organization’s systems. Identify and exploit vulnerabilities, bypass defenses, and avoid detection. For AI systems, red teams may craft adversarial examples, attempt model poisoning, or exploit exfiltration techniques like membership inference or model inversion, highlighting gaps in defenses that could be exploited by attackers.
Blue Teams Defensive IT security professionals responsible for protecting the system and data from threats. Monitor for intrusions, respond to alerts, and continuously strengthen security measures. Blue teams safeguard AI pipelines by implementing secure practices such as access controls, input validation, and anomaly detection to mitigate risks exposed by red team activities.
Purple Teams A cooperative collaboration process between red and blue teams. Facilitate knowledge sharing, improve communication, and ensure continuous improvement in organizational security. Purple teams integrate insights from red and blue teams to recommend mitigations, improves system defenses, and validate that improvements address identified weaknesses without introducing new risks.

7.3.1.2 Red Teaming in Practice

Red teaming begins with a clear objective, often defined in collaboration with other parties. Ethical hackers, known as red team members, mimic real-world attackers’ tactics, techniques, and procedures (TTPs). The process is non-destructive and strictly follows a code of conduct to ensure no harm is done to the organization’s systems or data.

Red teams use a variety of tools and methods to simulate attacks:

  • Social Engineering: Techniques like phishing and vishing to trick users into revealing sensitive information.
  • Network Sniffing: Monitoring traffic to gather configuration details and credentials.
  • Application Penetration Testing: Identifying coding flaws, such as SQL injection vulnerabilities.
  • Tainting Shared Content: Embedding malware in shared drives to test lateral movement.
  • Brute Force Attacks: Guessing passwords using lists of commonly used credentials or breached datasets.

For AI systems, red teaming involves:

  • Crafting adversarial inputs to test robustness.
  • Simulating data poisoning attacks to evaluate the security of training pipelines.
  • Testing the resilience of deployed models against extraction or inference attacks.

7.3.1.3 Advances in red teaming practices

Recent advancements in red teaming, as outlined by OpenAI https://openai.com/index/advancing-red-teaming-with-people-and-ai/, focus on combining human expertise and automated tools to identify vulnerabilities in AI systems. OpenAI’s approach introduces structured campaigns that involve external red teamers with diverse expertise and automated methods powered by reinforcement learning (RL) to probe AI systems at scale.

For manual red teaming, OpenAI emphasizes creating a diverse team of experts, ranging from cybersecurity specialists to domain-specific researchers, to assess models across varied use cases. This approach ensures that red teaming campaigns are tailored to specific AI models, with clear goals and structured testing processes. Automated red teaming complements this by using AI to generate diverse and effective attack scenarios, with techniques like multi-step RL and reward mechanisms that prioritize both success and novelty of attacks. These methods allow for testing a wide range of vulnerabilities, such as prompt manipulation and misuse of capabilities, in a systematic and scalable manner.

The novel combination of manual and automated approaches provides deeper insights into potential risks, allowing for more robust safety evaluations and better training of AI models to handle real-world threats. While red teaming isn’t a comprehensive solution to all risks, these innovations mark a significant step toward making AI systems safer and more reliable.

7.3.2 Other testing approaches: white box and black box testing

In machine learning, white box testing and black box testing are two basic methods of model reliability, functionality, and security testing. Each provides insight into different aspects of model behavior and complements the other in a robust testing strategy.

7.3.2.1 Black Box Testing: External Behavior Analysis

Black box testing is a testing of the model based on the input-output behavior of the model, without considering its internal structure. The testers feed different inputs into the model and compare its output against expected results to detect errors in functionality, usability, or performance.

  • Advantages: Black box testing is particularly good for finding issues such as bias, fairness, or unexpected outputs in real-world scenarios. It requires no knowledge of the model’s internal algorithms, making it suitable for testing pre-trained or proprietary models.
  • Limitations: Since it doesn’t go inside the model’s internal mechanism, black box testing cannot pinpoint the root cause of identified issues, often requiring additional debugging or analysis.

7.3.2.2 White Box Testing: Internal Examination

White box testing goes into further detail regarding the model’s architecture, code, and algorithms. This form of testing consists of test evaluators using internal processes, inspecting code for vulnerabilities, and pinpointing inefficiencies or bottlenecks.

  • Advantages: This approach helps in finding bugs in coding, model performance optimization, and ensuring security against adversarial manipulation. For instance, the testers may investigate how data flows through the model to find potential leakages or vulnerabilities in handling sensitive input.
  • Limitations: White box testing is resource-intensive and requires significant technical expertise. It may overlook issues related to external factors like data distribution or user interaction.

7.3.2.3 Combining Approaches

A robust testing strategy often incorporates both methods:

  • Black box testing evaluates the model’s performance and functionality as perceived by end-users.
  • White box testing ensures internal correctness, security, and optimization.

By using these complementary techniques, organizations can thoroughly test machine learning models, ensuring they are both effective and secure in real-world applications.

NoteBox: The case of LLMs and the top security threats in Generative AI

OWASP has released two important guidelines, OWASP Top 10 for LLM Applications (2023 and 2025). LLMs and Generative AI with LLMs have faced huge success and exposure, but the rapid evolution of these technologies has amplified the threats that these systems have been exposed to. The 2025 guidelines build on the 2023 one, addressing new vulnerabilities from the increased integration of LLMs into multimodal systems, such Retrieval-Augmented Generation (RAG, when a further source of data is attached to the prompt that is passed to the LLM), and agentic architectures (basically programs based on LLMs that are able to spawn more instances to complete tasks indepdently of human intervention).

The 2025 version highlights the emergence of System Prompt Leakage, a risk stemming from assumptions about prompt isolation, and expands on Excessive Agency, reflecting growing concerns about granting LLMs autonomy. Additionally, the updated list introduces Vector and Embedding Weaknesses, addressing risks in embedding-based methods critical for grounding model outputs. The redefined Unbounded Consumption broadens the earlier focus on Denial of Service to include cost management in large-scale deployments. These updates reflect the evolving complexity of securing LLMs, urging a community-driven approach to adapt defenses to new vulnerabilities.

While LLMs do not necessarily process personal data and memorize the training sets used, it is often possible to extract personal data with specific prompts, whether the user is acting with or without malicious intent. The European Data Protection Board (European Data Protection Board (EDPB) 2024) further emphasizes that models trained on personal data cannot be assumed to be anonymous, given the possibility of re-identification or indirect extraction of data from model outputs. The EDPB opinion further stresses the importance of conducting robust balance tests that considers the risks of data processing against the rights and freedoms of individuals, ensuring that any data use is proportionate and justified within the GDPR.

7.4 AI model alignment

Considering a more holistic view of the threats of AI systems, some mitigation strategies could also happen within the model itself by identifying malicous inputs to avoid generating unwanted (or unlawful) outputs.

AI alignment (Ji et al. 2023) tries to ensure that AI systems function in accordance with human intentions and values. There are four central principles of alignment: robustness, interpretability, controllability, and ethicality (RICE). These principles guide the forward alignment processes (training models to align with specified objectives) and backward alignment (assessing and governing systems post-training and deployment). Forward alignment emphasizes techniques like reinforcement learning from human feedback (RLHF) and adversarial training, while backward alignment involves assurance mechanisms such as interpretability tools and governance frameworks to monitor risks throughout the lifecycle of AI systems.

CautionHint for instructors

Depending how the instructor is developing the course, and on the level of the participants, this module could be expanded with more practical examples. A good homework to consider is to look at how alignment has failed in the past, or how red-teaming has been described in details in papers such as

7.5 Summary

This chapter focused on testing and validating AI systems, addressing key threats, testing methods, and model alignment to ensure that outputs are not going to cause unwanted data breaches or other sorts of security issues. At this stage the AI model is trained, the AI system is tested, and we can move to Secure Deployment Practices.

CautionExercise 7.1: Multiple choice questions
Question Options
1. What is the main purpose of the “Verification and Validation” stage in the AI lifecycle? 1) To finalize model deployment.
2) To test the system against predefined requirements and stress it under potential threats.
3) To design the AI architecture.
4) To deploy training pipelines.
2. Which of the following is a development-time threat to AI systems? 1) Evasion attacks.
2) Data poisoning.
3) Model theft.
4) Denial of service (DoS).
3. What is an example of a manipulative attack on AI systems? 1) Data poisoning.
2) Adversarial reprogramming.
3) Membership inference.
4) Supply chain attacks.
4. Which type of attack focuses on extracting sensitive data from AI systems? 1) Model inversion attacks.
2) Evasion attacks.
3) Adversarial reprogramming.
4) Denial of service (DoS).
5. What is the role of red teaming in AI system security? 1) To deploy AI systems securely.
2) To simulate adversarial behaviors and identify vulnerabilities.
3) To create new training data.
4) To monitor runtime performance.
6. What is a key limitation of black box testing? 1) Requires deep technical knowledge of the model.
2) Cannot pinpoint the root cause of identified issues.
3) Focuses only on internal processes.
4) Is resource-intensive and time-consuming.
7. How does white box testing differ from black box testing? 1) It evaluates external behavior only.
2) It examines internal model processes and algorithms.
3) It uses no prior knowledge of the system.
4) It targets real-world user inputs.
8. What does “AI alignment” aim to achieve? 1) Reduce training time.
2) Ensure the system functions according to human intentions and values.
3) Improve model performance.
4) Simplify deployment pipelines.
9. Which is a potential risk of model inversion attacks? 1) System downtime.
2) Revealing sensitive training data.
3) Misclassifying inputs.
4) Generating offensive outputs.
10. What is the primary focus of the OWASP guidelines for LLM applications? 1) Improving model explainability.
2) Addressing vulnerabilities specific to LLMs.
3) Optimizing system performance.
4) Improving training data collection.

Click to reveal solutions

  1. Answer: 2) To test the system against predefined requirements and stress it under potential threats.

    Explanation: This stage ensures the AI system functions as expected and identifies vulnerabilities.

  2. Answer: 2) Data poisoning.

    Explanation: Data poisoning manipulates training data to compromise the model’s functionality.

  3. Answer: 2) Adversarial reprogramming.

    Explanation: This attack hijacks a model to perform unintended tasks during production.

  4. Answer: 1) Model inversion attacks.

    Explanation: These attacks reconstruct sensitive data from model outputs.

  5. Answer: 2) To simulate adversarial behaviors and identify vulnerabilities.

    Explanation: Red teaming proactively tests system robustness against potential threats.

  6. Answer: 2) Cannot pinpoint the root cause of identified issues.

    Explanation: Black box testing focuses on input-output behavior, not internal mechanisms.

  7. Answer: 2) It examines internal model processes and algorithms.

    Explanation: White box testing involves a detailed internal review of the model’s structure.

  8. Answer: 2) Ensure the system functions according to human intentions and values.

    Explanation: AI alignment focuses on ensuring ethical and robust system behavior.

  9. Answer: 2) Revealing sensitive training data.

    Explanation: Model inversion can extract personal or confidential data from model outputs.

  10. Answer: 2) Addressing vulnerabilities specific to LLMs.

    Explanation: OWASP guidelines target security risks unique to large language models.