8  Deployment and Monitoring of AI systems

Tip: Learning outcomes

After completing this chapter, you will:

  • Learn best practices for securely deploying AI systems
  • Understand the security risks associated with production environments, and how to mitigate these risks
  • Understand the importance of monitoring AI systems for performance degradation, drift, and bias, to ensure that model performance and alignment remain valid
  • Learn about various techniques for detecting drift, including data drift and concept drift, and how to implement automated alerts for corrective action.
  • Explore the role of incident response plans and ethical monitoring in maintaining transparency and fairness throughout the AI system’s lifecycle.

After all the efforts of securely developing and testing the AI system, it is now time to turn it into an actual product that can be used by users. Deployment can mean different things depending on where and when the model will actually be used. It could be through a web interface (a chatbot, an API to query the AI system), it could be embedded on a device, or it could be integrated into an existing software pipeline where it runs periodically or in real time as part of a larger system. Regardless of the setup, going “live” introduces new security and privacy challenges: the model is now exposed to users, environments, and data flows that most likely were not part of the development or testing stages: expect the impossible! Because of this, deployment must include robust monitoring, access control, and other mechanisms to detect misuse, data drift, or potential leaks of the training or input data.

8.1 Transparency

While one might think that transparency is the last thing to consider when bringing your AI system to users, it is actually a fundamental requirement to build users’ trust and make sure the product is in line with both GDPR and the AI Act regulations. Without going into the details of (personal) data or AI governance, here we focus on practical transparency best practices that should accompany the deployment and public release of your AI system or model:

  • Clear privacy notice: explain in simple language what personal data is collected, how it is used in model training or processing, and how it is protected, and ensure there is a valid legal basis for the processing (European Data Protection Board (EDPB) 2024)
  • AI usage disclosures: Inform users when they are interacting with an AI system or when AI is used in a service. Also explain the model’s scope and limitations, and be honest about what the AI application can and cannot do.
  • Data handling transparency: where user control or consent is needed, state for example whether users’ data is being logged or reused for new purposes, even in formats that might not include personal data
  • Documentation and explainability: finally, documentation of the new AI system/model is fundamental to ensure transparency; also consider explainability, i.e. how the AI makes decisions or why it produced a certain output. An excellent tool for ensuring transparency of an AI model (system) is the model card (or system card) as described in Mitchell et al. (2019).
Note: Model cards

Model cards are a documentation tool for transparency that provide a comprehensive snapshot of a model’s characteristics and ethical considerations. A model card should accompany any AI model trained on personal data (or any important model) to detail its intended use, performance, and the data it was trained on (see “5 things to know about AI model cards” by Desai (2023)).

Depending on the type of model you are deploying, it is useful to explore existing model/system cards, for example those provided by OpenAI, NVIDIA, or Google. Model cards were introduced by Google researchers in Mitchell et al. (2019), so a great starting point is the table provided in the paper, which is annotated below (a minimal machine-readable sketch follows the table):

An annotated model card template based on Mitchell et al. (2019):

  • Model Details
    • Person or organization: Name of developer(s) or organization responsible for the model
    • Model date: Date the model was developed or released
    • Model version: Version identifier for the model
    • Model type: E.g., classification, regression, transformer, etc.
    • Training details: Description of training algorithms, hyperparameters, fairness constraints, regularizations, and feature types
    • Reference resources: Link to paper, website, or documentation with further info
    • Citation details: Citation for academic referencing
    • License: Terms under which the model can be used (e.g., MIT, CC BY)
    • Contact info: Email or web form to reach model developers
  • Intended Use
    • Primary intended uses: Real-world scenarios where the model is designed to be used
    • Primary intended users: Types of users expected (e.g., clinicians, developers, students)
    • Out-of-scope use cases: Known limitations or areas where the model should not be applied
  • Factors
    • Relevant factors: Characteristics that affect model performance (e.g., age, gender, device type)
    • Evaluation factors: Factors used when evaluating the model (e.g., subgroup performance, lighting conditions)
  • Metrics
    • Model performance measures: Metrics used (e.g., accuracy, F1-score, AUROC)
    • Decision thresholds: Thresholds used to turn scores into decisions (e.g., probability > 0.5)
    • Variation approaches: Methods for robustness checks (e.g., stratified evaluation)
  • Evaluation Data
    • Datasets: Names and descriptions of datasets used for evaluation
    • Motivation: Why those datasets were selected for evaluation
    • Preprocessing: Steps taken to prepare the data (e.g., normalization, filtering)
  • Training Data: if available, source, composition, collection process, and any known limitations of the training data
  • Quantitative Analyses
    • Unitary results: Performance across individual factors (e.g., age groups)
    • Intersectional results: Performance across combinations of factors (e.g., young + female)
  • Ethical Considerations: Discussion of fairness, biases, potential harms, and mitigation strategies
  • Caveats and Recommendations: Known weaknesses, areas needing caution, and suggestions for users
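As a lightweight companion to such a template, a model card can also be kept as structured metadata next to the model artifact. The sketch below uses entirely hypothetical values, with field names loosely following the table above, and writes a minimal card to a JSON file that can be published together with the model:

```python
import json

# Hypothetical, minimal model card following the structure of the table above.
model_card = {
    "model_details": {
        "organization": "Example University NLP Lab",
        "model_date": "2025-03-01",
        "model_version": "1.3.0",
        "model_type": "text classifier",
        "license": "MIT",
        "contact": "ml-team@example.org",
    },
    "intended_use": {
        "primary_uses": ["triage of incoming support tickets"],
        "out_of_scope": ["medical or legal decision making"],
    },
    "metrics": {"accuracy": 0.91, "f1": 0.88, "decision_threshold": 0.5},
    "ethical_considerations": "Evaluated for performance gaps across language groups.",
}

# Write the card so it can be versioned and shipped alongside the model artifact.
with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```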

8.2 Basics of AI system deployment

When looking at Huyen (2022), the author stresses how important it is to avoid some of the typical assumptions related to AI system/model deployment. Here are some (wrong) assumptions that are worth examining:

  • You only deploy one or two ML models at a time. In reality, companies have many ML models, and an application may require multiple models for different features. For example, a ride-sharing app needs models to predict ride demand or drivers’ availability. Additionally, if the application operates in multiple countries or cities, each country/city may need its own set of models. Some companies have hundreds or even thousands of models in production!

  • If we don’t do anything, model performance remains the same. ML systems can degrade over time due to software rot and data distribution shifts, especially when the data encountered in production differs from the training data. Your AI models/systems tend to perform best right after training/deployment and will inevitably degrade over time.

  • You won’t need to update your models as much. Since model performance decays over time, models should be updated as frequently as possible. Some companies update their models multiple times a day. The right question to ask is “how often can I update my models” not “how often should I update my models”.

  • Most ML engineers don’t need to worry about scale. Many companies, even those with 100+ employees, need to be able to scale their ML applications. Scaling will always be a concern for the whole team when the product is successful, and this means designing and deploying an AI system that can serve many queries per second or millions of users per month (fun side quest: search for “Our GPUs are Melting”, quote by OpenAI CEO in March 2025)

Now that the myths are cleared, it is also important to consider what type of AI system is being deployed. Huyen (2022) introduces two types of systems:

  • Batch prediction generates predictions periodically or when triggered, stores them, and retrieves them as needed
  • Online prediction generates and returns predictions as soon as requests arrive, and is also known as on-demand prediction (a minimal sketch contrasting the two modes follows this list).
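The sketch below uses a hypothetical predict function in place of a real model; the point is only to contrast precomputing and caching predictions with computing them at request time:

```python
from typing import Iterable

def predict(features: dict) -> float:
    """Stand-in for a trained model; returns a score for one input."""
    return 0.5  # hypothetical constant prediction, for illustration only

# Batch prediction: run periodically (e.g. nightly), store results, look them up later.
def batch_predict(requests: Iterable[dict]) -> dict:
    return {req["id"]: predict(req["features"]) for req in requests}

prediction_cache = batch_predict([
    {"id": "user-1", "features": {"rides_last_week": 3}},
    {"id": "user-2", "features": {"rides_last_week": 0}},
])

# Online (on-demand) prediction: compute the score as soon as the request arrives.
def handle_request(features: dict) -> float:
    return predict(features)

print(prediction_cache["user-1"], handle_request({"rides_last_week": 5}))
```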

While these basic principles are important, they are not the core focus of this book. Similarly, issues related to scalability depend highly on the system on which the AI system is installed, and more generally they are issues that any application might face, even without any AI involved. If we go back to our focus on data protection and security, we next consider a checklist of aspects to consider before and during deployment (for further reading see Ahmad et al. (2024)).

8.3 A secure deployment checklist

So what are the steps to securely deploy your AI model/system?

  1. Secure deployment infrastructure: as mentioned in previous chapters, it is important that the code is able to run on the production computing clusters. While during development it is possible to work on firewalled clusters or trusted environments, when the system goes into production new sources of threats appear due to the expanded attack surface of the deployed system.

  2. Access control: when dealing with AI models trained on personal data or AI systems processing personal data, you might need to implement access control. This is to make sure that only authorised users can access the sensitive AI system. When user inputs also contain personal data, access control is also needed for any data that users provide when querying the AI system.

  3. Model integrity, version control, continuous integration/continuous delivery (CI/CD): version control (of code, data, and models) and CI/CD practices are not only necessary to ensure the reproducibility and transparency of the development work; they become even more important during deployment. With cryptographic hashing it is possible to verify the integrity of deployed models (see the sketch below), and with strict version control and model registries it is always possible to control model updates and set up rollback mechanisms in case of compromised updates. Finally, with CI/CD we can set up automated pipelines to build, test, and deploy models reliably, including automating 1) unit tests for model code, 2) integration tests for data pipelines, and 3) validation tests on model performance.
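As an illustration of integrity checking, the minimal sketch below (the file path and expected digest are hypothetical) compares the SHA-256 hash of a model artifact against the value recorded in the model registry before the model is served:

```python
import hashlib
from pathlib import Path

# Hypothetical artifact and digest; in practice the digest comes from the model registry.
MODEL_PATH = Path("models/churn_classifier_v1.3.pt")
EXPECTED_SHA256 = "3f5a..."  # placeholder for the digest recorded at release time

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if sha256_of_file(MODEL_PATH) != EXPECTED_SHA256:
    raise RuntimeError("Model artifact does not match the registered digest: refusing to deploy.")
```

In practice such a check would run inside the CI/CD pipeline and again at service start-up, so that a tampered or corrupted artifact is never loaded.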

Note: Containerization and infrastructure-as-code

While not specific to AI systems processing personal data, you will most likely deploy your model using containers running on computing nodes that are automatically managed with so-called infrastructure-as-code.

Infrastructure as Code (IaC) is the practice of managing IT infrastructure using machine-readable/actionable configuration files, allowing for automation, consistency, and version control across environments. Instead of manually configuring servers and networks, developers define infrastructure through code, which can be tested and reused like software. Common IaC tools include Terraform, Ansible, Pulumi, AWS CloudFormation, and Chef. For more details, see “Infrastructure as Code” Morris (2025).
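As a minimal sketch of the idea (assuming the pulumi and pulumi_aws Python packages and an AWS account; resource names are purely illustrative), the following program declares a private, versioned S3 bucket for model artifacts as code instead of configuring it by hand:

```python
import pulumi
import pulumi_aws as aws

# Hypothetical example: a private, versioned bucket for model artifacts, declared as code.
artifact_bucket = aws.s3.Bucket(
    "model-artifacts",
    acl="private",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
)

# Export the bucket name so other stacks or CI/CD pipelines can reference it.
pulumi.export("artifact_bucket_name", artifact_bucket.id)
```

Running `pulumi up` would then create or update the resource; the same definition can be code-reviewed, versioned, and reused across environments.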

What are the risks related to containers and infrastructure? Below is a short list of what is important to consider when working with containers and infrastructure managed, for example, with Kubernetes (a small code sketch follows the list). Each topic could easily fill a course of its own; the goal here is to present the reader with examples and terms and let them explore further if and when they need these tools in their work.

  • Isolation and sandboxing
    • Although containers offer isolated environments, they are not completely immune to privilege escalation. Use namespaces and cgroups to enforce strict isolation.
    • Avoid running containers with root privileges; ensure containers operate with the least amount of privilege necessary.
  • Container image security
    • Regularly update and patch container images to fix known vulnerabilities. Tests and updates should always be managed via version control and CI/CD.
    • Use only trusted and verified container images (e.g., from DockerHub or even better from private registries).
    • Scan images for vulnerabilities using tools like Clair, Trivy, or Aqua Security, and make these part of your CI/CD pipeline.
  • Secrets management
    • Avoid hard-coding secrets (API keys, credentials) in container images. Instead, use secure vaults or environment variables for secret management.
  • Network Security
    • Implement strict network policies to control container-to-container and container-to-host communication.
    • Segment containers based on sensitivity (e.g., separating user-facing services from internal model services).
  • Control Plane Security
    • Harden the Kubernetes control plane by limiting access and enabling audit logs.
    • Use role-based access control (RBAC) to define clear roles for users and services interacting with the Kubernetes API.
  • Pod Security
    • Enforce pod security standards to restrict what containers can do (e.g., restricting privilege escalation, disallowing the use of host resources); note that Pod Security Policies have been deprecated and removed in recent Kubernetes versions in favour of Pod Security Admission.
    • Ensure that all Kubernetes pods run with non-root users.
  • Network Policies
    • Enforce Kubernetes Network Policies to control the flow of traffic between pods and services.
    • Secure intra-cluster communication by encrypting data with mTLS (mutual Transport Layer Security).
  • Supply Chain Risks for Kubernetes Plugins
    • Be cautious with third-party plugins (e.g., CNI plugins, ingress controllers) as they could introduce security vulnerabilities. Verify and maintain plugins regularly.
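As a minimal sketch (assuming the official kubernetes Python client; image and names are illustrative), the pod specification below applies several of the points above: non-root execution, no privilege escalation, and a read-only root filesystem.

```python
from kubernetes import client

# Least-privilege settings for the container, following the list above.
security_context = client.V1SecurityContext(
    run_as_non_root=True,              # refuse to start if the image would run as root
    allow_privilege_escalation=False,  # block setuid-style privilege escalation
    read_only_root_filesystem=True,    # the container cannot modify its own filesystem
)

container = client.V1Container(
    name="model-server",
    image="registry.example.com/model-server:1.3.0",  # hypothetical private, scanned registry
    security_context=security_context,
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="model-server", labels={"app": "model-server"}),
    spec=client.V1PodSpec(containers=[container]),
)
```

In most teams such a spec would not be created imperatively but committed as a declarative manifest and applied through the CI/CD and infrastructure-as-code pipelines discussed above.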

8.4 Monitoring AI Systems

Deployment is of course tightly intertwined with monitoring. While monitoring of infrastructures and systems in general is fundamental even without any AI, monitoring AI systems adds a new layer of performance monitoring that becomes crucial for ensuring that the AI system remains aligned with its intended scope and continues to meet ethical and regulatory standards during its use. Over time, models can experience performance degradation, or “drift”, due to changes in input data patterns or other external conditions. Similarly, biases may emerge post-deployment that were not evident during training. Ongoing monitoring helps identify these issues early so that corrective action can be taken.

In this section we will not consider the monitoring of infrastructure (network, access control, application usage); instead we will focus on AI-specific issues related to monitoring deployed AI systems.

8.4.1 AI model drift detection

AI models can drift when the statistical properties of new input data change over time. With concept drift, model accuracy might degrade, or we might start noticing unsafe predictions and unwanted behaviour from the AI system. Possible things to consider include:

  • Data distribution tracking: Continuously track input data statistics (feature distributions, correlations, etc.) and compare them to the training baseline. Significant shifts (e.g. via Kolmogorov-Smirnov tests or the population stability index) can signal data drift before performance issues emerge (see the sketch after this list).
  • Model performance monitoring: For classifiers or other models whose predictions can be compared against labels, accuracy or error rates can be evaluated once the ground truth of the predicted label is known. Sudden drops can indicate concept drift affecting the model’s predictive relationship. This is of crucial importance for AI systems that support decision making in healthcare or finance. For example, in a recent study (Kore et al. 2024), tracking only aggregate accuracy failed to detect a significant COVID-19-induced data shift, whereas a dedicated drift detector caught it.
  • Drift detection algorithms: Streaming drift detection algorithms (e.g. ADWIN, DDM) can raise alerts when model outputs or data distributions change statistically, in some cases even before ground truth is known. These methods can monitor prediction probability distributions or feature importances for unexpected changes.
  • Responsive adaptation: If possible, establish thresholds so that model degradation can be evaluated automatically and a decision can then be made on whether the model needs to be retrained or rolled back to a previous version.
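A minimal sketch of the first point above, with synthetic data standing in for the training baseline and recent production inputs (assuming numpy and scipy are available):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5_000)  # baseline feature values from training
live_feature = rng.normal(0.4, 1.0, size=1_000)   # recent production values (shifted mean)

# Kolmogorov-Smirnov test: a small p-value suggests the two distributions differ.
stat, p_value = ks_2samp(train_feature, live_feature)

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over quantile bins of the baseline distribution."""
    inner_edges = np.quantile(expected, np.linspace(0, 1, bins + 1)[1:-1])
    e_frac = np.bincount(np.digitize(expected, inner_edges), minlength=bins) / len(expected)
    a_frac = np.bincount(np.digitize(actual, inner_edges), minlength=bins) / len(actual)
    e_frac, a_frac = np.clip(e_frac, 1e-6, None), np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

print(f"KS p-value: {p_value:.4f}, PSI: {psi(train_feature, live_feature):.3f}")
# A common rule of thumb treats PSI > 0.2 as a substantial shift worth investigating.
```

Such checks can run on a schedule and feed an alerting system, so that the thresholds mentioned in the last bullet can trigger retraining or rollback automatically.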
Note: How to detect model drift?

By comparing data distributions over time using statistical or machine learning methods to identify significant changes. There are various techniques and tools for detecting model drift, and domain-specific knowledge might help you decide which tool is best for your case. Here we follow Hinder, Vaquet, and Hammer (2024) and the four stages that they recommend (a toy sketch of stages 3 and 4 follows the list):

  1. Select data windows (Stage 1: Acquisition)
    Use one or two time-based windows to collect data for comparison. Common strategies include sliding, fixed, growing, or model-based reference windows.

  2. Describe the data (Stage 2: Descriptor Building)
    Convert raw data into structured representations using descriptors like histograms, decision trees, kernel matrices, or machine learning embeddings to capture the distribution.

  3. Measure change (Stage 3: Dissimilarity Computation)
    Apply a distance or divergence measure (e.g. Total Variation, Hellinger, MMD, KL divergence) to quantify the difference between distributions in the selected windows.

  4. Normalize and evaluate (Stage 4: Normalization)
    Normalize dissimilarity scores to correct for estimation variance using statistical techniques (e.g. permutation testing or p-values) to assess if a detected drift is statistically significant.
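A toy sketch of the last two stages, under simplifying assumptions: the descriptor of stage 2 is simply the raw feature values in each window, the dissimilarity of stage 3 is the absolute difference of window means, and stage 4 uses a permutation test to judge significance.

```python
import numpy as np

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=800)  # Stage 1: reference window
current = rng.normal(0.3, 1.0, size=800)    # Stage 1: current (e.g. sliding) window

def dissimilarity(a: np.ndarray, b: np.ndarray) -> float:
    """Stage 3: a deliberately simple measure, the absolute difference of means."""
    return abs(a.mean() - b.mean())

observed = dissimilarity(reference, current)

# Stage 4: permutation test; under "no drift" both windows come from one distribution,
# so shuffling the pooled data shows how large the dissimilarity would be by chance.
pooled = np.concatenate([reference, current])
n_perm, exceed = 1000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    if dissimilarity(pooled[:len(reference)], pooled[len(reference):]) >= observed:
        exceed += 1
p_value = (exceed + 1) / (n_perm + 1)
print(f"observed dissimilarity = {observed:.3f}, permutation p-value = {p_value:.3f}")
```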

8.4.2 Bias monitoring

If model drift detection is somewhat more focused on the performance of the AI model/system, bias monitoring adds an ethical layer to drift detection, to avoid AI system outcomes that negatively impact individuals from certain demographic groups. After deployment, it is important to continuously assess model outcomes across diverse demographic or other specific sub-groups. For example, in healthcare, AI systems have shown worse accuracy for under-represented patient groups compared to others, indicating biases that were not identified during development (Gichoya et al. 2023).

So what should be considered to perform bias monitoring?

  • Slice performance analysis: track metrics separately for different user groups (e.g. gender, ethnicity, age segments) or other protected attributes.
  • Fairness metrics: beyond accuracy, we can use fairness indicators (demographic parity, equal opportunity difference, etc.) in production monitoring. For example, with a classifier we could compare positive prediction rates or false-negative rates across different groups to see if there are large imbalances between groups (see the sketch below).
  • Bias detection tools: tools such as IBM AI Fairness 360 or Microsoft Fairlearn can automate these checks and highlight significant deviations.

By setting up bias triggers, we can then evaluate if retraining is necessary to ensure that the model does not perpetuate or amplify biases as data evolves.
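A minimal sketch of such a check, with a hypothetical production log and an illustrative alert threshold (libraries such as Fairlearn provide ready-made versions of these metrics):

```python
import pandas as pd

# Hypothetical production log: model decisions plus a protected group attribute.
log = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "A"],
    "prediction": [1, 0, 1, 0, 0, 1, 0, 1],  # 1 = positive decision (e.g. loan approved)
})

# Positive prediction rate (selection rate) per group.
selection_rates = log.groupby("group")["prediction"].mean()

# Demographic parity difference: gap between the highest and lowest selection rate.
dp_difference = selection_rates.max() - selection_rates.min()
print(selection_rates.to_dict(), f"demographic parity difference = {dp_difference:.2f}")

# A monitoring job could raise a bias trigger when the gap exceeds a chosen threshold.
ALERT_THRESHOLD = 0.2  # illustrative value; choose according to your context and metric
if dp_difference > ALERT_THRESHOLD:
    print("Bias alert: investigate, and consider retraining or adjusting the model.")
```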

8.4.3 Privacy risk monitoring

When AI models are trained on or process personal data, privacy risks must be actively monitored. Models/systems could inadvertently leak confidential information about individuals, through model outputs or through other attacks (e.g. membership inference attacks). Success in such attacks can constitute a privacy breach, since it reveals information about who was in the model’s training data.

To monitor and mitigate privacy risks, one can consider the following:

  • Output scanning / output filtering: Continuously scan model outputs (predictions, generated text, etc.) for any sensitive personal data. For example, for a generative model such as a large language model, one can implement a regular-expression or named-entity-based output filter to catch personal data being regurgitated and block the response before it is displayed to the final user (a minimal sketch follows this list). Output filters are a necessary tool, since large models can memorize and output verbatim personal details seen during training (Carlini et al. 2021).
  • Monitoring model confidence: One potential sign of privacy leakage is overly confident predictions on certain inputs, which could imply that those data points were in the training set.
  • Differential privacy logging: If a model was trained with differential privacy (DP), depending on the DP technique adopted it may be possible to monitor the privacy-loss metric (ε) over time and ensure it stays within acceptable bounds.
  • Privacy audits & penetration testing: While not strictly related to live monitoring, performing regular privacy audits on the model can prevent unwanted privacy risks. This can involve simulating attacks like membership inference or model inversion in a controlled setting to see if the model is vulnerable. F

8.4.4 Other types of monitoring

Other types of monitoring could be considered, especially when an AI system is categorised as high-risk. For example, one can consider ethical monitoring, i.e. monitoring possible misalignment of the AI system during deployment, especially when comparing it with results from the development stage and with what is stated in the model card. Accountability and explainability monitoring can also be important, together with redress mechanisms to track and address complaints or appeals related to AI decisions. Establishing clear processes for humans to override or correct AI decisions is fundamental, not only as a legal requirement but also as a strong ethical principle for AI systems that can affect individuals.

Together, ethical and explainability monitoring help maintain the social license to operate by ensuring transparency, fairness, and accountability. These practices not only support compliance with regulations like the AI Act, but also promote responsible and trustworthy AI over time.

8.5 Incident Response and Recovery Plans

Finally, monitoring should also be linked to incident response and recovery plans. These plans outline how to react when an AI system produces harmful, incorrect, or insecure outcomes after deployment. Traditional incident response (as used in cybersecurity) must now be adapted to cover the AI-specific failures outlined above.

So what should be done?

  • Form an AI incident response team: a cross-organisational team involving data scientists, engineers, legal and communication experts can quickly react to unwanted incidents.
  • Prepare for common scenarios: Develop runbooks for likely incidents (e.g., model performance collapse, data pipeline failure, or any other detected attack).
  • Integration with CSIRT Processes: Incorporate AI incidents into the organization’s existing Computer Security Incident Response Team framework (CSIRT). The response process should cover identification, containment, eradication, recovery, and lessons learned, tailored to AI. For example, if a model is compromised or producing biased content, the containment might involve taking the model offline or enabling a safe backup model (see VanHoudnos et al. (2024))
  • Prevention: After an incident is resolved, it is good practice to perform a post-mortem analysis focusing on the AI aspects. Was it a data quality issue? Concept drift? Adversarial inputs? How can monitoring be improved?
Caution: Hint for instructors

This section introduced many new concepts, which can be difficult for learners to fully grasp without practical, hands-on activities. Ideally, learners should have the opportunity to experiment with deployment (including containerization and infrastructure as code) and monitoring, using AWS, Azure, or other similar services that support small-scale testing at little or no cost.

8.6 Summary

This chapter focused on the practical aspects of releasing and deploying an AI system/model that processes (or was trained on) personal data. In the next chapter we will outline how to react when model performance is decreasing and what to consider for the secure decommissioning of AI systems/models that involve personal data.

Caution: Exercise 8.1: Multiple choice questions
  1. What is a key reason why monitoring is necessary after AI deployment?
    1) To reduce training costs
    2) To automatically generate new datasets
    3) To detect drift and ensure continued alignment with intended use
    4) To convert the model into a system card

  2. What is the purpose of a model card in AI deployment?
    1) To track container resource usage
    2) To describe the model’s ethical implications and usage limitations
    3) To secure the container registry
    4) To ensure legal protection of the developer

  3. Which of the following is a valid technique for detecting data drift?
    1) OAuth authorization
    2) Feature distribution comparison
    3) SSH key rotation
    4) Container privilege escalation

  4. What is a good practice regarding container privileges in secure AI deployment?
    1) Use root privileges for all containers
    2) Allow containers to share host resources
    3) Run containers with the least privilege necessary
    4) Disable sandboxing for performance

  5. What is the primary function of Infrastructure as Code (IaC)?
    1) Encrypt user data before training
    2) Dynamically resize training datasets
    3) Automate and version control IT infrastructure
    4) Containerize neural network weights

  6. What kind of monitoring focuses on differences in model outcomes across demographic groups?
    1) Performance monitoring
    2) Bias monitoring
    3) Network monitoring
    4) Supply chain monitoring

  7. What can an AI system leaking sensitive data via its outputs indicate?
    1) Successful concept drift detection
    2) An effective model registry
    3) A privacy risk or potential breach
    4) A well-calibrated output filter

  8. Which of the following is a suitable response to detecting significant concept drift in a deployed AI model?
    1) Increase container memory
    2) Revert to older training datasets
    3) Retrain or roll back the model
    4) Disable CI/CD pipelines

  9. Why should ethical monitoring be implemented in high-risk AI systems?
    1) To optimize infrastructure costs
    2) To meet environmental regulations
    3) To ensure alignment with model card expectations and social responsibility
    4) To enable auto-scaling

  10. What is the goal of an AI-specific incident response plan?
    1) To minimize GPU usage
    2) To respond to AI failures such as model bias or adversarial inputs
    3) To encrypt training pipelines
    4) To disable user access logs

Solutions

  1. Answer: 3) To detect drift and ensure continued alignment with intended use
    Explanation: Monitoring after deployment helps identify concept or data drift and performance degradation, ensuring the model remains reliable and compliant.

  2. Answer: 2) To describe the model’s ethical implications and usage limitations
    Explanation: Model cards provide structured documentation for transparency, including ethical considerations, intended use, and performance.

  3. Answer: 2) Feature distribution comparison
    Explanation: Comparing feature distributions over time (e.g., using PSI or statistical tests) is a key method for detecting data drift.

  4. Answer: 3) Run containers with the least privilege necessary
    Explanation: For security, containers should operate under minimal privileges to reduce the risk of exploitation.

  5. Answer: 3) Automate and version control IT infrastructure
    Explanation: IaC allows the use of code to manage infrastructure reproducibly and consistently across environments.

  6. Answer: 2) Bias monitoring
    Explanation: Bias monitoring focuses on evaluating whether AI systems treat different demographic groups fairly during deployment.

  7. Answer: 3) A privacy risk or potential breach
    Explanation: Sensitive data appearing in outputs may indicate memorization of training data and constitutes a serious privacy concern.

  8. Answer: 3) Retrain or roll back the model
    Explanation: Corrective action for significant drift includes retraining with updated data or reverting to a more stable version.

  9. Answer: 3) To ensure alignment with model card expectations and social responsibility
    Explanation: Ethical monitoring ensures that real-world use aligns with documented intentions and protects fundamental rights.

  10. Answer: 2) To respond to AI failures such as model bias or adversarial inputs
    Explanation: AI-specific incident response plans handle incidents involving failures in model performance or unintended outcomes.