
Testing, Verification and Validation of Artificial Intelligence Systems
Modern artificial intelligence systems differ fundamentally from traditional software systems. Their behaviour is shaped not only by algorithmic logic but also by statistical models trained on data, which makes reliability, correctness, and interpretability particularly hard to ensure. The course aims to develop an engineering culture for the design, verification, and operation of intelligent digital systems.
Students master the structure of the project lifecycle, methods for analysing user needs, conducting pre‑project assessments, managing requirements and configurations, organising development teams, and performing schedule planning and project implementation control.
The course progresses from the model to the system, then to quality control, and finally to operational reliability.
Any verification system must begin with an analysis of the system’s goals, user needs, and criteria for result correctness. The goal‑setting phase determines acceptable error margins, quality criteria, and the system’s applicability limits.
The discipline is built around the idea that an AI system is a human‑computer system: the model makes some decisions, while humans make others. Accordingly, verification processes must account for both decision‑making natures. The methodological axis of the course follows this sequence: goal setting → requirements → model → behaviour → verification → interpretation → validation.
Special attention is given to the nature of AI system errors: model hallucinations, logical reasoning errors, statistical artefacts, effects of biased training data, and incorrect generalisations. The course fosters a culture of experimental verification: AI system testing should be based on experiments, comparative tests, model behaviour analysis, and systematic error diagnostics.
Large Language Models (LLMs) are used in the course not only as objects of analysis but also as tools for testing, critical analysis, and building verification systems for AI solutions. LLMs serve in several roles:
- as a testing object — complex models whose behaviour must be checked, analysed, and verified;
- as an analytical tool — for analysing results, detecting errors, formulating alternative hypotheses, and examining decision‑making logic;
- as a peer‑review tool — in cross‑verification practices where one model analyses the results of another;
- as a component of agent‑based verification systems — for automatic error diagnostics, result analysis, test generation, and hypothesis testing.
At the same time, a mandatory verification principle applies: any result obtained using LLMs must undergo additional verification through experimentation, model comparison, and expert assessment. The professional principle of the course is: trust the verification, not the confidence of the model.
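The verification principle above can be illustrated with a minimal sketch. It assumes that answers to the same question have already been collected from several models or repeated runs; the function name `cross_check` and the `min_agreement` threshold are hypothetical choices for illustration, not part of the course materials. A result is accepted only when enough independent answers agree; otherwise it is flagged for experiment or expert review.

```python
from collections import Counter
from typing import Optional, Sequence


def cross_check(answers: Sequence[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority answer if enough independent runs agree, else None.

    None means "do not trust this result": escalate to an experiment or an expert.
    """
    if not answers:
        return None
    # Normalise trivially so that "Paris" and " paris " count as the same answer.
    normalised = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalised).most_common(1)[0]
    agreement = votes / len(normalised)
    return winner if agreement >= min_agreement else None


# Answers that would come from several models or repeated samples (hard-coded here).
print(cross_check(["Paris", "paris", "Lyon"]))  # two of three agree, accepted
print(cross_check(["Paris", "Lyon", "Nice"]))   # no agreement, returns None
```

Agreement between models is itself only weak evidence, since models can share the same bias; this is why the sketch returns a candidate for further verification rather than a final verdict.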
Upon completing the course, students will:
- know the differences between testing, verification, and validation; the nature of AI model errors; sources of hallucinations and incorrect conclusions; methods for analysing model behaviour; the architecture of AI testing systems; automated result verification methods; the role of expert knowledge in validation; and the architecture of human‑in‑the‑loop verification systems;
- be able to identify errors and unstable behaviour in AI models; design tests for AI components; analyse the correctness of model results; develop verification procedures; apply LLMs to analyse model behaviour; integrate expert verification into AI systems; and design AI quality control systems;
- possess skills in diagnosing model behaviour; constructing test scenarios; analysing model errors; conducting comparative model testing; designing AI verification systems; and using LLMs as analytical verification tools.
OBJECTIVES
Understanding the nature of AI system errors;
Mastering methods for testing models and AI components;
Studying methods for verifying AI results;
Mastering approaches to systemic validation of decisions;
Understanding the role of humans in AI verification processes;
Mastering the architecture of AI quality control systems;
Developing skills in building automated result verification systems;
Fostering a culture of critical analysis of model behaviour.
KEY TASKS
Studying the nature of errors and unstable behaviour in AI models;
Analysing limitations of statistical models;
Learning methods for testing AI components;
Mastering techniques for analysing model errors;
Studying result verification methods;
Exploring the architecture of AI testing systems;
Mastering automated result verification methods;
Studying mutual model checking techniques;
Examining agent‑based decision analysis systems;
Applying human‑in‑the‑loop verification methods;
Integrating domain knowledge into verification systems;
Designing AI quality control systems.
Main topics of the course:
Part One
1. AI systems as human‑computer systems. The topic examines AI as a human‑machine decision support complex and highlights the differences between traditional software and AI systems, emphasising the role of humans in interpreting results.
2. Understanding goal setting in AI systems. This topic explores the hierarchy of values, needs, goals, and objectives in intelligent system design, stressing the importance of explicit goal formulation at the start of development.
3. Goal consistency analysis. The topic considers goal conflicts in complex socio‑technical systems and introduces methods for identifying and resolving them through compromise solutions.
4. Degrees of freedom in goal setting and constraints. It discusses the solution space shaped by system goals and the impact of regulatory and technical constraints on AI system design.
5. Criteria and metrics for evaluating AI systems. The topic covers the transition from system goals to performance criteria and quantitative metrics, helping to define measurable indicators for AI success.
6. Requirements for AI systems. It explains how to translate performance criteria into functional and non‑functional requirements and guides the creation of a structured requirements list.
7. AI system architecture: data pipeline, ML pipeline, inference pipeline. The topic provides an overview of key architectural components of AI systems and guides students in developing architectural diagrams.
8. Architecture of LLM systems and Retrieval‑Augmented Generation (RAG). It introduces the structure of LLM‑based systems and RAG, including hands‑on experience in building a simple RAG application.
9. Context and subject area of AI systems. The topic highlights how the correctness of AI results depends on the subject context and guides students in defining the domain context of their project systems.
10. Enriching validation with domain knowledge. It covers methods for integrating knowledge bases and regulatory documents into AI verification, helping to build a project‑specific knowledge base.
11. Sources of AI system errors: data, models, architecture. The topic identifies common sources of errors in AI systems and teaches students to anticipate and document potential issues early.
12. Glitches in generative models and hallucinations. It focuses on typical errors in LLMs, such as hallucinations, and guides students in analysing and documenting real model errors.
13. Features of AI system testing. The topic introduces key approaches to testing AI systems and helps students develop test scenarios and build test sets.
14. Adversarial testing and stress testing of models. It covers techniques for provoking model errors through adversarial queries and designing stress tests to assess system robustness.
15. Verification and validation in AI systems. The topic clarifies the distinction between verification and validation and guides students in preparing analytical reports on AI verification cases.
16. Testing the reasoning and factual validity of AI results. It teaches how to evaluate the logical and factual correctness of AI outputs, using language model responses as a practical example.
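As an illustration of topics 5, 6 and 13 in Part One, the sketch below shows how quantitative metrics computed on a labelled test set might be compared with minimum values taken from the requirements. It assumes a binary classification component; the names `Metrics`, `evaluate`, `unmet_requirements` and the threshold values in `REQUIREMENTS` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float


def evaluate(predictions: list[bool], labels: list[bool]) -> Metrics:
    """Compute basic binary-classification metrics on a labelled test set."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    pairs = list(zip(predictions, labels))
    tp = sum(p and y for p, y in pairs)        # true positives
    fp = sum(p and not y for p, y in pairs)    # false positives
    fn = sum(not p and y for p, y in pairs)    # false negatives
    correct = sum(p == y for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return Metrics(correct / len(labels), precision, recall)


def unmet_requirements(metrics: Metrics, thresholds: dict[str, float]) -> list[str]:
    """List the metric names that fall below their required minimum."""
    return [name for name, minimum in thresholds.items()
            if getattr(metrics, name) < minimum]


# Hypothetical requirement: minimum acceptable value for each metric.
REQUIREMENTS = {"accuracy": 0.5, "precision": 0.6, "recall": 0.9}

m = evaluate(
    predictions=[True, True, False, True, False],
    labels=[True, False, False, True, True],
)
print(m)
print(unmet_requirements(m, REQUIREMENTS))  # only recall falls short here
```

Keeping the thresholds in a separate structure makes the link from goals to criteria to requirements explicit: changing an acceptable error margin changes the test verdict without touching the evaluation code.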
Part Two
1. Automated verification of AI results. The topic discusses scalability challenges in AI verification and explores architectures for automated verification, including their role in modern AI applications.
2. Using LLMs to evaluate results (LLM‑as‑a‑judge). It covers methods for using language models to evaluate other models’ outputs, analysing the advantages and limitations of this approach.
3. Self‑consistency and mutual checking of models. The topic introduces techniques for cross‑verifying results from multiple models and using ensembles to improve reliability.
4. Retrieval‑based verification. It explores methods for validating AI outputs using external knowledge sources and search engines, helping students integrate fact‑checking mechanisms.
5. Agent‑based AI error detection systems. The topic presents architectures using specialised agents (critics, opponents, fact checkers) and guides students in designing agent‑based verification frameworks.
6. Systematic search for AI bugs. It teaches systematic methods for identifying AI errors, including adversarial query generation and building comprehensive test scenarios.
7. Expert validation of AI systems. The topic emphasises the role of domain experts in verification and guides students in developing expert review checklists and defining expert roles in system architecture.
8. Contextual verification of AI systems. It highlights how AI result correctness depends on subject context and domain knowledge, guiding students in generating detailed domain descriptions.
9. Verification pipeline. The topic covers the architecture of an AI testing pipeline (input validation, retrieval validation, response verification) and helps students design a pipeline for their project.
10. AI reliability engineering. It introduces principles for ensuring AI system reliability, including monitoring, observability, and continuous quality assessment, and guides students in finalising their system’s reliability architecture.
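The three stages named in topic 9 of Part Two (input validation, retrieval validation, response verification) can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: all function names, the `Report` class, the 500-character limit, and the substring-based grounding check are hypothetical stand-ins, and a real system would likely use stronger checks at each stage.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    passed: bool = True
    issues: list[str] = field(default_factory=list)

    def fail(self, message: str) -> None:
        self.passed = False
        self.issues.append(message)


def validate_input(query: str, report: Report) -> None:
    # Stage 1: reject empty or oversized queries (the limit is an arbitrary example).
    if not query.strip():
        report.fail("input: empty query")
    elif len(query) > 500:
        report.fail("input: query too long")


def validate_retrieval(documents: list[str], report: Report) -> None:
    # Stage 2: the answer must be backed by at least one retrieved document.
    if not documents:
        report.fail("retrieval: no supporting documents")


def verify_response(answer: str, documents: list[str], report: Report) -> None:
    # Stage 3: naive grounding check, the answer text must occur in some document.
    if not any(answer.lower() in doc.lower() for doc in documents):
        report.fail("response: answer not found in retrieved documents")


def run_pipeline(query: str, documents: list[str], answer: str) -> Report:
    """Run all three stages and collect every issue instead of stopping at the first."""
    report = Report()
    validate_input(query, report)
    validate_retrieval(documents, report)
    verify_response(answer, documents, report)
    return report


docs = ["The Eiffel Tower was completed in 1889."]
print(run_pipeline("When was the Eiffel Tower completed?", docs, "1889"))
print(run_pipeline("When was the Eiffel Tower completed?", docs, "1901"))
```

Collecting all issues in one report, rather than raising on the first failure, supports the systematic error diagnostics that the course emphasises: each failed stage is recorded and can be analysed separately.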