Observability Technology

Modern digital services are built on a microservice architecture using container infrastructure, dynamic scaling, and frequent releases. Under such conditions, system stability largely depends on its observability — the ability to provide data that enables analysis of the current state and identification of the root causes of emerging issues. The course treats observability as an independent engineering discipline, covering telemetry collection, monitoring and metrics analysis, centralized logging, distributed request tracing, incident diagnostics, and assessment of the operational resilience of digital services.

Within the course, observability of software systems is treated as an integral element of the engineering architecture of modern digital services. The learning process follows the trajectory: Telemetry → Monitoring → Tracing → Incident Analysis → Reliability Engineering. Students study observability as a set of interconnected subsystems: metrics (quantitative indicators of system state), logs (event records of component operation), and traces (data on the sequence of steps taken to execute a distributed request). Integrating these telemetry sources enables effective diagnosis of complex operational problems.
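The way the three telemetry sources fit together can be illustrated with a minimal sketch. It uses only the Python standard library; the in-memory sinks (`METRICS`, `LOGS`, `SPANS`) and the function names are hypothetical stand-ins for real backends and are not part of the course materials. The point it shows is that a shared trace identifier lets logs and spans from one request be joined during diagnosis.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-memory sinks standing in for real metric, log, and trace backends.
METRICS = Counter()
LOGS = []
SPANS = []


@contextmanager
def span(name, trace_id):
    """Record how long a named step took as part of one trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000.0,
        })


def log_event(trace_id, level, message):
    """Append a structured (JSON) log line that carries the trace identifier."""
    LOGS.append(json.dumps({"trace_id": trace_id, "level": level, "message": message}))


def handle_request(fail=False):
    """Simulated request handler that emits a metric, a log line, and two spans."""
    trace_id = uuid.uuid4().hex
    METRICS["requests_total"] += 1
    with span("handle_request", trace_id):
        with span("query_database", trace_id):
            if fail:
                METRICS["errors_total"] += 1
                log_event(trace_id, "ERROR", "database timeout")
            else:
                log_event(trace_id, "INFO", "query ok")
    return trace_id


def telemetry_for(trace_id):
    """Join logs and spans on the shared trace identifier."""
    logs = [entry for entry in map(json.loads, LOGS) if entry["trace_id"] == trace_id]
    spans = [s for s in SPANS if s["trace_id"] == trace_id]
    return logs, spans
```

In this sketch the metric counters only say that an error happened; calling `telemetry_for` with the trace identifier returned by `handle_request(fail=True)` retrieves the matching log line and the spans that show where the time was spent.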

Special attention is given to the relationship between observability and Site Reliability Engineering (SRE) practices. Learners master the use of Service Level Indicators (SLI), Service Level Objectives (SLO), error budgets, and engineering methods for analysing operational incidents.
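The relationship between an SLI, an SLO, and an error budget can be shown with a short worked sketch. It assumes a request-based availability SLI (successful requests divided by total requests); the function names and the example figures are illustrative assumptions, not definitions taken from the course.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of events that were good (1.0 when there was no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def allowed_failures(slo_target: float, total_events: int) -> float:
    """Error budget expressed as the number of events that may fail under the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; a negative value means it is overspent."""
    if total_events == 0:
        return 1.0
    failed = total_events - good_events
    return 1.0 - failed / allowed_failures(slo_target, total_events)


# Assumed example: a 99.9% SLO over 1,000,000 requests, 600 of which failed.
sli = availability_sli(999_400, 1_000_000)
remaining = error_budget_remaining(0.999, 999_400, 1_000_000)
```

With these assumed figures the SLO allows about 1,000 failed requests, 600 have occurred, and so roughly 40% of the error budget remains.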

An innovative element of the course is the use of Large Language Models (LLMs) as a tool to support engineering activities. LLMs are applied to analyse telemetry (interpreting metrics, logs, and traces), to advise on observability principles and the analysis of monitoring architectures, and to diagnose incidents (analysing operational data and identifying potential root causes). A key learning principle is the mandatory verification of all conclusions obtained using LLMs: students validate them through engineering analysis and practical experiments.

Upon completing the course, graduates acquire competencies as specialists in DevOps Engineering, Site Reliability Engineering (SRE), Platform Engineering, and Cloud Infrastructure Engineering. They know the principles of information system observability, monitoring system architecture, and telemetry collection methods. They are able to design observability architecture, analyse metrics and telemetry, diagnose performance issues, and formulate reliability indicators for digital services. Practical skills include working with telemetry monitoring and analysis tools, applying methods for diagnosing operational incidents, and critically evaluating results obtained with the help of LLMs.

KEY TASKS

To study the architecture of software observability systems;

To master methods of collecting telemetry from information systems;

To study methods for monitoring software system metrics;

To master centralized logging technologies;

To study distributed request tracing methods;

To master methods of system performance analysis;

To study Site Reliability Engineering (SRE) practices;

To master methods for diagnosing operational incidents;

To develop skills in analysing operational data;

To foster a culture of using LLMs as an engineering analysis tool.

Main topics of the course:

1. Introduction to Information Systems Observability. Covers the evolution of information systems operation, DevOps and SRE approaches, the distinction between monitoring and observability, and the primary telemetry sources (metrics, logs, traces), with a focus on identifying observability points in microservice architectures.

2. Telemetry in Distributed Systems. Explores telemetry collection architecture, types of metrics and events, and telemetry standards (e.g., OpenTelemetry), teaching students how to instrument code for data collection and analyse telemetry tools.

3. Metrics and Monitoring Systems. Focuses on application and infrastructure metrics, time‑series data, and monitoring system architecture, guiding students in configuring monitoring for microservices and visualising performance indicators.

4. Logging and Event Analysis. Introduces structured logging and centralized logging systems, covering event correlation in distributed environments and teaching students to centralise logs and identify errors through semantic analysis.

5. Distributed Tracing. Addresses challenges in analysing microservice interactions, covering service call chains and execution delays, and helps students set up distributed tracing to identify system bottlenecks through trace analysis.

6. Observability of Container Platforms. Examines observability in container orchestration systems (e.g., Kubernetes), including container and cluster metrics, and teaches students to monitor containerised applications and diagnose infrastructure issues.

7. Site Reliability Engineering. Covers reliability metrics (SLI, SLO) and error budgets, guiding students in developing reliability models for digital services and applying SRE practices to maintain service quality.

8. Performance Analysis of Information Systems. Focuses on identifying system bottlenecks and capacity planning, teaching students to conduct performance analysis of microservice applications and formulate hypotheses about performance degradation causes.

9. Incident Management and Root Cause Analysis. Covers incident management processes and postmortem analysis, helping students diagnose application failures, structure telemetry data, and prepare detailed reports on incident causes and solutions.

10. Observability and Intelligent Operations (AIOps). Explores automatic anomaly detection and self‑healing systems, covers the integration of observability tools into CI/CD pipelines, and teaches students to leverage LLMs for telemetry analysis and to identify opportunities for operational automation.
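Topic 10 mentions automatic anomaly detection. One of the simplest forms of it can be sketched as a trailing-window z-score check over a metric series. This is an illustrative baseline under assumed parameters (`window`, `threshold`) and made-up latency values, not the method taught in the course; production systems typically also account for seasonality and trend.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate from the mean of the preceding
    `window` values by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        # Only judge a point once a full window of earlier points exists.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # A perfectly flat window (sigma == 0) is skipped to avoid dividing
            # the deviation by zero; a real detector would apply a minimum sigma.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(index)
        history.append(value)
    return anomalies


# Assumed example: request latencies in milliseconds with a single spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 350, 101, 99, 100]
spikes = detect_anomalies(latencies_ms)
```

For this series only the spike at index 6 is flagged. Note one limitation of the sketch: the spike itself enters the window afterwards and inflates the standard deviation, so a second anomaly shortly after the first could be missed.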

OBJECTIVES

Forming a systematic understanding of the principles of information system observability;

Mastering methods of analysing telemetry from distributed software systems;

Developing skills in diagnosing operational problems of digital services;

Mastering engineering practices that ensure the reliability of software systems;

Building Site Reliability Engineering competencies;

Developing skills in applying LLMs to telemetry analysis and engineering diagnostics.
