Big Data Engineering

The course combines theory with applied engineering. Its goal is not the isolated mastery of individual technologies, but to develop students’ systemic understanding of how big data processing systems are designed, implemented, and maintained under real engineering constraints.

A modern digital system encompasses not only application logic and an infrastructure platform, but also a dedicated data processing workflow. This workflow covers data origin, methods of data acquisition, storage architecture, processing, result publication, quality control, and data lifecycle management.

Therefore, within the course, big data engineering is treated not as a collection of isolated technologies, but as a fully‑fledged architectural layer of a digital system.

Architectural decisions in data engineering are always made with consideration of constraints imposed by the type of data origin, the method of data ingestion, the structure and variability of sources, quality requirements and deadlines for delivering results, execution environment limitations, and requirements for maintenance and evolution.

Thus, designing a data pipeline and platform means finding a balance between the requirements for the outcome and the actual properties of the data and the operational environment.

Within the course, data solutions are viewed not as standard templates but as a space of alternatives. Students analyse the differences between transactional sources, event‑based sources, APIs, and web sources. They also compare regular data loading with sources whose scope changes, centralised with distributed processing, simple pipelines with more mature data platforms, and batch logic with architectures that require a later transition to stream processing. The data pipeline is considered part of the lifecycle of an engineering project.

A distinctive feature of the course is the end‑to‑end use of large language models (LLMs) as a tool for engineering analysis, generating architectural alternatives, project reflection, and supporting students’ independent work.

The course aims to build the ability to select a solution based on the real characteristics of a task, rather than on the popularity of a particular technology.

OBJECTIVES

Understanding data engineering as a distinct engineering layer of digital systems;

Analysing data origin and its influence on architectural decisions;

Mastering principles of data pipeline and data platform design;

Developing the ability to choose architectural solutions depending on data type, ingestion mode and environmental constraints;

Building skills in engineering decomposition and design of data solutions;

Fostering a culture of using LLMs in engineering analysis and project activities.

KEY TASKS

To develop a systemic view of the place and role of data engineering in the architecture of digital systems;

To study types of data origin and strategies for data acquisition;

To master principles of designing data platform and data pipeline architecture;

To learn data storage models and basic approaches to big data processing;

To master ETL/ELT principles, batch and distributed processing;

To study methods for ensuring data quality;

To analyse principles of orchestration, reproducibility, reliability and observability of data pipelines;

To build skills in project decomposition of engineering data solutions;

To prepare for completing a coursework project on a Big Data track topic;

To foster a culture of conscious use of LLMs in engineering design and analysis.

Main topics of the course:

1. Introduction to Data Engineering. The role of data engineering in digital systems, differences from Data Science, analytics, and applied development. The lifecycle of an engineering data task — from problem statement and requirements to solution operation.

2. Data Platform Architecture. Components: data sources, ingestion layer, storage, processing, result publication, monitoring, and quality control. Architecture as a way to reconcile requirements and constraints, not just a set of technologies.

3. Data Origin as a Key Architectural Factor. Analysis of transactional sources, event streams, APIs, and regular data dumps. The impact of data origin on processing mode, update frequency, and quality requirements.

4. Complex Data Acquisition Scenarios. Web scraping, regular collection from external sources, adversarial sources, and changing source scope. Technical, organisational, and legal constraints of data acquisition methods.

5. Lifecycle of a Data Engineering Project. Stages: problem statement, requirements, source analysis, architectural design, pipeline development, quality control, operation, and evolution. The concept of architectural forks.

6. Decomposition of an Engineering Project. Transition from a general idea to a sequence of steps: data collection, preparation, storage, processing, quality control, and result delivery. Analysis of stage dependencies and critical points.

7. Data Storage Principles in Big Data Systems. Local, centralised, and distributed storage; file‑based, table‑based, and object‑based approaches. The impact of storage method on processing and maintenance.

8. Batch Data Processing. ETL and ELT approaches, processing steps, launch window, idempotency, re‑runs, and result reproducibility. Practical construction of a batch processing scenario: reading, cleaning, transformation, aggregation, and result publication.

9. Distributed Data Processing. Reasons for moving computations to a cluster, differences between heavy and light operations. Architectural significance of join, shuffle, and aggregation operations. Analysis of resource‑intensive operations and identification of bottlenecks.

10. Data Quality, Cleaning, and Validation. Handling missing values, duplicates, incorrect data types, outliers, and business rule violations. Differences between technical validation, substantive checking, and data cleaning. Development of data quality rules for a dataset.

11. Pipeline Orchestration and Processing Execution Management. Stage dependencies, scheduling, re‑runs, status control, and logging. Orchestration as a means of ensuring reproducibility and manageability (not just a launch schedule). Construction of a DAG (Directed Acyclic Graph) for a pipeline, analysis of success/failure conditions and re‑execution requirements.

12. Reliability, Maintenance, and Reproducibility of a Data Pipeline. Logging, version control, traceability, failure diagnostics, re‑execution, and change management. Analysis of typical incidents: source schema changes, incomplete loading, partial processing, and result quality violations. Development of a diagnostics and recovery plan for a failure scenario.

13. Performance and Processing Optimisation. Identification of redundant steps, resource‑intensive join operations, repeated data reading, and unjustified intermediate layers. Trade‑off between solution speed and maintainability. Comparison of “naive” and “improved” pipeline variants: how acceleration is achieved and at what cost (increased architectural complexity). Detection of bottlenecks and assessment of optimisation feasibility.

14. Overview Topics: Streaming Processing, Event‑Driven Approach, Data Lake, and Lakehouse. Conceptual introduction to technologies as a bridge to subsequent disciplines in the track (without in‑depth study). Comparison of batch processing, near real‑time processing, and Data Lake/Lakehouse logic using a single case study. Analysis of requirements leading to architectural complexity. Distinction between dashboard, Data Lake, and Lakehouse logic based on complexity, cost, and applicability criteria.

15. Project Integration into a Holistic Engineering System and Pre‑Defence. Presentation of project architecture, implementation stages, constraints, quality rules, and design decisions. Comprehensive project presentation: demonstration of component consistency and justification of architectural and engineering choices. Project refinement based on feedback: addressing weaknesses, clarifying constraints, quality rules, and pipeline stages.
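Topic 8 names idempotency and re‑runs as properties of batch processing. As an illustration only, the sketch below shows one common way to obtain them: each run replaces the whole result partition for its launch window inside a single transaction, so repeating the run does not duplicate rows. The table `daily_totals`, its columns, and the sample events are hypothetical and are not taken from the course materials.

```python
import sqlite3


def aggregate(events):
    """Sum amounts per category; events are (category, amount) pairs."""
    totals = {}
    for category, amount in events:
        totals[category] = totals.get(category, 0) + amount
    return sorted(totals.items())


def load_partition(conn, run_date, rows):
    """Replace all results for run_date so that a re-run cannot duplicate rows."""
    with conn:  # one transaction: the delete and the inserts commit together
        conn.execute("DELETE FROM daily_totals WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_totals (run_date, category, total) VALUES (?, ?, ?)",
            [(run_date, category, total) for category, total in rows],
        )


# Hypothetical in-memory target table and input data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_totals (run_date TEXT, category TEXT, total INTEGER)"
)
events = [("a", 1), ("b", 5), ("a", 2)]

for _ in range(2):  # simulate a re-run of the same launch window
    load_partition(conn, "2024-01-01", aggregate(events))

result = conn.execute(
    "SELECT category, total FROM daily_totals ORDER BY category"
).fetchall()
print(result)
```

Overwriting a partition is only one possible design. Upserts keyed on a natural identifier are a common alternative when results cannot be partitioned by launch window.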
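Topic 10 distinguishes technical validation from substantive (business rule) checking and asks for data quality rules for a dataset. A minimal sketch of such rules might look as follows. The field names `order_id` and `amount`, the rule names, and the sample records are assumptions made for illustration.

```python
def check_row(row):
    """Return the names of the rules that a single record violates."""
    problems = []
    # Technical validation: required field is present.
    if row.get("order_id") is None:
        problems.append("missing_order_id")
    # Technical validation: amount has a numeric type (bool is excluded).
    amount = row.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount_not_numeric")
    # Business rule: an order amount must not be negative.
    elif amount < 0:
        problems.append("negative_amount")
    return problems


def validate(rows):
    """Return (row index, violated rules) for every record with problems."""
    seen_ids = set()
    report = []
    for index, row in enumerate(rows):
        problems = check_row(row)
        order_id = row.get("order_id")
        if order_id is not None:
            # Dataset-level rule: order_id must be unique.
            if order_id in seen_ids:
                problems.append("duplicate_order_id")
            seen_ids.add(order_id)
        if problems:
            report.append((index, problems))
    return report


# Hypothetical sample dataset.
rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 12.5},
    {"order_id": None, "amount": 3},
    {"order_id": 2, "amount": "7"},
    {"order_id": 3, "amount": -4},
]
report = validate(rows)
for index, problems in report:
    print(index, problems)
```

The sketch only reports violations. Whether a flagged record is dropped, repaired, or quarantined is a separate cleaning decision, which is the distinction the topic draws.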
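Topic 11 covers the construction of a DAG for a pipeline together with success and failure conditions. The sketch below, which assumes Python 3.9 or later for the standard `graphlib` module, derives an execution order from stage dependencies and skips every stage downstream of a failure. The stage names and the simulated failure are hypothetical, and a real orchestrator would add scheduling, retries, and logging on top of this.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "quality_check": {"clean"},
    "publish": {"aggregate", "quality_check"},
}


def execution_order(dag):
    """Return one valid order in which every stage follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def run(dag, tasks):
    """Run stages in dependency order; skip stages whose dependencies did not succeed."""
    status = {}
    for stage in execution_order(dag):
        if any(status[dep] != "success" for dep in dag[stage]):
            status[stage] = "skipped"
            continue
        try:
            tasks[stage]()
            status[stage] = "success"
        except Exception:
            status[stage] = "failed"
    return status


def failing_check():
    raise ValueError("quality rule violated")


# Every stage is a no-op except the quality check, which fails on purpose.
tasks = {name: (lambda: None) for name in pipeline}
tasks["quality_check"] = failing_check

order = execution_order(pipeline)
status = run(pipeline, tasks)
print(order)
print(status)
```

Because the status of each stage is recorded, a re‑execution could restart from the failed stage rather than from the beginning, which is the sense in which the topic treats orchestration as more than a launch schedule.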

The course treats big data not only as large volumes of information, but as a specific class of engineering tasks, where data origin, ingestion mode, storage architecture, processing logic, quality requirements, observability and reproducibility form a unified project system.