Module Code - Title:

CS4437 - SCALING FOR RELIABILITY AND PERFORMANCE

Year Last Offered:

2025/6

Hours Per Week:

Lecture

2

Lab

12

Tutorial

1

Other

0

Private

10

Credits

15

Grading Type:

N

Prerequisite Modules:

Rationale and Purpose of the Module:

This is Block 10 (15 ECTS) on the 3+1 Integrated BSc/MSc Immersive Software Engineering and runs in Year 3, Weeks 1 to 7 (7 weeks), of the autumn semester. This block focuses on the complexities that arise outside the single-processor, single-computer context. The block begins with a demonstration of the reliability and performance problems that arise when relying on a single processor on a single computer. It then progresses to a review of the problems of multi-processor synchronization, namely maintaining state consistency across multiple processes. Next, we turn to scaling further via a networked cluster of computers. We examine the properties of TCP, HTTP(S), UDP, and other common networking protocols. Using those principles, we explain the CAP theorem and its implications for software design. Finally, students build a capstone system that extends a previous application into a more reliable system with a storage layer (properly sharded and synchronized) and an application layer. Students will demonstrate the reliability of their system through gameday exercises.

Syllabus:

1. Distilling the scaling properties of a system: why and how systems outgrow single machines, and how different systems respond to failure
2. Data synchronization constructs on a single computer (mutexes, condition variables, semaphores, leader election)
3. Properties of common networking protocols (TCP, UDP, HTTP(S), gRPC)
4. Data serialization using formats such as JSON, XML, and protobuf
5. Distributed state/storage solutions, such as distributed databases and P2P networks; analyzing these solutions using concepts such as the CAP theorem, Paxos, and Raft
6. Properties of a reliable system; using gamedays to test reliability; how to use metrics to measure reliability
7. Incident management: how to respond to incidents, how to design systems to make incident management easier, and how to learn from mistakes in a psychologically safe manner

Learning Outcomes:

Cognitive (Knowledge, Understanding, Application, Analysis, Evaluation, Synthesis)

On successful completion of this module, students will be able to:
- Describe the performance and reliability costs of single-processor, single-machine systems
- Use data synchronization constructs in a program that runs on a single machine and a single processor
- Write programs that use networking protocols such as TCP, UDP, and HTTP, and describe their reliability properties
- Write a substantial system that uses serialization formats to exchange complex data
- Instrument a substantial system for automated measurements of reliability at scale and for manual responses to incidents
- Create state management solutions (e.g. databases, distributed key-value stores) for shared state in a distributed system
- Demonstrate an understanding of the scaling properties of different state management solutions (e.g. which approaches work at 10, 100, 1k, or more application nodes)
- Use gamedays to demonstrate the reliability of a system

Affective (Attitudes and Values)

On successful completion of this module, students will be able to:
- Think about reliability as a first-class quality of a system, not an afterthought that can be bolted on later
- Understand the principles of psychological safety needed to generate useful behavior in crisis situations

Psychomotor (Physical Skills)

On successful completion of this module, students will be able to:

How the Module will be Taught and what will be the Learning Experiences of the Students:

The block is taught using problem-based learning, the flipped classroom concept, and blended learning in a state-of-the-art laboratory setting, with an emphasis on collaborative practice and technical excellence. Learning and teaching will be research-led, with a focus on translating theory into practice, innovation, and knowledge creation.

Research Findings Incorporated into the Syllabus (If Relevant):

Prime Texts:

B. Beyer, J. Petoff, C. Jones, and N. R. Murphy (2016) Site Reliability Engineering: How Google Runs Production Systems, O'Reilly

M. Kleppmann (2017) Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, O'Reilly

Other Relevant Texts:

A. Xu (2020) System Design Interview - An Insider's Guide, Second Edition, Independently published

Programme(s) in which this Module is Offered:

Semester(s) Module is Offered:

Autumn

Module Leader:

james.patten@ul.ie