Foundations

Reliable, scalable, and maintainable applications

Typically, data-intensive appls are built from standard building blocks. Many applications need to:

Store data (databases)
Remove result of expensive operation (caches)
Allow users to search data by keyword or filter in various ways (search indexes)
Send message to another process to be handled asynchronously (stream processing)
Periodically crunch large amount of data (data processing)

Thinking about data systems

The building blocks (caches, queues, etc.) wseem quite idfferent- why lump them togethr?

Lots of overlap between categories
No single tool can meet of data processing or storage needs When you combine several tools, you’ve baseically createda. new special-purpose data system

When designing these things, 3 concerns are most important

Reliability: work correctly even in face of adversity
Scalability: should have reasonable ways of dealing with the system’s growth
Maintainability: engineers should be able to work on it productively

Reliability

Typical expectations

Application performs function user expected
Can tolerate user making mistakes
Performance good enough for the use case under expected lload / data volume
Prevents unauth access and abus

Our goal should be to create fault-tolerant systems

Might want to actually increase rate of faults by triggering them deliberately

Fault vs failure

Fault: One component deviating from its spec
Failure: system stop providing service to user Goal: prevent faults from causing failures

Hardware Faults

e.g. hard disk crashes, faulty RAM, blackout, etc. With > data volume and computing demands, many many machines, and thus need to tolerate the loss of entire machines

Software Errors Systematic error within the system Examples

Software bug which caues each instance of an applcication server to crash when given a particular bad inpu
Runaway process which uses up some shared resource
Etc.

Human Errors Inherently unreliable Combine several approaches

MInimize opportunities for error
Decouple places where people make the most mistakes from where they can cause failures
Test thorougly
Enable quick recovery
Clear monitoring

Scalability

Ability to cope with increased load Consider key qustions e.g.
How can we add computing resources to handle additional load?
If system grows x way, what are our optins for coping with the growth?

Describing Load Load is described with a few number- load parameters Depends on your system’s architecture e.g.

Requests per second to web server
Ratio of reads to writes
etc.

As an example, for Twitter- the scaling challenge comes from fan-out

Describing performance After describing load, can investivate what happens when load increases 2 ways of looking at it- when you increase a load parameter…

and keep system resources unchanged, how does performance change?
how much do you need to increase resources to keep performance unchanged? Need to have performance numbers to think about this
E.g. in batch processing- throughput
Latency
Response time percentiles

Approaches to coping with load How to maintain good performance when load parameters increase? Note: likely you’ll need to rethink your architecture on every order of magnitude load increase (maybe more often)

Scaling up vs scaling out Distributing load across multiple machines: shared-nothing architecture Elastic systems: can automatically add resources when detect load increase

Maintainability

Main cost of software- ongoing maintenance

3 principles

Operability
Simpliciy
Evolvability

Operability: making life easy for Opeartions Operations teams need to

Monitor health of system + be able to quickly restore
Tracking down cause of problems
Keep software and platforms up to date
Keep tabs on how different systema ffect each other
Anticipate problems
Establish good practices/tools for deployment, etc.
Complex maintenance tasks
Security of system as config changes are made
Define processes to make opeartions predictable
Preserve knowledge Data systems can make routine tasks easy
Provide visibility with good monitoring
Good support for automatiion + integration with standard tools
Avoid dependency on individual machines
Provide good documentation and easy-to-understand operational model
Self-heal where appropriate

Simplicity: managing complexity Large systems get complex!

Symptoms of complexity

Large state space
Tight coupling of modules
Tangled dependencies
Naming / terminology
Et cetera

Remove accidental complexity- don’t necessarily decrease its functionality

Accidental complexity: if not inherent in the problem that the software solves, but insteadi s from the implementation
How? Abstraction- hide the implementation details

Evolvability: making change easy System requirements will very likely change over time as you learn new facts, etc. Agility on data system level: evolvability

In sum

Appication has various requirements

Functional: what it can do
Nonfunctional: compliance, scalability, reliability, etc. Reliability: systems work correctly even with faults Scalability: strategies for keeping performance good Mainainability: making life better for engineering + ops teams

Pablo's Reference Notes

Explorer

Foundations

Foundations

Thinking about data systems

Reliability

Scalability

Maintainability

In sum

Graph View

Table of Contents

Backlinks