So far we’ve talked about requests/queries and responses/results (e.g. HTTP-REST-based APIs), but this isn’t the only way to build data systems
3 types of systems
- Services (online)
    - A service waits for a request, then handles it and sends back a response as quickly as possible
    - Metrics
        - Response time (primary)
        - Availability (also important)
- Batch Processing Systems (offline)
    - Take a large amount of input data, run a job, and produce output
    - Metrics
        - Throughput (primary)
- Stream Processing Systems (near real-time)
    - Somewhere between online and batch processing
    - Operate on events shortly after they happen
Why do we care about batch processing? It’s extremely important for building scalable applications (e.g. MapReduce)
Batch Processing with Unix tools
Take a simple example: a web server that appends a line to a log file for each request (with info on who the user is, what they requested, which web browser they used, etc.)
Simple Log Analysis
Usually you’d use existing tools, but you can also build your own analysis out of basic Unix tools:
- Read the log file
- Split each line on whitespace and print the 7th field (the requested URL)
- Sort the requested URLs
- Collapse repeated lines, counting how many times each URL occurs
- Sort by the number of times each URL was requested
- Output the first 5 lines
Chain of commands vs. custom program: the alternative would be to write your own quick program (e.g. in Ruby or Python)
Picking between the two methods is a matter of taste, but there are some considerations
- Sorting vs. in-memory aggregation
    - Hash table approach: if the data is small enough, all distinct URLs easily fit in memory (see the sketch below)
    - Sorting approach: makes more efficient use of memory when the data doesn’t fit, since sorted runs can be spilled to disk and merged
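A minimal sketch of the in-memory hash-table approach in Python (the file name access.log and the assumption that the URL is the 7th whitespace-separated field are illustrative, not from the notes):

```python
# In-memory aggregation: count URL occurrences in a hash table (Counter),
# then take the top 5. Works as long as the distinct URLs fit in memory.
from collections import Counter

counts = Counter()
with open("access.log") as f:       # hypothetical log file name
    for line in f:
        fields = line.split()
        if len(fields) > 6:
            counts[fields[6]] += 1  # 7th field: the requested URL

for url, n in counts.most_common(5):
    print(n, url)
```

If the distinct URLs don’t fit in memory, the sorting approach (sort, then scan for runs of the same URL) is the one that scales, which is essentially what the Unix pipeline does.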
Unix Philosophy
Unix Philosophy: a set of design principles popular among developers/users of Unix
- Make each program do one thing well
- Expect the output of every program to become the input to another
- Design software to be tried early on
Basically: automation, rapid prototyping, etc.
Uniform interface: if the output of one program is to become the input to another, they need a compatible interface
In Unix, that interface is a file (really, a file descriptor)
- By convention, treated as a uniform interface of ASCII text
Separation of logic and wiring
Another characteristic feature of Unix tools is their use of stdin and stdout
- You can use pipes to wire them together however you like
- To participate in data processing pipelines, a program just needs to read from stdin and write to stdout (see the sketch below)
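A minimal sketch of such a tool in Python (the script name and the assumption that the URL is the 7th field mirror the log example above):

```python
# extract_url.py (hypothetical name): reads log lines from stdin and writes
# the requested URL (7th whitespace-separated field) to stdout, so it can be
# composed with other tools via pipes.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:
        sys.stdout.write(fields[6] + "\n")
```

Because it only talks to stdin and stdout, it can be dropped into the middle of a pipeline without any other changes.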
Transparency and experimentation: it’s also quite easy to see what’s going on with Unix tools
- Input files are treated as immutable
    - You can run the commands as often as you want
- You can end the pipeline at any point and pipe the output into less
    - Makes debugging very easy
- You can write output to a file and use that file as input to a later stage
    - Splits the pipeline into chunks
MapReduce and Distributed Filesystems
MapReduce: a programming framework for writing code to process large datasets in a distributed filesystem
- Like Unix tools, but distributed across many machines
- A blunt but effective tool
- A MapReduce job is like a single Unix process: it takes one or more inputs and produces one or more outputs
- Instead of stdin and stdout, it reads and writes files on a distributed filesystem (e.g. Hadoop’s HDFS)
HDFS (Hadoop Distributed File System):
- Based on the shared-nothing principle
- A daemon process on each machine exposes a network service that allows other nodes to access files stored on that machine
- A central server called the NameNode keeps track of which file blocks are stored on which machine
- Conceptually, HDFS creates one big filesystem that can use the space of all the machines running the daemon
- Data is replicated on multiple machines
MapReduce Job Execution
In order to create a MapReduce job, you need two callback functions
- Mapper: extracts a key and value from each input record
    - Called once for every input record
    - May generate any number of key-value pairs
- Reducer: takes the key-value pairs produced by the mappers, collects those with the same key, and iterates over that collection of values to produce output (see the toy sketch below)
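A toy, single-process Python sketch of these two callbacks (not the real Hadoop API; the URL-counting use case is carried over from the log example above):

```python
from collections import defaultdict

def mapper(record):
    # called once per input record; may emit any number of key-value pairs
    # (assumes each log line has at least 7 whitespace-separated fields)
    yield record.split()[6], 1

def reducer(key, values):
    # called once per distinct key with all values collected for that key
    yield key, sum(values)

def run_job(records):
    groups = defaultdict(list)
    for record in records:                    # map phase
        for key, value in mapper(record):
            groups[key].append(value)
    for key in sorted(groups):                # sort/shuffle by key
        yield from reducer(key, groups[key])  # reduce phase

# example usage: counts = dict(run_job(open("access.log")))
```

In a real MapReduce framework the map and reduce phases run in parallel across many machines, and the sort/shuffle step moves data over the network between them.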
Distributed Execution of MapReduce TODO
MapReduce Workflows TODO
Reduce-Side Joins and Grouping
How exactly are joins implemented?
In many datasets, associations are common: foreign keys (relational model), document references (document model), or edges (graph model)
Joins are needed when you want to access records on both sides of the association
MapReduce performs a full table scan
- When MapReduce is given a set of files as input, it reads the entire content of all of them
Example: analysis of user activity events TODO
Sort-merge joins TODO
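A minimal single-process sketch of the reduce-side sort-merge join idea, assuming hypothetical user-profile and activity-event records that share a user_id (the names map_user, map_event, and reduce_join are illustrative, not a real API):

```python
def map_user(record):
    # record assumed to be a dict like {"user_id": ..., "date_of_birth": ...}
    yield record["user_id"], ("user", record)

def map_event(record):
    # record assumed to be a dict like {"user_id": ..., "url": ...}
    yield record["user_id"], ("event", record)

def reduce_join(user_id, tagged_records):
    # all records with the same user_id arrive at the same reducer call
    users = [r for tag, r in tagged_records if tag == "user"]
    events = [r for tag, r in tagged_records if tag == "event"]
    for user in users:            # usually a single profile per user_id
        for event in events:
            yield user_id, {**user, **event}
```

Because both mappers key their output by user_id, the sort/shuffle step brings all records for the same user to the same reducer call, so the join needs no random-access lookups.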
Bringing related data together in the same place TODO
GROUP BY TODO
Handling Skew TODO
Map-Side Joins
TODO
Output of batch workflows
TODO
Comparing Hadoop to Distributed Databases
Hadoop is somewhat like a distributed version of Unix, where HDFS is the filesystem
- Very similar to massively parallel processing (MPP) databases
The biggest difference:
- MPP databases focus on parallel execution of analytic SQL queries on a cluster of machines
- The combination of MapReduce and a distributed filesystem is more like a general-purpose OS that can run arbitrary programs
Diversity of storage
- Hadoop: opened up the possibility of indiscriminately dumping data into HDFS and worrying about further processing later
- MPP databases: needed careful upfront modeling of the data and query patterns
This is why Hadoop is often used for ETL processes
- Data from transaction processing systems is dumped into the distributed filesystem
- Then MapReduce jobs are written to clean up the data, transform it into relational form, and import it into an MPP data warehouse
Diversity of processing models
- MPP: monolithic, tightly integrated pieces of software
- MapReduce: flexibility to run your own code over large datasets
Designing for Frequent Faults
On failure of a task
- MPP: aborts the entire query
- MapReduce: can tolerate the failure of a map or reduce task by retrying work at the granularity of an individual task
    - Also eager to write data to disk
    - More appropriate for larger jobs
Beyond MapReduce
MapReduce is just one among many programming models for distributed systems
We talked so much about MapReduce because it is a useful learning tool: a clear abstraction on top of a distributed filesystem
That said, MapReduce is quite difficult to use directly: you would need to implement, e.g., any join algorithms from scratch
Now, let’s look at some alternatives for batch processing
Materialization of Intermediate state
TODO
Graphs and Iterative Processing
TODO
High-Level APIs and Languages
Attention has turned to
- Improving programming model
- Improving the efficiency of processing
- Etc.
APIs and higher-level languages have become popular since MapReduce programming is laborious
Dataflow APIs
- Relational-style building blocks to express computations
- Allow interactive use
- Both improve the human interface and job execution efficiency
Move toward declarative query languages
Specifying joins as relational operators vs. spelling out the code that performs the join
- The framework can analyze the properties of the joined inputs and decide which join algorithm to use (see the sketch below)
- Can make a big difference TODO
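For instance, in a dataflow API with relational-style operators (PySpark is used here purely as one example of such an API, assuming a local Spark installation; the framework, not our code, picks the join strategy):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("declarative-join").getOrCreate()

users = spark.createDataFrame(
    [(1, "alice"), (2, "bob")], ["user_id", "name"])
events = spark.createDataFrame(
    [(1, "/home"), (1, "/cart"), (2, "/home")], ["user_id", "url"])

# We declare *what* to join on; the optimizer decides *how* (e.g. broadcast
# hash join vs. sort-merge join) based on properties of the inputs.
joined = users.join(events, "user_id")
joined.groupBy("name").count().show()
```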
Summary
Unix tools and the design philosophy behind them
Problems that distributed batch processing frameworks need to solve
- Partitioning
- Fault Tolerance
Join algorithms for MapReduce
- Sort-merge joins
- Broadcast hash joins
- Partitioned hash joins
Restricted programming model:
- Callback functions are assumed to be stateless
- Allows failed tasks to be retried safely
Batch processing job
- Input data is bounded
- Reads some input data and produces output data without modifying the input data