Core issue: We need more resources / faster programs for modern applications
Traditional approach was sequential (think back to the original von Neumann design)
- Rely on advances in hardware: increased clock speed
- Get better microprocessors
But single-core performance is no longer increasing at the rate it once did, which slows down the whole industry
The solution is parallel programs, which led to the concurrency revolution
Heterogeneous Parallel Computing
Since 2003, two main trajectories for microprocessors:
- multicore: maintain sequential program execution but move to multiple cores
- many-thread: focus on execution throughput (e.g. GPUs)
- Much higher peak performance
Why the performance gap?
- GPUs oriented on throughput
- CPUs on latency (but reducing latency consumes chip resources that could otherwise be spent on more execution units / memory access hardware)
Note: reducing latency is much more expensive than increasing throughput.
Why did GPUs win?
- Large existing user base (gaming)
- CUDA
Why do we want more speed or parallelism?
- As we get higher-quality things, it's hard to go back to older tech (think HDTV)
- Better UIs
- Machine learning + research
Speeding up real applications
Speedup: ratio of the time to execute on the original system A to the time to execute on the improved system B
- e.g. 200 seconds on the original vs. 20 seconds on the new system would be a 10x speedup
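Written as a formula (with $T_A$ the original execution time and $T_B$ the improved one):

$$\text{speedup} = \frac{T_A}{T_B}, \quad \text{e.g. } \frac{200\ \text{s}}{20\ \text{s}} = 10\times$$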
How to know what’s achievable?
- Percentage of execution time spent on parallelizable code sets the upper limit (i.e. only the parallelizable "peach flesh" can be sped up, not the sequential pit; see the formula after this list)
- How fast data can be accessed from / written to the memory
- How well-suited the application was to the CPU originally (a strong sequential baseline leaves less room for speedup)
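The upper limit from the first bullet is Amdahl's Law. With $p$ the fraction of execution time that is parallelizable and $s$ the speedup achieved on that portion:

$$\text{speedup} = \frac{1}{(1 - p) + \frac{p}{s}} \le \frac{1}{1 - p}$$

e.g. with $p = 0.99$, the overall speedup can never exceed 100x, no matter how much parallel hardware is thrown at the problem.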
Why is Parallel Programming hard?
(1) Designing parallel algorithms with the same level of complexity as the sequential algorithm is hard
- Non-intuitive
- Potentially redundant work
(2) Speed of applications can be limited by memory access latency and/or throughput
- "Memory-bound" applications; memory accesses can be optimized
(3) Execution speed of parallel programs is more sensitive to input data characteristics than for sequential programs.
- Can use regularization techniques
(4) Some applications require threads to collaborate with each other (see the sketch after this list)
- Require using synchronization operations
Most of these have been addressed by researchers
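A minimal CUDA sketch of point (4), collaboration via synchronization: a block-wide sum reduction in which threads must call __syncthreads() before reading each other's partial results. The kernel name and the 256-thread block size are illustrative assumptions, not from the source.

```cuda
// Block-wide sum reduction: threads collaborate through shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float partial[256];          // assumes blockDim.x == 256 (a power of two)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // all loads must finish before anyone reads

    // Tree reduction: each step halves the number of active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();                    // wait for this step before the next
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];       // one partial sum per block
}
```

Without the barrier, a thread could read a neighbor's slot before it was written: exactly the kind of data-dependent bug that makes parallel programming hard.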
Overarching goals / uses
(1) Goal: program massively parallel processors to achieve high performance
- Intuition
- Knowledge of hardware
(2) Teach parallel programming for correct functionality and reliability
- Necessary if you want to support users
(3) Scalability across future hardware generations
- Have programs that can scale up to level of performance of new generations of machines
Architecture tradeoffs
CPU: Latency-Oriented Design
- A few powerful ALUs (Arithmetic Logic Units):
- Capable of performing complex operations.
- Designed to reduce the latency of each operation.
- Large caches:
- To mitigate the latency of memory access by keeping data closer to the processing units.
- Caches are optimized for quick access to data, reducing the time to retrieve information.
- Sophisticated control:
- Includes mechanisms like branch prediction to anticipate the directions of branches (if/else conditions) and prepare execution paths.
- Employs data forwarding to mitigate data hazards (delays caused by data not being ready when needed).
GPU: Throughput-Oriented Design
- Many small ALUs:
- Focused on performing many operations in parallel.
- Trades off the speed of individual operations for the ability to do many simultaneously, prioritizing throughput over latency.
- Small caches:
- Less cache per ALU compared to CPUs.
- More silicon area is dedicated to ALUs rather than cache.
- Simple control:
- Less complex control logic than CPUs.
- More of the GPU’s silicon area is allocated for computation rather than sophisticated control logic.
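To make the throughput orientation concrete, a minimal CUDA sketch (names and launch sizes are illustrative assumptions): each thread performs one trivial operation, and the GPU tolerates memory latency by keeping thousands of such threads in flight rather than by making any single thread fast.

```cuda
// Each thread handles one element; throughput comes from the sheer thread count.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                              // grid may be padded past n
        c[i] = a[i] + b[i];
}

// Launch sketch: ceil(n / 256) blocks of 256 threads each.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```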