Core issue: We need more resources / faster programs for modern applications

Traditional approach was sequential (think back to the original von Neumann design)

  • Rely on advances in hardware: increased clock speeds
  • Get better microprocessors

But clock speeds and single-core performance are no longer increasing at the rate they were, which slows down the whole industry

The solution is parallel programs, which led to the concurrency revolution

Heterogeneous Parallel Computing

Since 2003, two main trajectories for microprocessors:

  • multicore: maintain the execution speed of sequential programs while moving to multiple cores
  • many-thread: focus on execution throughput of parallel applications (e.g. GPUs)
    • Much higher peak performance (throughput)

Why the performance gap?

  • GPUs are oriented toward throughput
  • CPUs toward latency (but reducing latency consumes chip area and power that could otherwise be spent on more execution units and memory access channels)

Note: reducing latency is much more expensive than increasing throughput.

Why did GPUs win?

  • Large existing user base (gaming)
  • CUDA

Why do we want more speed or parallelism?

  • Once we get used to higher-quality experiences, it's hard to go back to older tech (think HDTV)
  • Better UIs
  • Machine learning + research

Speeding up real applications

Speedup: ratio of the time taken to execute the workload on the original system over the time taken on the new system

  • e.g. 200 seconds on the original vs. 10 seconds on the new system is a 20x speedup
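
In symbols (standard definition; T is wall-clock time, notation mine):

    \[
    \text{speedup} \;=\; \frac{T_{\text{original}}}{T_{\text{new}}} \;=\; \frac{200\ \text{s}}{10\ \text{s}} \;=\; 20\times
    \]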

How to know what’s achievable?

  • Percentage of execution time spent in parallelizable code sets the upper limit (i.e. only the parallel “flesh” of the peach speeds up; the sequential “pit” does not; see the sketch after this list)
  • How fast data can be accessed from / written to the memory
  • How well suited the application was to the CPU in the first place (a strong sequential baseline leaves less headroom)
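
The first bullet is Amdahl's law. A minimal sketch, assuming a fraction p of the runtime is parallelizable and that portion is accelerated s-fold:

    \[
    \text{speedup} \;=\; \frac{1}{(1 - p) + p/s} \;\le\; \frac{1}{1 - p}
    \]

e.g. even if 95% of a program parallelizes perfectly (p = 0.95), the overall speedup can never exceed 1/0.05 = 20x.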

Why is Parallel Programming hard?

(1) Designing parallel algorithms with the same algorithmic complexity as the sequential algorithm is hard

  • Non-intuitive
  • Potentially requires redundant work

(2) Speed of applications can be limited by memory access latency and/or throughput

  • “Memory-bound applications”: their memory access patterns can be optimized (sketch below)
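
A hedged CUDA sketch of what optimizing memory access can mean (my own illustration, not from these notes): adjacent threads reading adjacent addresses let the hardware coalesce a warp's accesses into a few wide transactions, while strided access wastes bandwidth.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Coalesced: thread i touches element i, so a warp's 32 accesses hit
    // consecutive addresses and combine into a few wide memory transactions.
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: neighboring threads touch addresses far apart, so each access
    // needs its own transaction and effective bandwidth drops sharply.
    __global__ void copyStrided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = (i * stride) % n;  // scatter the accesses across memory
        if (i < n) out[j] = in[j];
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));   // unified memory for brevity
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; i++) in[i] = (float)i;

        // Same logical work, very different memory traffic; a profiler
        // (e.g. Nsight Compute) would show the strided kernel reaching only
        // a fraction of the coalesced kernel's bandwidth.
        copyCoalesced<<<(n + 255) / 256, 256>>>(in, out, n);
        copyStrided<<<(n + 255) / 256, 256>>>(in, out, n, 32);
        cudaDeviceSynchronize();

        printf("out[1] = %.0f\n", out[1]);
        cudaFree(in); cudaFree(out);
        return 0;
    }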

(3) Execution speed of parallel programs is more sensitive to input data characteristics than for sequential programs.

  • Can use regularization techniques

(4) Some applications require threads to collaborate with each other

  • Require using synchronization operations (e.g. barriers, atomic operations); minimal sketch below
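
A minimal CUDA sketch of thread collaboration (my own illustration): threads in a block cooperatively sum an array through shared memory, and the __syncthreads() barrier keeps every round of the reduction in step.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each 256-thread block cooperatively sums its slice of the input.
    __global__ void blockSum(const float *in, float *out, int n) {
        __shared__ float buf[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();  // all loads finish before anyone reads a neighbor

        // Tree reduction: halve the number of active threads each round.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                buf[threadIdx.x] += buf[threadIdx.x + stride];
            __syncthreads();  // wait for this round's partial sums
        }
        if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, (n / 256) * sizeof(float));
        for (int i = 0; i < n; i++) in[i] = 1.0f;

        blockSum<<<n / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();

        float total = 0.0f;  // finish the last step on the CPU for simplicity
        for (int i = 0; i < n / 256; i++) total += out[i];
        printf("sum = %.0f (expected %d)\n", total, n);
        cudaFree(in); cudaFree(out);
        return 0;
    }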

Most of these challenges have been addressed by researchers

Overarching goals / uses

(1) Goal: program massively parallel processors to achieve high performance

  • Intuition
  • Knowledge of hardware

(2) Teach parallel programming for correct functionality and reliability

  • Necessary if you want to support users

(3) Scalability across future hardware generations

  • Have programs that can scale up to the performance level of new generations of machines

Architecture tradeoffs

CPU: Latency-Oriented Design

  • A few powerful ALUs (Arithmetic Logic Units):
    • Capable of performing complex operations.
    • Designed to reduce the latency of each operation.
  • Large caches:
    • To mitigate the latency of memory access by keeping data closer to the processing units.
    • Caches are optimized for quick access to data, reducing the time to retrieve information.
  • Sophisticated control:
    • Includes mechanisms like branch prediction to anticipate the directions of branches (if/else conditions) and prepare execution paths.
    • Employs data forwarding to mitigate data hazards (delays caused by data not being ready when needed).

GPU: Throughput-Oriented Design

  • Many small ALUs:
    • Focused on performing many operations in parallel.
    • Trades off the speed of individual operations for the ability to do many simultaneously, prioritizing throughput over latency.
  • Small caches:
    • Less cache per ALU compared to CPUs.
    • More silicon area is dedicated to ALUs rather than cache.
  • Simple control:
    • Less complex control logic than CPUs.
    • More of the GPU’s silicon area is allocated for computation rather than sophisticated control logic.
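
To make the contrast concrete, a minimal CUDA sketch (my own example, not from these notes): the CPU version is one latency-optimized loop, while the GPU version launches thousands of lightweight threads, each doing one trivial operation, which is exactly the workload a throughput-oriented design rewards.

    #include <cstdio>
    #include <cuda_runtime.h>

    // GPU version: one thread per element. Each thread is individually slow,
    // but tens of thousands execute at once across the many small ALUs.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // CPU version: a single latency-optimized core races through this loop.
    void vecAddCPU(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));  // unified memory for brevity
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // 4096 blocks of 256 threads
        cudaDeviceSynchronize();

        printf("c[0] = %.1f (expected 3.0)\n", c[0]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }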