Modern GPU architecture

SM: streaming multiprocessor

  • A CUDA-capable GPU is composed of an array of SMs
  • Composition
    • Several CUDA Cores
    • Shared control logic
    • Shared memory
  • Block-level granularity: an SM may host multiple blocks, but each block lives on exactly one SM
    • Guarantees that threads in the same block are scheduled simultaneously on the same SM


  • Barrier synchronization: Threads in the same block can synchronize with __syncthreads()
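    • Holds each thread at the barrier until every thread in the block has reached that location
  • __syncthreads() MUST be executed by all threads in a block; if only some threads reach it (e.g. it sits inside a branch), the kernel can hang or misbehave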
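
A minimal sketch of barrier synchronization, assuming a block size of 64 and an input whose length is a multiple of that size. The names `kBlockSize`, `reverseTile`, and `reverseTilesOnGpu` are made up for illustration. Each block reverses its own tile through shared memory; the barrier is needed because every thread reads an element that a different thread wrote.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 64;  // hypothetical block size for this sketch

// Each block reverses its own kBlockSize-element tile of `data` in place.
__global__ void reverseTile(int *data) {
    __shared__ int tile[kBlockSize];            // shared by all threads of one block
    int t = threadIdx.x;
    int base = blockIdx.x * kBlockSize;

    tile[t] = data[base + t];                   // phase 1: each thread stores one element
    __syncthreads();                            // barrier: every thread reaches it, no branch around it
    data[base + t] = tile[kBlockSize - 1 - t];  // phase 2: read an element another thread stored
}

// Hypothetical host helper. Returns an empty vector if the input is empty or
// its length is not a multiple of kBlockSize. Error checking omitted for brevity.
std::vector<int> reverseTilesOnGpu(const std::vector<int> &input) {
    if (input.empty() || input.size() % kBlockSize != 0) return {};
    size_t bytes = input.size() * sizeof(int);
    int *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, input.data(), bytes, cudaMemcpyHostToDevice);
    reverseTile<<<static_cast<unsigned>(input.size() / kBlockSize), kBlockSize>>>(d);
    std::vector<int> out(input.size());
    cudaMemcpy(out.data(), d, bytes, cudaMemcpyDeviceToHost);  // blocking copy, runs after the kernel
    cudaFree(d);
    return out;
}
```

Without the `__syncthreads()` call, a thread could read `tile[kBlockSize - 1 - t]` before the owning thread has written it. Note that the barrier only covers threads of the same block; tiles belonging to different blocks never interact here.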

Transparent scalability: blocks can execute in any order relative to each other

  • Makes programs architecture agnostic
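
A small sketch of what this looks like in practice, assuming a plain element-wise vector add (the names `vecAdd` and `addOnGpu` and the choice of 256 threads per block are illustrative). No block depends on the result of another block, so the hardware is free to run the blocks in any order and on however many SMs the device has; the code never mentions the SM count.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds one pair of elements. No block reads another block's output,
// so the result is the same regardless of the order in which blocks run.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may have spare threads
}

// Hypothetical host helper. Returns an empty vector on empty or mismatched input.
// Error checking omitted for brevity.
std::vector<float> addOnGpu(const std::vector<float> &a, const std::vector<float> &b) {
    if (a.empty() || a.size() != b.size()) return {};
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // enough blocks to cover n elements
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    std::vector<float> c(n);
    cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost);  // blocking copy, runs after the kernel
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return c;
}
```

The same launch should run unchanged on a GPU with a few SMs (blocks execute a few at a time) or many SMs (more blocks in flight at once); only the running time differs.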

Timing of threads within each block

  • Assume that threads can execute in any order (unless you use barrier synchronization, which can sync them up)

Warp: the unit of thread scheduling; the group of threads that are executed at the same time

  • Typically 32 threads
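
A short sketch of how a block maps onto warps, assuming a 1-D block and a limit of 1024 threads per block (the helper names `warpsPerBlock`, `recordWarpIds`, and `warpIdsForBlock` are hypothetical). The device-side built-in `warpSize` gives the warp size, so the code does not hard-code 32.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Number of warps needed for a block; rounds up, so the last warp may be partly idle.
__host__ __device__ inline int warpsPerBlock(int threadsPerBlock, int warpSz) {
    return (threadsPerBlock + warpSz - 1) / warpSz;
}

// Each thread of a 1-D block records the index of the warp it belongs to.
__global__ void recordWarpIds(int *warpId) {
    warpId[threadIdx.x] = threadIdx.x / warpSize;  // warpSize is a CUDA built-in variable
}

// Hypothetical host helper: launches ONE 1-D block and returns each thread's warp index.
// Assumes at most 1024 threads per block. Error checking omitted for brevity.
std::vector<int> warpIdsForBlock(int threadsPerBlock) {
    if (threadsPerBlock <= 0 || threadsPerBlock > 1024) return {};
    size_t bytes = threadsPerBlock * sizeof(int);
    int *d = nullptr;
    cudaMalloc(&d, bytes);
    recordWarpIds<<<1, threadsPerBlock>>>(d);
    std::vector<int> ids(threadsPerBlock);
    cudaMemcpy(ids.data(), d, bytes, cudaMemcpyDeviceToHost);  // blocking copy, runs after the kernel
    cudaFree(d);
    return ids;
}
```

With a warp size of 32, a block of 100 threads needs `warpsPerBlock(100, 32)` = 4 warps, and the fourth warp has only 4 active threads. This is one reason block sizes are usually chosen as multiples of the warp size.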

SIMD: single instruction, multiple data model

  • SMs follow this model
  • At any instant in time, one instruction is fetched and executed for all threads in the warp

Control divergence