Data parallelism: computation on different parts of a dataset done independently/in parallel
Task parallelism: task decomposition (e.g. vector addition and matrix-vector multiplication done independently)
Slow programs: the issue is usually too much data to process
CUDA C Structure
Terms
- Host: CPU
- Device: GPU
- Host code: CPU serial code
- Grid: all threads launched on a device to execute the kernel
- Block: blocks in a grid are all the same size; each contains threads
- Host memory: a `_h` suffix indicates an object in host memory
- Global memory: a `_d` suffix indicates an object in device global memory
- `__host__` keyword: function is a CUDA host function (only executed/called on the host)
- `__device__` keyword: function is a CUDA device function (can be called only from a kernel or another device function)
- `__global__` keyword: function is a CUDA C kernel function
- Configuration parameters: given between `<<<` and `>>>`
    - First: number of blocks in the grid
    - Second: number of threads in each block
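A minimal sketch of how the three qualifiers and the configuration parameters fit together (function names `doubleIt`, `addOneKernel`, and `launch` are made up for illustration):

```cuda
// __device__ function: callable only from a kernel or another device function
__device__ float doubleIt(float x) { return 2.0f * x; }

// __global__ function: a kernel, launched from host code
__global__ void addOneKernel(float *data) {
    data[threadIdx.x] = doubleIt(data[threadIdx.x]) + 1.0f;
}

// __host__ function: ordinary CPU code (the qualifier is the default and optional)
__host__ void launch(float *d_data) {
    // configuration parameters: <<<blocks in grid, threads per block>>>
    addOneKernel<<<1, 256>>>(d_data);
}
```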
Functions
`cudaMalloc()`
- Called in host code to allocate an object in device global memory
- Parameters
    - Address of a pointer to the allocated object
    - Size of the allocated object in bytes
`cudaFree()`
- Frees an object from device global memory
- Parameter
    - Pointer to the freed object
`cudaMemcpy()`
- Memory data transfer
- Parameters
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type/direction of transfer (host/device → host/device)
- `cudaMemcpyHostToDevice` and `cudaMemcpyDeviceToHost` are predefined constants of the CUDA programming environment
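A sketch of the three calls used together on one array (`n` and the names `A_h`/`A_d` are illustrative, following the suffix convention above):

```cuda
int n = 1024;
size_t size = n * sizeof(float);
float *A_h = (float *)malloc(size);   // host memory (_h suffix)
float *A_d;                           // device global memory (_d suffix)

cudaMalloc((void **)&A_d, size);      // address of pointer, size in bytes
cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);  // dest, src, bytes, direction
// ... launch a kernel that operates on A_d ...
cudaMemcpy(A_h, A_d, size, cudaMemcpyDeviceToHost);  // copy result back
cudaFree(A_d);                        // pointer to the freed object
free(A_h);
```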
Within Kernel
`blockDim.x` (if 1D): total number of threads in each block
`threadIdx.x` (if 1D): index of the current thread within its block
`blockIdx.x` (if 1D): coordinate of the current block within the grid
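These three variables are typically combined to compute each thread's global index; a 1D sketch (kernel name is illustrative):

```cuda
__global__ void scaleKernel(float *x, int n) {
    // global index = block coordinate * block size + thread index within block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // guard: the grid may contain more threads than elements
        x[i] = 2.0f * x[i];
    }
}
```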
General structure
- Allocate GPU memory
- Copy data to GPU memory
- Perform computation on GPU
- Copy data from GPU memory
- Deallocate GPU memory
How to compile C kernels?
- NVCC
- C with CUDA extensions is split into
- Host code (straight ANSI C)
- Device code (PTX), executed on a CUDA-capable GPU device
Example run through
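A sketch of a full run-through (vector addition; names are illustrative), following the five steps of the general structure above:

```cuda
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void vecAdd(const float *A_h, const float *B_h, float *C_h, int n) {
    size_t size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Allocate GPU memory
    cudaMalloc((void **)&A_d, size);
    cudaMalloc((void **)&B_d, size);
    cudaMalloc((void **)&C_d, size);

    // 2. Copy data to GPU memory
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // 3. Perform computation on GPU: ceil(n/256) blocks of 256 threads
    vecAddKernel<<<(n + 255) / 256, 256>>>(A_d, B_d, C_d, n);

    // 4. Copy result from GPU memory
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);

    // 5. Deallocate GPU memory
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}
```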
Multi-dimensional grids and data
We’ve talked about one-dimensional grids of threads, but what about multidimensional arrays of data?
Remember the built-in block and thread variables.
In general, a grid is a 3D array of blocks, and each block is a 3D array of threads.
As an example, consider creating a 1D grid of 32 blocks with 128 threads each.
Dimensions are a factor of
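The 32-block, 128-thread launch mentioned above can be written with `dim3` configuration variables (unused dimensions are set to 1; `someKernel` is a placeholder name):

```cuda
dim3 dimGrid(32, 1, 1);    // 32 blocks in the grid (1D)
dim3 dimBlock(128, 1, 1);  // 128 threads in each block (1D)
someKernel<<<dimGrid, dimBlock>>>(/* kernel arguments */);  // same as <<<32, 128>>>
```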