
# GPU Architecture

The general-purpose CPU is designed for single-threaded code and optimised for low latency. The GPU sacrifices the latency of individual operations in exchange for much higher overall throughput.

GPUs target massive data parallelism, needed for computing tasks such as vector processing and the multiply-accumulate (MAC) operations at the heart of matrix arithmetic. The underlying execution style is SIMD (Single Instruction, Multiple Data): one instruction is applied to many data elements at once.

# CUDA

# Architecture

# Programming Model

CUDA uses a heterogeneous programming model consisting of a host (the CPU and its memory) and a device (the GPU and its memory). The host calls the device to run the program.
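A minimal sketch of this model (the `square` kernel and its launch parameters are illustrative, not from the original notes): the host allocates device memory, calls the device by launching a kernel, and copies the result back.

```c
#include <stdio.h>

//device code: runs on the GPU when launched by the host
__global__ void square(int *data) {
	int i = threadIdx.x;
	data[i] = data[i] * data[i];
}

int main(void) {
	int h_data[4] = {1, 2, 3, 4}; //host memory
	int *d_data;                  //device memory

	cudaMalloc((void **)&d_data, sizeof(h_data));
	cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

	//host calls the device: 1 block of 4 threads
	square<<<1, 4>>>(d_data);

	cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
	cudaFree(d_data);

	for (int i = 0; i < 4; i++)
		printf("%d ", h_data[i]);
	return 0;
}
```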

# Programming Language

The source code is split into host components (compiled by standard compilers such as gcc) and device components (compiled by nvcc).

# Kernel

# Threads and Thread Blocks

We can access important properties of the kernel through built-in variables: `threadIdx` (the thread's index within its block), `blockIdx` (the block's index within the grid), `blockDim` (the number of threads per block) and `gridDim` (the number of blocks in the grid).
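A common pattern built on these variables (the kernel name here is illustrative) is combining the block and thread indices into a single global index:

```c
__global__ void example_kernel(int *data) {
	//each block covers blockDim.x elements, so offset the thread
	//index by the block's starting position
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	data[idx] = idx;
}
```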

# Synchronisation
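The key block-level primitive, used in the dot-product example below, is `__syncthreads()`: a barrier at which every thread in a block waits until all threads in the block have arrived. A minimal sketch, assuming a hypothetical `reverse_in_block` kernel launched as one block of 64 threads:

```c
__global__ void reverse_in_block(int *data) {
	__shared__ int tmp[64];
	int i = threadIdx.x;

	tmp[i] = data[i];
	//barrier: no thread reads tmp until every thread has written its slot
	__syncthreads();
	data[i] = tmp[63 - i];
}

//launch: reverse_in_block<<<1,64>>>(d_data);
```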

# Memory management

Launching a kernel with a single thread, e.g. `vector_add_cu<<<1,1>>>(d_c, d_a, d_b);`, does not take advantage of GPU parallelism in the CUDA cores. We can instead create 1 block with 3 threads to achieve parallelism: `vector_add_cu<<<1,3>>>(d_c, d_a, d_b);`. Each thread then uses its `threadIdx` to access its own element of memory:
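A minimal sketch of such a kernel, assuming the signature implied by the launch above:

```c
__global__ void vector_add_cu(int *d_c, int *d_a, int *d_b) {
	//each thread adds one element, selected by its thread index
	int i = threadIdx.x;
	d_c[i] = d_a[i] + d_b[i];
}
```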

> [!note] Threads vs Blocks
> The example can also be achieved using 3 blocks of 1 thread each. However, parallel threads within a block have the advantage of being able to communicate and synchronise with each other directly, because they share hardware. Sharing memory between blocks would require global memory access.
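For comparison, a sketch of the 3-blocks-of-1-thread variant (the kernel name is hypothetical), using `blockIdx` in place of `threadIdx`:

```c
__global__ void vector_add_blocks_cu(int *d_c, int *d_a, int *d_b) {
	//each block handles one element, selected by its block index
	int i = blockIdx.x;
	d_c[i] = d_a[i] + d_b[i];
}

//launch: 3 blocks of 1 thread each
//vector_add_blocks_cu<<<3,1>>>(d_c, d_a, d_b);
```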

# Example

```c
//initialise 1 block with 3 threads. We cannot use 3 blocks for this
//implementation, as blocks would not be able to share the local variable memory
dot_prod_cu<<<1,3>>>(d_c, d_a, d_b);

__global__ void dot_prod_cu(int *d_c, int *d_a, int *d_b){
	//use __shared__ to allow threads in the block to share data
	__shared__ int tmp[3];

	int i = threadIdx.x;
	tmp[i] = d_a[i] * d_b[i];

	//wait for all threads to complete, to prevent entering the if block
	//before every partial product has been written
	__syncthreads();

	if (i == 0){
		//thread 0 reduces the partial products into the final result
		int sum = 0;
		for (int j = 0; j < 3; j++)
			sum = sum + tmp[j];
		*d_c = sum;
	}
}
```

# Internal Operations

# Warps
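Threads in a block are executed by the hardware in groups of 32 called warps; the warp is the GPU's basic unit of scheduling.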

# SIMT

Warps enable a unique architecture called Single Instruction, Multiple Thread (SIMT): a warp executes one common instruction at a time across all of its threads.

Within a single thread, instructions are pipelined to exploit instruction-level parallelism, but they are issued in order: there is no branch prediction and no speculative execution.

# Thread Divergence

Branch statements will cause some threads in a warp to waste clock cycles. Because the threads in a warp must all execute the same instruction, the two sides of a branch are serialised: threads that satisfy the condition perform the computation while the rest execute a NOP, and the roles reverse on the other path.
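A minimal sketch of a divergent kernel (the name is illustrative): within each warp, even- and odd-indexed threads take different branches, so the hardware runs the two paths one after the other with the inactive threads masked off.

```c
__global__ void divergent_kernel(int *data) {
	int i = threadIdx.x;
	if (i % 2 == 0) {
		//runs first: odd-indexed threads in the warp sit idle
		data[i] = data[i] * 2;
	} else {
		//runs second: even-indexed threads in the warp sit idle
		data[i] = data[i] + 1;
	}
}
```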

# Practice Problems

```c
__global__ void stencil(int N, int *input, int *output) {
	int blockNum = blockIdx.x;
	//global index of this thread's element (offset by threads per block)
	int i = threadIdx.x + blockNum * blockDim.x;

	//sum the element and its two neighbours on each side
	//(edge threads read out of bounds unless the input has halo padding)
	int sum = input[i];
	for (int j = 1; j < 3; j++) {
		sum += input[i - j];
		sum += input[i + j];
	}
	output[i] = sum;
}

int N = len / BLOCK_SIZE;                        //number of blocks, for len input elements
int *output;
cudaMalloc((void **)&output, len * sizeof(int)); //output must live in device memory
stencil<<<N, BLOCK_SIZE>>>(N, input, output);
```