Pipelining

Last updated Nov 8, 2022 Edit Source

# Pipelining

# Datapath

Pipelining makes use of extra registers between each pipeline in order to store the necessary data and control signals needed by the current instruction for the next stage. Without it, the next instruction will override the information.

# Clock Period

Pipelining allows us to achieve smaller clock period (and hence higher frequencies) by reducing the time required for a single stage to complete.

# Limitation

Single cycle: $1.2+0.8+0.7+1.3=4ns$
Pipelined: $1.3+0.3=1.6ns$ (register delay needs to be taken into account) Clock period of a single stage = $4/1000 = 0.004ns$ Maximum delay = $0.004+0.3=0.304ns$ Speedup = $4/0.304=13$ A 1000x increase in pipeline registers only results in a 13x increase in speedup!

# Hazards

A hazard is a drop in efficiency in the pipeline due to stalling. We can measure the effect of stalls using steady state CPI $$\text{Steady State CPI} = (No.Instructions+No.Stalls)/No.Instructions$$

# Data hazard

When either the source or destination register of an instruction is not available at the time expected in the pipeline. RAW dependencies are difficult to handle and results in stalling for a pipeline architecture.

# Detect and Wait

Wait until the required value is available in the register file by stalling (hardware) or inserting NOPs (software)

# Data Forwarding

Data forwarding via register: We can write and read from register in the same clock cycle. This means that WRITE-BACK and DECODE stage can happen at the same time

Forwarding via register is easy to implement. Even when we say there is no forwarding, forwarding via register is considered to still be in place

For other stages, we can only forward from the previous clock cycle: Notes

SUB needs X1 value latest during the execute stage
AND can obtain X1 latest through register forwarding

# Dynamic Scheduling

Out-of-order execution and completion: Reordering introduces the possibility of WAR and WAW hazards which were not possible in an in-order execution pipeline. These can be solved via register renaming:

# Loop Unrolling

Further optimizations can be made for looping code. Loop segments contain a high level of overhead (lines that work on the loop variable and branch commands), which are not directly contributing to the work of the loop body.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
for (i=999; i>=0; i=i–1) 
	x[i] = x[i] + s;
"""""""
L.D F0,0(R1) 
DADDUI R1,R1,#-8 //overhead
ADD.D F4,F0,F2  
stall //overhead
stall //overhead
S.D F4,8(R1) 
BNE R1,R2,Loop //overhead

Combine unrolling with dynamic scheduling: Example of 4 factor unrolling:

# Control hazard

Conditional and unconditional jumps, subroutine calls, and other program control instructions can stall a pipeline because of a delay in the availability of an instruction.

A naive (conservative) way would be to stall the pipeline whenever we encounter a branch instruction. Depending on hardware, this results in number of stalls (branch penalty) based on which stage of the pipeline the branch address is determined. Worst case: One clock cycle stall or flush of one instruction after each branch. $\text{Pipeline Stall Cycles per Instruction due to Branches} = \text{Branch frequency} \times \text{Branch Penalty}$

# Static Branch Prediction

Delayed Branching: schedule an independent instruction in the branch delay slot. If branch penalty is 1, we will have 1 branch delay slot.

# Dynamic Prediction

Rely on some measure of past behaviour to predict the future

# Structural hazard

When two instructions require the use of a given hardware resource at the same time that will lead to a stall in the pipeline (one instruction has to wait at least for a clock cycle).

Consider we have only one memory. For a case when write stage access the memory for writing the result and the instruction fetch stage tries to fetch the instruction from the memory at the same time.

# Practice Problems

a. Pipelined minimum is the max of all the stages: 500ps Non-pipelined min is the sum of all stages: 1650ps b. LDUR: IF -> ID -> EX -> MEM -> WB Non-pipelined sum all stages: 1650ps Pipelined: Each stage must take 500ps leading to 2500ps c. MEM stage. Clock period: 400ps d.

Used in LDUR and STUR 25%
Used in ALU and LDUR 65% a. 2 stall cycles per hazard = 6 total b.
LDUR instruction requires X0 latest at the execute stage, where the ALU calculates the memory address value. X0 is known at the E stage of ADDI: 0 stalls required
ADD instruction requires X2 latest at the E stage. X2 is known at the MEM stage of LDUR when it is loaded from data memory. Forward from M -> E: **1 stall
STUR instruction requires X3 latest at the Mem stage where it is put into data memory. X3 is updated at the E stage of ADD: 0 stalls required c. No Forwarding CPI: $(6+6)/6 =2$ Forwarding CPI: $(6+1)/6 =1.17$ a. Always taken: 75%, 60% Always not taken: 25%, 40% b.
0%
20% c.
Only 1 not taken states, resulting in predictor staying in the 11 and 10 states. 75%
40%. Cycle from 11->10->01 a. How to do this one? Number of stalls per loop = 2 + 2 + 1 = 5 Total useful instructions: $1+6\times x= 1+6x$ Total instructions: $1+(6+5)\times x=1+11x$ x = 5: $\frac{31}{56}=0.55$ x = 100: $\frac{601}{1101}= b.

$$ \begin{align} &\text{Branch Penalty Unconditional}=1 &\text{Branch Penalty Conditional}=2\\ &\text{Stall cycles Unconditional}=0.01\times1=0.01\\ &\text{Stall cycles Conditional}=0.15\times0.6\times2=0.18\\ &CPI = 1+0.18+0.1=1.19 \end{align} $$

Brendan Ang