# Cache
# Memory Organisation
Memory in a cache is stored as cache lines (blocks). The CPU locates data in the cache by splitting the memory address into a tag, an index, and a block offset:
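
A minimal sketch of this address split, assuming a cache with 64-byte lines and 256 sets (both parameters are assumptions chosen for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed illustrative geometry: 64-byte lines, 256 sets. */
#define BLOCK_OFFSET_BITS 6   /* log2(64)  */
#define INDEX_BITS        8   /* log2(256) */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & ((1u << BLOCK_OFFSET_BITS) - 1);                 /* byte within the line */
    uint32_t index  = (addr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* which set to look in */
    uint32_t tag    = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS);               /* compared against the stored tag */

    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```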
# Instruction cache and Data cache
Storing these separately allows for better parallelism: the CPU can fetch instructions from the instruction cache while writing to the data cache for store (e.g. STUR) instructions.
- Cache Placement Policies
- Cache Replacement Policies
- Cache Write Policies
# Performance
The key factor affecting cache performance is the effect of cache misses. When a cache miss occurs, extra cycles are needed to find a victim line, request the appropriate block from memory, fill the cache line with the new block, and resume execution.
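
A simplified sketch of that miss path for a direct-mapped cache (the geometry, struct layout, and the memory_fetch stand-in are illustrative assumptions, not a real hardware interface):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_SETS   256
#define BLOCK_SIZE 64

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Hypothetical stand-in for a main-memory read; here it just zero-fills. */
static void memory_fetch(uint32_t block_addr, uint8_t *dst) {
    (void)block_addr;
    memset(dst, 0, BLOCK_SIZE);
}

static uint8_t *cache_access(uint32_t addr) {
    uint32_t index = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag   = addr / BLOCK_SIZE / NUM_SETS;
    cache_line_t *line = &cache[index];

    if (!line->valid || line->tag != tag) {                  /* miss */
        memory_fetch(addr - addr % BLOCK_SIZE, line->data);  /* victim is the indexed line; refill it */
        line->tag   = tag;
        line->valid = true;
    }
    return &line->data[addr % BLOCK_SIZE];                   /* resume execution with the data */
}

int main(void) {
    cache_access(0x12345678);              /* first touch: compulsory miss */
    uint8_t *p = cache_access(0x12345679); /* same block: hit */
    printf("%u\n", *p);
    return 0;
}
```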
# Types of Misses
# Compulsory miss
First reference to a given block of memory. This is an inevitable miss as data has not been accessed before.
# Capacity miss
This occurs when the current working set exceeds the cache capacity. Current useful blocks have to be replaced.
# Conflict miss
When useful blocks are displaced because the placement policy forces several addresses into the same set, e.g. in a direct-mapped or set-associative cache. A fully associative cache has no conflict misses, since a block may be placed anywhere.
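
For instance, with assumed parameters of a 16 KiB direct-mapped cache with 64 B blocks (256 sets), any two addresses that differ by a multiple of the cache size land in the same set, so alternating between them misses every time even though the rest of the cache sits idle:

$$
\begin{align}
&\text{Set}(\text{0x1040})=(\text{0x1040}\div64)\bmod 256=65\\
&\text{Set}(\text{0x5040})=(\text{0x5040}\div64)\bmod 256=65
\end{align}
$$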
# Design considerations
- Number of blocks
- More blocks give a larger capacity, which results in fewer capacity misses
- Associativity
- Reduces conflict misses
- Increases access time
- Block size
- Larger blocks exploit spatial locality. More data in the same area is loaded together when requested.
- Reduces compulsory misses since more data is loaded at once.
- Reduces the number of cache blocks for a fixed cache size. This increases conflict misses, as there is a higher chance of different data mapping to the same blocks (a short worked breakdown follows this list).
- Increases miss penalty as more data needs to be replaced on a miss.
- Levels of cache
- Using a multi-level cache reduces the miss penalty: an L1 miss that hits in L2 costs far less than a full main memory access
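
The worked breakdown referenced above, using an assumed 32 KiB cache for illustration:

$$
\begin{align}
&\text{64 B blocks: }32768\div64=512\text{ blocks}\\
&\text{256 B blocks: }32768\div256=128\text{ blocks}
\end{align}
$$

The larger block exploits more spatial locality per miss, but leaves only a quarter as many blocks for addresses to map to (more conflict misses), and each refill moves four times as much data (higher miss penalty).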
# False sharing
False sharing arises when different cores cache independent values that happen to share the same cache line. The picture below depicts two cores accessing different, independent values (X and Y) that reside on the same cache line in the L3 cache. If X and Y are heavily used variables:
- Writing to X invalidates Core 2's copy of the cache line
- Writing to Y invalidates Core 1's copy of the cache line
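
Because each write invalidates the other core's copy, the line ping-pongs between cores even though no data is logically shared. A minimal sketch of the problem and a common fix, assuming 64-byte cache lines and POSIX threads (the struct layouts and names are illustrative):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define ITERS 10000000L

/* X and Y are logically independent, but packed next to each other they
 * share one 64-byte cache line, so the two writers falsely share it. */
struct packed_counters {
    volatile uint64_t x;
    volatile uint64_t y;
};

/* The fix: pad (and in practice also align) each counter onto its own line. */
struct padded_counters {
    volatile uint64_t x;
    char pad[64 - sizeof(uint64_t)];
    volatile uint64_t y;
};

static struct packed_counters s;

static void *bump_x(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) s.x++; return NULL; } /* runs on one core */
static void *bump_y(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) s.y++; return NULL; } /* runs on another  */

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_x, NULL);
    pthread_create(&t2, NULL, bump_y, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x=%llu y=%llu\n", (unsigned long long)s.x, (unsigned long long)s.y);
    return 0;
}
```

Switching `s` to `struct padded_counters` (unused above, shown only for contrast) keeps each counter on its own line, so the writes no longer invalidate the other core's copy.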
# Measuring Impact with CPI
$$
\begin{align}
&CPU_{time}=(CPU_{\text{execution cycles}}+\text{Memory stall cycles})\times\text{Cycle Time}\\
&CPU_{time}=((IC\times CPI)+(IC\times \%\text{Memory Access}\times\text{Miss Rate}\times\text{Miss Penalty}))\times \text{Cycle Time}
\end{align}
$$

An L1 cache hit can be considered part of $CPI_{ideal}$, as it is often possible to complete the data access within the ideal clock cycles.
# Example
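A small worked example with assumed numbers (the figures are illustrative, not from a set problem): suppose $CPI_{ideal}=1$, 30% of instructions access memory, the miss rate is 5%, and the miss penalty is 100 cycles.

$$
\begin{align}
&\text{Memory stall cycles per instruction}=0.3\times0.05\times100=1.5\\
&\text{CPI stall}=1+1.5=2.5
\end{align}
$$

Even though only $0.3\times0.05=1.5\%$ of instructions miss, the stalls more than double the effective CPI.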
# Multi-level cache example
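A sketch with assumed numbers of how an L2 shrinks the miss penalty: keep the workload above ($CPI_{ideal}=1$, 30% memory accesses, 5% L1 miss rate), and assume the L1 miss penalty is now a 10-cycle L2 access, with a 20% L2 miss rate and a 100-cycle penalty to main memory.

$$
\begin{align}
&\text{L1 stall cycles}=0.3\times0.05\times10=0.15\\
&\text{L2 stall cycles}=0.3\times0.05\times0.2\times100=0.3\\
&\text{CPI stall}=1+0.15+0.3=1.45
\end{align}
$$

Most L1 misses are now serviced by the L2, cutting the effective CPI from 2.5 to 1.45.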
# Measuring Impact with Average Memory Access Time (AMAT)
We need a way to measure cache performance on its own, independent of CPU performance. AMAT is the average time to access memory, considering both hits and misses:

$$
\begin{aligned}
AMAT&=\text{Hit Time}\times(1-\text{Miss Rate})+\text{Miss Rate}\times(\text{Hit Time}+\text{Miss Penalty})\\
&=\text{Hit Time}+\text{Miss Rate}\times\text{Miss Penalty}
\end{aligned}
$$

Note that Miss Penalty here is the extra cycles lost on a miss, not just the cost of the main memory access. E.g. if the time to hit the cache is 1 cycle and the time to reach main memory is 100 cycles, the miss penalty is $100-1=99$ cycles.

One way to see that AMAT is the better metric is to consider two caches with similar miss rates but drastically different hit times. The miss rate metric rates both caches the same, whereas AMAT favours the cache with the lower hit time (or lower miss penalty), assuming all other variables are equal.
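
A quick numeric check with assumed values (1-cycle hit, 5% miss rate, 99-cycle miss penalty):

$$
AMAT=1+0.05\times99=5.95\text{ cycles}
$$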
# Practice Problems
Direct-mapped cache with a 3-bit index, 2-bit tag and 0-bit offset.
| Addr  | Index | Tag | H/M | State (index, tag) |
|-------|-------|-----|-----|--------------------|
| 10011 | 011   | 10  | M   | 011,10 |
| 00001 | 001   | 00  | H   |        |
| 00110 | 110   | 00  | H   |        |
| 01010 | 010   | 01  | M   | 010,01 |
| 01110 | 110   | 01  | M   | 110,01 |
| 11001 | 001   | 11  | M   | 001,11 |
| 00001 | 001   | 00  | M   | 001,00 |
| 11100 | 100   | 11  | M   | 100,11 |
| 10100 | 100   | 10  | M   | 100,10 |
Bits for offset = $\log_2 4=2$
2-way set associative cache: each set contains 2 cache lines -> 2 blocks per set -> 8 bytes per set -> 2 sets
Bits for index -> 1
| Access | 10001101 | 10110010 | 10111111 | 10001100 |
|--------|----------|----------|----------|----------|
| Tag    | 10001    | 10110    | 10111    | 10001    |
| Offset | 01       | 10       | 11       | 00       |
| Index  | 1        | 0        | 1        | 1        |
| H/M    | M        | M        | M        | H        |
| LRU1   | 10001    | 10110    | 10111    | 10001    |
| LRU2   | NA       | NA       | 10001    | 10111    |
Hit rate: 2/8 = 25%

a.
$$
\begin{align}
&T=IC\times CPI\times \text{Period}\\
&CPI_{ideal}=1.25\\
&\text{L1 stall cycles}=0.2\times0.2\times8=0.32\\
&\text{L2 stall cycles}=0.2\times0.2\times0.1\times30=0.12\\
&\text{CPI stall}=1.25+0.32+0.12=1.69
\end{align}
$$

b.
$$
\begin{align}
&\text{L1 stall cycles}=0.2\times0.2\times30=1.2\\
&\text{CPI stall}=1.25+1.2=2.45
\end{align}
$$