Large and Fast: Exploiting Memory Hierarchy

Cache Hierarchies

Data and instructions are stored on DRAM chips
DRAM is a technology that has high bit density, but relatively poor latency
an access to data in memory can take as many as 300 cycles!
Hence, some data is stored on the processor in structure called the cache
caches employ SRAM technology, which is faster, but has lower bit density
Internet browsers also cache web pages – same concept

Memory Structure

Address and data
Address is the index, are not stored in memory
Address can be in unit of byte or in unit of word
Only data is stored in memory

Memory Hierarchy

As you go further, capacity and latency increase

Store everything on disk
Copy recently accessed (and nearby) items from disk to smaller DRAM memory
Main memory
Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
Cache memory attached to CPU

Memory Hierarchy Levels

Block (also called line): unit of copying
May be multiple words
The memory in upper level is originally empty
If accessed data is absent
Miss: block copied from lower level
Time taken: miss penalty
Miss ratio: misses/accesses
If accessed data is present in upper level
Hit: access satisfied by upper level
Hit ratio: hits/accesses = 1 – miss ratio
Then accessed data supplied from upper level

Locality：

Why do caches work?
Temporal locality: if you used some data recently, you will likely use it again
Spatial locality: if you used some data recently, you will likely access its neighbors

Memory Technology：

Static RAM (SRAM)
0.5ns – 2.5ns, $500 – $1000 per GB
Volatile: data will lost when SRAM is not powered
Dynamic RAM (DRAM)
50ns – 70ns, $10 – $20 per GB
Flash
5us – 50us, $0.75-$1 per GB
Magnetic disk
5ms – 20ms, $0.05 – $0.1 per GB
Ideal memory
Access time of SRAM
Capacity and cost/GB of disk

SRAM Technology

DRAM Technology

Advanced DRAM Organization

Flash Storage

Disk Storage

Disk Sectors and Access

Cache Memory

Cache memory
The level of the memory hierarchy closest to the CPU
Given accesses X1, …, Xn–1, Xn

Direct Mapped Cache

Memory size: 32 words, cache size: 8 words
The address is in unit of word
Direct mapper cache:
Location determined by address
One data in memory is mapped to only one location in cache
The lower bits defines the address of the cache

Tags and Valid Bits

How do we know which particular block is stored in a cache location?
Store block address as well as the data
Actually, only need the high-order bits
Called the tag
What if there is no data in a location?
Valid bit: 1 = present, 0 = not present
Initially 0

Larger Block Size:

Assume
32-bit address
Direct mapped cache
$2^n$ number of blocks, so n bit for index
Block size: $2^m$ words, so m bit for the word within the block
Calculate:
Size of tag field: 32-(n+m+2)
Size of cache: $2^n*(\text{block size} + \text{tag size} + \text{valid field size})\ = 2n*(2m*32+(32-n-m-2)+1)$

Block Size Considerations:

Larger blocks should reduce miss rate

Due to spatial locality

But in a fixed-sized cache

Larger blocks => fewer of them
More competition => increased miss rate
Larger blocks => pollution

Larger miss penalty (with longer address bits)

Can override benefit of reduced miss rate

Early restart and critical-word-first can help

提前缓存块

先拿到需要的，再去缓存整个块

Cache Misses

On cache hit, CPU proceeds normally
On cache miss
Stall the CPU pipeline
Fetch block from next level of hierarchy
Instruction cache miss
Restart instruction fetch
Data cache miss
Complete data access

Write-Through

On data-write hit, could just update the block in cache
But then cache and memory would be inconsistent
Write through: also update memory
But makes writes take longer
write buffer：【improve】
Holds data waiting to be written to memory
CPU continues immediately
Only stalls on write if write buffer is already full

Write-Back：

On data-write hit, just update the block in cache
Keep track of whether each block is dirty
When a dirty block is replaced
Write it back to memory
Can use a write buffer to allow replacing block to be read first

Write Allocation

What should happen on a write miss?

Alternatives for write-through

Allocate on miss: fetch the block
Write around: don’t fetch the block
Since programs often write a whole block before reading it (e.g., initialization)

For write-back

Usually fetch the block

Write Policies Summary

If that memory location is in the cache:

Send it to the cache

write-through policy: send it to memory right away

write-back policy: Wait until we kick the block out

If it is not in the cache:

write allocate policy: Allocate the block (put it in the cache)
no write allocate policy： Write it directly to memory without allocation