Memory Performance and Dependable Memory Hierarchy

Measuring Cache Performance

Components of CPU time

  • Program execution cycles
    • Includes cache hit time
  • Memory stall cycles
    • Mainly from cache misses

With simplifying assumptions:

$\text{Memory stall cycles} = \dfrac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}$

!> Note that every instruction requires an instruction-memory access (100%), while data-memory accesses occur only for the fraction of instructions that are loads/stores.
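As a concrete check of the formula, here is a minimal sketch. The 2% I-cache and 4% D-cache miss rates, 36% load/store mix, and 100-cycle penalty are assumed figures, chosen because they reproduce the 3.44 stall cycles per instruction quoted in the Performance Summary below:

```python
def stall_cycles_per_instruction(i_miss_rate, d_miss_rate,
                                 data_access_frac, miss_penalty):
    # Every instruction is fetched (100% of I-memory accesses);
    # data accesses occur only for the load/store fraction.
    return (i_miss_rate + data_access_frac * d_miss_rate) * miss_penalty

# Assumed figures: I-miss 2%, D-miss 4%, 36% loads/stores, 100-cycle penalty
print(stall_cycles_per_instruction(0.02, 0.04, 0.36, 100))  # 3.44
```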

Average Access Time

  • Hit time is also important for performance
  • Average Memory Access Time (AMAT)
    • $AMAT = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$

Example:

  • CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
    • AMAT = 1 + 0.05 × 20 = 2 cycles = 2 ns
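The arithmetic can be checked with a one-line helper (a direct sketch of the AMAT formula above, using the example's figures):

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average Memory Access Time, all arguments in cycles
    return hit_time + miss_rate * miss_penalty

cycles = amat(1, 0.05, 20)   # hit time 1, miss rate 5%, penalty 20
print(cycles)                # 2.0 cycles -> 2 ns with a 1 ns clock
```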

Performance Summary

  • When CPU performance increases
    • Miss penalty becomes more significant
    • Base CPI = 2, stall cycles = 3.44: memory stalls are 3.44/5.44 = 63% of execution time
    • Base CPI = 1, stall cycles = 3.44: memory stalls are 3.44/4.44 = 77% of execution time
  • Decreasing base CPI
    • Greater proportion of time spent on memory stalls
  • Increasing clock rate
    • Memory stalls account for more CPU cycles
  • Can’t neglect cache behavior when evaluating system performance
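The stall percentages above follow directly from the CPI breakdown; a quick check (3.44 stall cycles per instruction taken from the figures in this summary):

```python
def stall_fraction(base_cpi, stall_cycles=3.44):
    # Fraction of total cycles spent waiting on memory
    return stall_cycles / (base_cpi + stall_cycles)

print(f"{stall_fraction(2):.0%}")  # 63% of time stalled when base CPI = 2
print(f"{stall_fraction(1):.0%}")  # 77% of time stalled when base CPI = 1
```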

Direct Mapped Cache

  • Location determined by the block address
  • Direct mapped: only one possible entry for each block
  • The cache’s capacity is not fully exploited, so the miss rate is high

Associative Cache Example

Associative Caches

  • Fully associative
    • Allow a given block to go in any cache entry
    • Requires all entries to be searched at once
    • Comparator per entry (expensive)
  • n-way set associative
    • Each set contains n entries
    • Block number determines which set: (Block number) modulo (#Sets in cache)
    • Search all entries in a given set at once
    • n comparators (less expensive)
  • Trade-off:
    • Increased associativity decreases miss rate
    • But with diminishing returns
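The set-indexing rule above can be sketched in a few lines (the 8-block cache and block number 12 are hypothetical values for illustration):

```python
def cache_set(block_number, num_sets):
    # The set a block maps to; within the set, any of the n ways may hold it
    return block_number % num_sets

# Hypothetical 8-block cache: direct mapped = 8 sets of 1 way;
# 2-way = 4 sets of 2 ways; fully associative = 1 set of 8 ways.
print(cache_set(12, 8))  # 4  (direct mapped: exactly one possible entry)
print(cache_set(12, 4))  # 0  (2-way: either way of set 0)
print(cache_set(12, 1))  # 0  (fully associative: the only set)
```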

Example:

Replacement Policy

  • Direct mapped: no choice
  • Set associative
    • Prefer a non-valid (empty) entry, if there is one
    • Otherwise, choose among entries in the set
  • Least-recently used (LRU)
    • Choose the one unused for the longest time
    • Simple for 2-way, manageable for 4-way, too hard beyond that
  • Random
    • Gives approximately the same performance as LRU for high associativity
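LRU for one cache set can be sketched with an ordered map (a software model only; hardware tracks recency differently, e.g. a single use bit suffices for 2-way):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with `ways` entries and LRU replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # ordered oldest-first

    def access(self, tag):
        """Return True on hit, False on miss (evicting the LRU entry)."""
        if tag in self.tags:
            self.tags.move_to_end(tag)     # mark most recently used
            return True
        if len(self.tags) >= self.ways:
            self.tags.popitem(last=False)  # evict least recently used
        self.tags[tag] = True
        return False

s = LRUSet(2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])
# [False, False, True, False, False]: the second access to 2 misses
# because 2 was the LRU entry when 3 was brought in
```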

Multilevel Caches

  • Primary cache attached to CPU
    • Small, but fast
  • Level-2 cache services misses from the primary cache
    • Larger and slower than L-1, but still faster than main memory
  • Main memory services L-2 cache misses
  • Some high-end systems include an L-3 cache

Example:

  • With only a primary cache: base CPI = 1, clock cycle = 0.25 ns, miss rate = 2%/instruction, miss penalty = 400 cycles
    • CPI = 1 + 0.02 × 400 = 9
  • Now add L-2 cache
    • Access time = 5 ns
    • Global miss rate to main memory = 0.5%
  • Primary miss with L-2 hit
    • Penalty = 5 ns / 0.25 ns = 20 cycles
  • Primary miss with L-2 miss
    • Extra penalty = 400 cycles
  • CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
  • Performance ratio = 9/3.4 = 2.6
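The CPI arithmetic in this example can be verified directly (the 400-cycle main-memory penalty corresponds to a 100 ns access with the 0.25 ns cycle given above):

```python
cycle_time = 0.25              # ns, from the 5 ns / 0.25 ns step above
l1_miss_rate = 0.02            # primary-cache misses per instruction
l2_global_miss = 0.005         # misses per instruction that reach main memory
l2_penalty = 5 / cycle_time    # 20 cycles for a primary miss that hits in L-2
mem_penalty = 400              # cycles for a miss all the way to main memory

cpi_l1_only = 1 + l1_miss_rate * mem_penalty                           # 9.0
cpi_l2 = 1 + l1_miss_rate * l2_penalty + l2_global_miss * mem_penalty  # 3.4
print(cpi_l1_only, cpi_l2, round(cpi_l1_only / cpi_l2, 1))             # ratio 2.6
```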

Note: some problems state that the L-1 hit time determines the clock cycle; in that case, simply use the L-1 hit time as the clock cycle time.

Multilevel Cache Considerations:

  • Primary cache
    • Focus on minimal hit time (it sets the upper limit on the clock rate)
  • L-2 cache
    • Focus on low miss rate to avoid main-memory accesses
    • Hit time has less overall impact
  • Results
    • L-1 cache usually smaller than a single-level cache would be
    • L-1 block size smaller than L-2 block size

Software Optimization via Blocking

  • Goal: maximize accesses to data before it is replaced
  • Consider inner loops of DGEMM:
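The DGEMM loops themselves are not reproduced here; the following is a generic sketch of blocking in Python (the tile size BLOCK = 32 is an assumed value, chosen so that three BLOCK × BLOCK tiles would fit in cache):

```python
BLOCK = 32  # assumed tile size

def do_block(n, si, sj, sk, A, B, C):
    # Multiply one BLOCK x BLOCK tile so its data stays cache-resident
    for i in range(si, min(si + BLOCK, n)):
        for j in range(sj, min(sj + BLOCK, n)):
            cij = C[i][j]
            for k in range(sk, min(sk + BLOCK, n)):
                cij += A[i][k] * B[k][j]
            C[i][j] = cij

def dgemm_blocked(n, A, B, C):
    # C += A * B, iterating over tiles instead of whole rows/columns
    for si in range(0, n, BLOCK):
        for sj in range(0, n, BLOCK):
            for sk in range(0, n, BLOCK):
                do_block(n, si, sj, sk, A, B, C)

C = [[0] * 2 for _ in range(2)]
dgemm_blocked(2, [[1, 2], [3, 4]], [[1, 0], [0, 1]], C)
print(C)  # [[1, 2], [3, 4]]: multiplying by the identity returns A
```

Each element of a tile is reused BLOCK times before the tile is evicted, which is exactly the "maximize accesses before replacement" goal stated above.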

Dependability Measures

Reliability: mean time to failure (MTTF)

Service interruption: mean time to repair (MTTR)

Mean time between failures

  • MTBF = MTTF + MTTR

Availability = MTTF / (MTTF + MTTR)
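The definitions above compose directly; a minimal sketch (the MTTF and MTTR values are hypothetical):

```python
def availability(mttf, mttr):
    # Fraction of time the system delivers service; MTBF = MTTF + MTTR
    return mttf / (mttf + mttr)

# Hypothetical: MTTF = 1,000,000 hours, MTTR = 1 hour
print(availability(1_000_000, 1))  # about 0.999999 ("six nines" territory)
```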

Improving Availability

  • Increase MTTF: fault avoidance, fault tolerance, fault forecasting
  • Reduce MTTR: fault detection, fault diagnosis and fault repair

The Hamming SEC Code

  • Hamming distance
    • Number of bits that are different between two bit patterns
  • Minimum distance = 2 provides
    • Single-bit error detection
    • e.g. a parity code
  • Minimum distance = 3 provides
    • Single-bit error correction
    • 2-bit error detection

Encoding SEC (Single Error Correction)

To calculate Hamming code:

  • Number bits from 1 on the left (indices start at 1)
  • All bit positions that are a power of 2 are parity bits (these are the parity-bit positions)
  • Each parity bit checks certain data bits (this is what makes error correction possible)

!> The pattern in this figure — which positions each parity bit covers — is important: it defines how to encode.
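The three rules above can be sketched as a minimal even-parity SEC encoder (data supplied as a list of 0/1 bits; positions are numbered from 1, with parity bits at powers of 2):

```python
def hamming_encode(data):
    """Return the SEC codeword (positions 1..n) for a list of data bits."""
    r = 0
    while (1 << r) < len(data) + r + 1:  # need 2^r >= m + r + 1 check bits
        r += 1
    n = len(data) + r
    code = [0] * (n + 1)                 # code[0] unused; positions 1..n
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of 2 -> data position
            code[pos] = next(bits)
    for j in range(r):
        p = 1 << j
        # parity bit p covers every position whose binary index has bit j set
        code[p] = sum(code[pos] for pos in range(1, n + 1)
                      if pos & p and pos != p) % 2
    return code[1:]

print(hamming_encode([1, 0, 1, 1]))  # [0, 1, 1, 0, 0, 1, 1]
```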

Decoding SEC

  • Value of the parity bits indicates which bit is in error
    • Use the numbering from the encoding procedure
    • Parity bits = 0000 indicates no error
    • Parity bits = 1010 indicates bit 10 (binary 1010) was flipped

SEC/DED Code

  • Add an additional parity bit for the whole word ($p_n$)
    • Makes the Hamming distance = 4
  • Decoding:
    • Let H = the SEC parity bits
    • H even, $p_n$ even: no error
    • H odd, $p_n$ odd: correctable single-bit error
    • H even, $p_n$ odd: error in the $p_n$ bit itself
    • H odd, $p_n$ even: double error occurred (detected, but not correctable)

Note: ECC DRAM uses SEC/DED with 8 bits protecting each 64 bits

Summary

  • Cache performance
    • Mainly depends on miss rate and miss penalty
  • To improve cache performance:
    • Fully associative cache
    • Set-associative cache
    • Replacement policy
    • Multilevel cache
  • Dependability
    • MTTF, MTTR, reliability, availability
    • Hamming code: SEC/DED code