A comprehensive guide from first principles to full implementation โ covering cache theory, address translation, hardware design, and RISC-V integration.
Modern processors execute instructions at gigahertz speeds โ one clock cycle can be as short as 0.3 nanoseconds. Yet main memory (DRAM) requires 50โ100 nanoseconds to return data after a request. This means the CPU is 100โ300ร faster than memory. Without a solution, the CPU would sit idle waiting for every instruction fetch and data access.
Caches work because programs exhibit locality. There are two types:
These two properties are the reason caches are effective. A cache exploits temporal locality by keeping recently-used data close to the CPU, and exploits spatial locality by loading an entire cache line (typically 64 bytes) instead of just the single byte or word requested.
The L1 instruction cache (I-Cache) is the very first level after the CPU's fetch unit. It stores recently-fetched instructions so the processor doesn't have to reach all the way to DRAM every cycle. This is exactly what we will design in this document.
A cache is organized into cache lines (also called blocks). Each line stores a contiguous chunk of memory โ typically 64 bytes. We don't cache individual bytes because of spatial locality: if you need byte X, you'll probably need X+1 soon.
| Term | Definition | Typical Value (L1 I-Cache) |
|---|---|---|
| Cache Line / Block | The smallest unit of transfer between cache and memory | 64 bytes (16 ร 32-bit words) |
| Set | A group of N lines (ways) that a given address can map to | 32 or 64 sets |
| Way | One slot within a set; N-way = N choices per set | 4 ways (4-way) |
| Total Capacity | Sets ร Ways ร Line Size | 64ร4ร64 = 16 KB |
| Tag | Upper bits of address stored to verify identity | Remaining bits after index+offset |
| Index | Bits used to select which set to look in | logโ(#sets) bits |
| Offset | Bits to select a byte within the cache line | logโ(line size) bits |
| Valid Bit | 1-bit flag: is this line's data meaningful? | 1 bit per way |
Every memory access uses a physical or virtual address that is split into three fields. For a 32-bit address with a 16 KB, 4-way, 64-byte-line cache:
Note: for instruction cache, bits[1:0] are always 00 (word-aligned in RISC-V)
Each way in each set stores:
Valid(1) | Tag(20) | Data(64ร8 = 512 bits) = 533 bits per way. For all 4 ways and 64 sets: 64 ร 4 ร 533 โ 136 KB of raw SRAM. Plus ~14 bits of LRU state per set.
Each memory block maps to exactly one location in cache. Simple, fast โ but suffers from conflict misses when two heavily-used addresses share the same slot.
A block can be placed in any cache line โ no index field, just tag. No conflict misses, but requires comparing all tags simultaneously, which is expensive in hardware (many comparators).
The practical middle ground. The cache is divided into S sets, each holding N ways. A block maps to one set (chosen by the index bits) but can go into any of the N ways within that set. This eliminates most conflict misses while keeping hardware complexity manageable.
Every process running on a CPU uses virtual addresses โ a private address space that gives the illusion of having all of memory. The operating system and hardware MMU (Memory Management Unit) translate these into physical addresses โ real locations in DRAM.
Translation happens at page granularity. A page is typically 4 KB. The page table maps the upper bits of VA (the Virtual Page Number, VPN) to the upper bits of PA (the Physical Page Number, PPN). The lower bits (the page offset, 12 bits for 4 KB pages) are identical in both VA and PA โ this is the crucial VIPT insight.
Both the index and tag come from the virtual address. Fast (no TLB needed) but suffers from aliasing (two different virtual addresses in different processes map to the same PA โ the cache will contain duplicate data) and homonyms (same VA, different PA in different processes โ the cache may return wrong data). Requires full cache flush on context switch. Rare in modern CPUs.
Both index and tag come from the physical address. Correct by design โ no aliasing or homonym problems. But the TLB must complete before the cache SRAM can even be addressed. This adds the full TLB latency to every cache hit โ critical path penalty for L1.
Index comes from the virtual address; tag comes from the physical address. The key insight: use the TLB and SRAM in parallel. While the cache SRAM is being read (indexed by VA bits), the TLB translates the upper virtual bits to physical bits. Both complete at roughly the same time, then tag comparison uses the physical tag.
VIPT introduces a subtle problem called virtual aliasing: two virtual pages can map to the same physical page. If they both index into different cache sets, the same data will be stored twice โ inconsistently.
| Strategy | Index Source | Tag Source | TLB on Critical Path? | Aliasing? | Used In |
|---|---|---|---|---|---|
| VIVT | Virtual | Virtual | No | Yes (severe) | Old ARM, rare |
| PIPT | Physical | Physical | Yes | No | D-Cache, L2/L3 |
| VIPT | Virtual | Physical | No (parallel) | Conditional | L1 I-Cache (most modern CPUs) |
Our design targets a 32-bit RISC-V core with a 16 KB, 4-way set associative, 64-byte line instruction cache. Here are all the parameters derived step by step:
| Parameter | Value | Derivation |
|---|---|---|
| Address width | 32 bits | RISC-V RV32I |
| Cache capacity | 16 KB | Design choice |
| Associativity | 4 ways | Design choice |
| Cache line size | 64 bytes | Design choice (matches burst length) |
| Number of sets | 64 | 16384 / (4 ร 64) = 64 |
| Block offset bits | 6 | logโ(64) = 6, bits[5:0] |
| Index bits | 6 | logโ(64) = 6, bits[11:6] |
| Tag bits (physical) | 20 | 32 โ 6 โ 6 = 20, bits[31:12] |
| Page offset (4 KB page) | 12 bits | logโ(4096) = 12 |
| Alias-free check | 6+6=12 โค 12 โ | VIPT safe |
| SRAM per way | (1+20+512) = 533 bits ร 64 | V + Tag + Data per line |
All four tag comparisons happen simultaneously in hardware โ this is what associativity means in implementation. The logic for each way is:
On a miss, we must evict one of the 4 ways. The most accurate policy is True LRU (Least Recently Used), but it requires logโ(4!) = 5 bits per set. A practical approximation is Pseudo-LRU (PLRU) using a 3-bit binary tree per set.
The instruction cache is read-only from the CPU's perspective. The fetch unit never writes instructions โ it only reads them. This has major design implications:
A problem arises with JIT compilers or loaders that write instructions to memory via the D-cache, then try to execute them. The I-cache and D-cache are separate; writing through the D-cache doesn't automatically update the I-cache. The program may fetch stale instructions.
FENCE.I instruction (Zifencei extension) to synchronize the instruction stream. When executed, it guarantees that any prior stores are visible to subsequent instruction fetches. Hardware implementations typically respond by flushing and invalidating the entire I-cache.
With VIPT using physical tags, the cache is naturally process-safe โ a tag miss from a different process's PA simply won't match. However, if the OS reuses a physical page for a different virtual mapping, stale I-cache data with the old physical tag could match. This is solved by:
A standard 5-stage RISC-V pipeline (as in the reference Rocket or CVA6 cores) operates as:
The IF (Instruction Fetch) stage drives the I-cache every cycle with the current PC. The cache must return the instruction within the same cycle (or signal a stall). For a typical 1 GHz embedded RISC-V core, the I-cache must deliver a result in ~1 ns โ this means the SRAM access time and tag comparison must fit within one clock cycle.
RISC-V RV32I instructions are always 4 bytes (32 bits) wide and word-aligned. This means:
00 โ they are never used for cache indexing.In a single-core RISC-V system, I-cache coherence is managed manually via FENCE.I. In multi-core systems, a hardware coherence protocol (MESI, MOESI) maintains consistency. The I-cache typically operates in a read-only invalid/valid state machine โ it never produces modified data, so it participates in coherence as a snooper only.
The I-cache is on the critical path of the processor โ it determines the minimum clock period. The critical path flows:
This is why L1 caches are kept small (16โ32 KB) โ larger SRAMs have longer access times, violating timing. L2 caches run slower and can be larger.
PC = 0x80001234. Is there a hit?
| Parameter | Our Design | Alternative (32KB, 8-way) |
|---|---|---|
| Total Size | 16 KB | 32 KB |
| Ways | 4 | 8 |
| Sets | 64 | 64 |
| Line size | 64 bytes | 64 bytes |
| Offset bits | 6 (bits[5:0]) | 6 (bits[5:0]) |
| Index bits | 6 (bits[11:6]) | 6 (bits[11:6]) |
| Tag bits | 20 (bits[31:12]) | 20 (bits[31:12]) |
| VIPT alias-free? | โ 6+6=12โค12 | โ 6+6=12โค12 |
| LRU bits/set | 3 (PLRU tree) | 7 (PLRU tree) |
| Tag comparators | 4 | 8 |
| SRAM area | ~0.02 mmยฒ (28nm) | ~0.04 mmยฒ (28nm) |
| Typical hit rate | ~97โ98% | ~98โ99% |
| Access time | ~1 cycle | ~1โ2 cycles |
| Term | Definition |
|---|---|
| Associativity (N-way) | Number of cache lines within a set where a block can be placed |
| ASID | Address Space Identifier โ process tag added to TLB/cache to avoid flushes on context switch |
| Cache Line / Block | The atomic unit of data moved between cache levels (typically 64 bytes) |
| Cold Miss | Miss because data has never been loaded (compulsory miss) |
| Conflict Miss | Miss because two lines compete for the same set (eliminated by more ways) |
| Capacity Miss | Miss because cache is too small to hold the working set |
| Critical Path | Longest combinational logic path that limits maximum clock frequency |
| D-Cache | Data cache โ serves load/store instructions in the MEM pipeline stage |
| DRAM | Dynamic RAM โ main memory, slow (50โ100 ns) but large (GBs) |
| FENCE.I | RISC-V instruction to flush I-cache and synchronize instruction stream |
| Hit / Miss | Hit: requested data found in cache. Miss: not found, must fetch from lower level |
| I-Cache | Instruction cache โ supplies instructions to the IF stage of the pipeline |
| Index | Address bits used to select which set to look in |
| LRU | Least Recently Used โ eviction policy that removes the line accessed longest ago |
| MMU | Memory Management Unit โ hardware that performs VAโPA translation |
| Offset | Address bits selecting a byte within a cache line |
| PA (Physical Address) | Real address in DRAM after MMU translation |
| Page | Fixed-size block of memory (typically 4 KB) โ the unit of VAโPA mapping |
| PIPT | Physically Indexed, Physically Tagged โ correct but TLB is on critical path |
| PLRU | Pseudo-LRU โ approximation of LRU using a binary tree, fewer bits |
| PPN | Physical Page Number โ upper bits of a physical address |
| RISC-V | Open-source reduced instruction set computer architecture |
| Set | A group of N ways in a cache; an address maps to exactly one set |
| Spatial Locality | Tendency to access addresses near recently-used addresses |
| SRAM | Static RAM โ fast (sub-nanosecond), used for cache storage |
| Tag | Upper address bits stored alongside data to verify a cache line's identity |
| Temporal Locality | Tendency to re-access recently-used memory locations |
| TLB | Translation Lookaside Buffer โ fast cache of recent VAโPA translations |
| VA (Virtual Address) | Address as seen by the program; private per-process view |
| Valid Bit | 1-bit flag per cache way indicating whether stored data is meaningful |
| VIPT | Virtually Indexed, Physically Tagged โ indexes via VA, tags via PA; fast and safe if alias condition met |
| VIVT | Virtually Indexed, Virtually Tagged โ fast but prone to aliasing and homonym problems |
| VPN | Virtual Page Number โ upper bits of a virtual address |
| Way | One slot within a set โ N-way cache has N ways per set |
| Write-Back | D-cache policy: write to cache only, flush to memory on eviction |
| Write-Through | D-cache policy: every write goes to both cache and memory immediately |
4-Way Set Associative VIPT Instruction Cache for RISC-V
From first principles to full SoC integration
To save as PDF: Open in browser โ Print โ Save as PDF โ Set paper to A4/Letter, enable Background Graphics