A complete engineering reference covering computer architecture foundations, the RISC-V ISA, all six instruction types, and a full SystemVerilog implementation of a single-cycle RV32I core — Fetch, Decode, Register File, ALU, Data Memory, and Control Unit.
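Chapter 1: Computer Architecture & The Processor
1.1 What is Computer Architecture?
1.2 The Role of the Processor
1.3 ISA — The Contract Between Software and Hardware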
Chapter 2: RISC-V — Philosophy & Classification
2.1 What Makes RISC-V Different
2.2 Unprivileged vs Privileged ISA
2.3 Base ISA Variants: RV32I and RV64I
Chapter 3: Processor Execution Models
3.1 In-Order Processors
3.2 Out-of-Order Processors
3.3 Why YARP is Single-Cycle In-Order
Part II — RV32I Instruction Set Architecture
Chapter 4: Instruction Formats & Encoding
4.1 The Six Instruction Types
4.2 R-Type: Register-Register Operations
4.3 I-Type: Immediate & Load Operations
4.4 S-Type: Store Operations
4.5 B-Type: Branch Operations
4.6 U-Type: Upper Immediate Operations
4.7 J-Type: Jump Operations
Chapter 5: Programmer's Model — Registers & PC
5.1 The 32 General-Purpose Registers
5.2 The Program Counter
5.3 X0: The Hardwired Zero Register
Part III — YARP Microarchitecture
Chapter 6: YARP Overview & Full Datapath
6.1 Single-Cycle Execution Philosophy
6.2 The Five Functional Blocks
Chapter 7: Fetch Stage (IMEM)
7.1 Signals & Interface
7.2 RTL Implementation
Chapter 8: Decode Stage
8.1 Opcode Map & Type Detection
8.2 Immediate Reconstruction
8.3 RTL Implementation
Chapter 9: Register File
9.1 Interface & Requirements
9.2 X0 Immutability & Read Latency
9.3 RTL Implementation
Chapter 10: Execute Stage — ALU
10.1 ALU Operations
10.2 Signed vs Unsigned Operations
10.3 RTL Implementation
Chapter 11: Data Memory (DMEM)
11.1 Load & Store Instructions
11.2 Byte, Half-Word, Word Access
11.3 RTL Implementation
Chapter 12: Control Unit
12.1 Control Signals Overview
12.2 MUX Selects & Data Routing
12.3 Complete Control Truth Table
Part IV — Reference
Chapter 13: Worked Execution Examples
Chapter 14: Glossary
Chapter 1 — Part I: Foundations
Computer Architecture & The Processor
1.1 What is Computer Architecture?
Computer architecture describes the structure, organization, and functionality of a computer system. It defines how the major components — processor, memory, and I/O devices — are designed and how they interact with each other to execute programs.
Figure 1.1 — Computer System Components
1.2 The Role of the Processor
The processor's primary function is to execute instructions. A user writes a program in a high-level language such as C. That program is compiled to assembly, then assembled into binary machine code: strings of 1s and 0s that the processor understands directly.
1.3 ISA — The Contract Between Software and Hardware
The Instruction Set Architecture (ISA) defines the contract between software and hardware. It specifies:
The complete set of instructions the hardware can execute
How those instructions are encoded in binary
The registers visible to the programmer
Memory addressing rules and alignment constraints
Behaviour of arithmetic, logic, load/store, and control flow operations
Key Insight
The ISA is what allows software to be compiled once and run on any processor that implements that ISA — regardless of the underlying microarchitecture.
Chapter 2
RISC-V — Philosophy & Classification
2.1 What Makes RISC-V Different
RISC-V (pronounced "risk-five") is an open-source, royalty-free ISA originally developed at UC Berkeley for research and education. Unlike proprietary ISAs (x86, ARM), anyone can implement RISC-V without licensing fees. It has since grown into a mainstream architecture adopted by industry — including AMD, Western Digital, Google, and many SoC vendors.
Design Philosophy
Clean, minimal base ISA with optional extensions. The base integer ISA (I) is deliberately small — sufficient for compilers, OS kernels, and embedded systems — with optional extensions (M for multiply, F for float, etc.) added only when needed.
Avoid Over-Architecting
RISC-V avoids committing to a specific microarchitecture style. The ISA is defined by what an instruction does, not how the hardware implements it.
2.2 Unprivileged vs Privileged ISA
RISC-V separates its specification into two distinct parts:
| Feature | Unprivileged ISA | Privileged ISA |
| --- | --- | --- |
| Who uses it | User programs | Operating system / hypervisor |
| Hardware access | No | Yes, full control |
| System control | No | Yes, full control |
| Examples | ADD, LOAD, STORE | Set page table, I/O control |
| CPU mode | User mode | Kernel / Machine mode |
Why This Separation Exists
Without it, any user program could crash the system, access other programs' memory, or control hardware directly — leading to catastrophic security and stability failures. CPUs enforce: User Mode → only unprivileged ISA. Kernel Mode → access to privileged ISA.
2.3 Base ISA Variants: RV32I and RV64I
There are two primary base integer variants. XLEN refers to the integer register width in bits.
| Variant | XLEN | Address Space | Use Case |
| --- | --- | --- | --- |
| RV32I | 32 bits | 32-bit (4 GB) | Embedded, microcontrollers, our YARP design |
| RV64I | 64 bits | 64-bit (16 EB) | Linux servers, application processors |
M Extension (not in base)
By default, the base RV32I ISA does NOT include hardware multiply or divide. The integer multiply, divide, and remainder instructions (MUL, DIV, REM) require the optional M extension. Our YARP processor implements the pure RV32I base, with no M extension.
Chapter 3
Processor Execution Models
3.1 In-Order Processors
An in-order processor executes instructions exactly in the sequence they appear in the program. If one instruction stalls (e.g., waiting for data from memory), every instruction behind it must wait too.
3.2 Out-of-Order Processors
An out-of-order (OOO) processor looks ahead at multiple instructions and executes independent ones as soon as their inputs are ready, even if earlier instructions are stalled. Results are committed back in the original program order to preserve correctness (called precise state).
OOO Hardware Requirements
Instruction window · Reorder buffer (ROB) · Dependency checking logic · Reservation stations. This is what makes OOO processors complex and power-hungry.
Figure 3.1 — In-Order vs Out-of-Order Execution
3.3 Why YARP is Single-Cycle In-Order
A single-cycle processor completes every instruction in exactly one clock cycle. Each instruction flows through: Fetch → Decode → Execute → Memory → Register Write-Back — all within a single cycle. The next instruction does not begin until the current one completes fully.
✓ In-Order
Instructions execute strictly sequentially. No reordering, no speculation.
✗ No Pipelining
All five stages active in the same cycle for one instruction.
✗ No Hazard Handling
Since only one instruction is in flight, data/control hazards cannot arise.
One-Line Summary
An RV32I single-cycle processor is strictly in-order because it executes one instruction completely per cycle — with no overlap, no reordering, and no dynamic scheduling.
Chapter 4 — Part II: RV32I Instruction Set Architecture
Instruction Formats & Encoding
4.1 The Six Instruction Types
All RV32I instructions are exactly 32 bits wide and word-aligned. The lowest 7 bits (instr[6:0]) are the opcode: they identify the instruction format and, together with funct3 and funct7 where present, the operation to perform. There are six instruction formats:
| Type | Opcode | Purpose | Examples |
| --- | --- | --- | --- |
| R-type | 0x33 | Register-register arithmetic/logic | ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT |
| I-type (arith) | 0x13 | Register-immediate arithmetic | ADDI, ANDI, ORI, XORI, SLTI, SLLI, SRLI, SRAI |
| I-type (load) | 0x03 | Load from memory | LW, LH, LHU, LB, LBU |
| I-type (JALR) | 0x67 | Jump and link register | JALR |
| S-type | 0x23 | Store to memory | SW, SH, SB |
| B-type | 0x63 | Conditional branch | BEQ, BNE, BLT, BGE, BLTU, BGEU |
| U-type | 0x17, 0x37 | Upper immediate operations | AUIPC, LUI |
| J-type | 0x6F | Unconditional jump | JAL |
4.2 R-Type: Register-Register Operations
R-type instructions operate on two source registers (rs1, rs2) and write the result to a destination register (rd). The funct3 and funct7 fields together identify the specific operation.
4.3 I-Type: Immediate & Load Operations
I-type instructions use one source register (rs1) and a 12-bit sign-extended immediate as the second operand. The result goes to rd. The same format is used for arithmetic immediates, loads, and JALR.
Figure 4.2 — I-Type Instruction Format
| 31:20 | 19:15 | 14:12 | 11:7 | 6:0 |
| --- | --- | --- | --- | --- |
| imm[11:0] (12b, sign-extended) | rs1 (5b) | funct3 (3b) | rd (5b) | opcode (7b) |
| Instruction | opcode | Operation |
| --- | --- | --- |
| ADDI rd, rs1, imm | 0x13 | rd = rs1 + sign_ext(imm) |
| LW rd, imm(rs1) | 0x03 | rd = Mem[rs1 + sign_ext(imm)] |
| JALR rd, rs1, imm | 0x67 | rd = PC+4; PC = (rs1 + sign_ext(imm)) with bit 0 cleared |
4.4 S-Type: Store Operations
S-type instructions store a register value to memory. The immediate is split across two fields (to keep rs1, rs2, funct3 in the same bit positions as R-type — simplifying decoder logic). There is no rd field — no result is written to a register.
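4.5 B-Type: Branch Operations
B-type is similar to S-type but encodes a PC-relative branch offset. The immediate is again split and scrambled to keep the rs1, rs2, and funct3 fields in the same positions. Note: bit 0 of the offset is always 0 and is not encoded, so offsets are multiples of 2 bytes. In plain RV32I (no compressed extension), a taken branch must additionally land on a 4-byte-aligned address.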
Figure 4.4 — B-Type Instruction Format
| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
| --- | --- | --- | --- | --- | --- |
| imm[12], imm[10:5] | rs2 (5b) | rs1 (5b) | funct3 (3b) | imm[4:1], imm[11] | opcode (7b) |
| Instruction | funct3 | Condition |
| --- | --- | --- |
| BEQ rs1, rs2, offset | 0x0 | Branch if rs1 == rs2 |
| BNE rs1, rs2, offset | 0x1 | Branch if rs1 ≠ rs2 |
| BLT rs1, rs2, offset | 0x4 | Branch if rs1 < rs2 (signed) |
| BGE rs1, rs2, offset | 0x5 | Branch if rs1 ≥ rs2 (signed) |
| BLTU rs1, rs2, offset | 0x6 | Branch if rs1 < rs2 (unsigned) |
| BGEU rs1, rs2, offset | 0x7 | Branch if rs1 ≥ rs2 (unsigned) |
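The funct3-to-condition mapping above can be sketched as a single combinational function. This is a minimal illustration, not the YARP source: the package name `branch_sketch_pkg` and the function `branch_taken` are hypothetical, and how YARP actually evaluates branch conditions is not shown in this section.

```systemverilog
// Hypothetical helper package; names are illustrative, not from the YARP RTL.
package branch_sketch_pkg;
  // Returns 1 when a B-type branch with the given funct3 should be taken.
  function automatic logic branch_taken(
    input logic [2:0]  funct3,
    input logic [31:0] rs1,
    input logic [31:0] rs2
  );
    unique case (funct3)
      3'h0:    branch_taken = (rs1 == rs2);                    // BEQ
      3'h1:    branch_taken = (rs1 != rs2);                    // BNE
      3'h4:    branch_taken = ($signed(rs1) <  $signed(rs2));  // BLT
      3'h5:    branch_taken = ($signed(rs1) >= $signed(rs2));  // BGE
      3'h6:    branch_taken = (rs1 <  rs2);                    // BLTU
      3'h7:    branch_taken = (rs1 >= rs2);                    // BGEU
      default: branch_taken = 1'b0;  // funct3 0x2 and 0x3 are not branch encodings
    endcase
  endfunction
endpackage
```

With rs1 = 0xFFFFFFFF and rs2 = 1, BLT is taken (the signed view is -1 < 1) while BLTU is not (the unsigned view is 4294967295 < 1), which is the whole difference between the signed and unsigned rows.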
4.6 U-Type: Upper Immediate Operations
U-type instructions carry a 20-bit immediate that forms bits [31:12] of a 32-bit value whose lower 12 bits are zero. LUI writes that value to the destination register directly; AUIPC first adds it to the PC.
Figure 4.5 — U-Type Instruction Format
| 31:12 | 11:7 | 6:0 |
| --- | --- | --- |
| imm[31:12], 20-bit upper immediate | rd (5b) | opcode (7b) |
LUI (Load Upper Immediate), opcode 0x37
rd = {imm[31:12], 12'b0}
Directly loads a 20-bit constant into the upper portion of rd.
AUIPC (Add Upper Immediate to PC), opcode 0x17
rd = PC + {imm[31:12], 12'b0}
Used to build PC-relative addresses. Paired with ADDI to form any 32-bit address.
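Pairing an upper-immediate instruction with ADDI has one subtlety: ADDI sign-extends its 12-bit immediate, so when bit 11 of the target value is set, the upper 20 bits must be incremented by one to compensate. The sketch below shows that split for the LUI case. The package and function names (`li_split_pkg`, `hi20`, `lo12`, `lui_addi`) are hypothetical helpers written for this illustration.

```systemverilog
// Hypothetical helpers showing how a 32-bit constant splits into LUI + ADDI parts.
package li_split_pkg;
  // Upper 20 bits for LUI, compensated for ADDI's sign extension.
  function automatic logic [19:0] hi20(input logic [31:0] value);
    logic [31:0] rounded;
    rounded = value + 32'h0000_0800;  // carries into bit 12 when bit 11 is set
    return rounded[31:12];
  endfunction

  // Lower 12 bits, used unchanged as the ADDI immediate.
  function automatic logic [11:0] lo12(input logic [31:0] value);
    return value[11:0];
  endfunction

  // Recombine the two parts the way LUI followed by ADDI would.
  function automatic logic [31:0] lui_addi(
    input logic [19:0] hi,
    input logic [11:0] lo
  );
    return {hi, 12'b0} + {{20{lo[11]}}, lo};
  endfunction
endpackage
```

For 0x12345678 the split is simply 0x12345 and 0x678. For 0xDEADBEEF the low part 0xEEF is negative once sign-extended, so the upper part becomes 0xDEADC rather than 0xDEADB.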
4.7 J-Type: Jump Operations
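J-type (JAL) encodes a 21-bit PC-relative offset into a heavily scrambled immediate field. J-type has no rs1 field; the scrambling keeps rd in its usual position, keeps the sign bit at instr[31], and lines up as many immediate bits as possible with their I-type and U-type positions, which reduces mux complexity in the decoder.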
Figure 4.6 — J-Type Instruction Format
| 31 | 30:21 | 20 | 19:12 | 11:7 | 6:0 |
| --- | --- | --- | --- | --- | --- |
| imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd (5b) | opcode (7b) |
JAL rd, offset
rd = PC + 4 (save return address)
PC = PC + sign_ext(offset) (jump to target)
Note: the offset is always even (bit 0 is implicitly 0 and is not encoded). In plain RV32I the jump target must also be 4-byte aligned.
Chapter 5
Programmer's Model — Registers & PC
5.1 The 32 General-Purpose Registers
RV32I provides 32 integer registers, each 32 bits wide (XLEN=32). They are named x0 through x31. Register addresses are 5 bits wide (2⁵ = 32 addresses), which is why rs1, rs2, and rd fields in instruction encodings are all 5 bits.
Figure 5.1 — RV32I Programmer's Model
5.3 X0: The Hardwired Zero Register
Register x0 always reads as zero and ignores all writes. This is not a software convention; it is enforced in hardware. This enables useful pseudo-instructions: for example, NOP is encoded as ADDI x0, x0, 0, and MV rd, rs as ADDI rd, rs, 0.
Chapter 6
YARP Overview & Full Datapath
6.1 Single-Cycle Execution Philosophy
In YARP, every instruction completes within a single clock cycle. The full datapath, from instruction fetch to register writeback, is combinational (or registered at the very end). This means:
No pipeline registers between stages
No hazard detection or forwarding logic needed
No out-of-order scheduling hardware
The clock period must accommodate the worst-case instruction (typically a load)
Per-cycle operation:
Fetch → Decode → [Register Read] → ALU/Execute → Data Memory → Register Write-Back
─────────────────────────────────────────────────────────────────────
All of the above happens in ONE clock cycle for ONE instruction.
6.2 The Five Functional Blocks
Figure 6.1 — YARP Full Single-Cycle Datapath
The Control Unit receives instruction type signals from the decoder and generates all the mux-select signals and enable bits that route data correctly through the datapath. It is pure combinational logic — no state.
Chapter 7
Fetch Stage — YARP IMEM
7.1 Signals & Interface
The Fetch stage is responsible for requesting instructions from memory every cycle. It takes the current PC value, asserts a read request to instruction memory, and forwards the returned 32-bit instruction to the Decode stage.
Figure 7.1 — YARP Fetch Module Block Diagram
| Signal | Direction | Width | Description |
| --- | --- | --- | --- |
| instr_mem_pc_i | Input | 32b | Current PC: address of instruction to fetch |
| mem_rd_data_i | Input | 32b | Instruction word returned by instruction memory |
| instr_mem_req_o | Output | 1b | Fetch request signal to instruction memory (asserted when active) |
| instr_mem_addr_o | Output | 32b | Address sent to instruction memory; equals PC |
| instr_mem_instr_o | Output | 32b | Fetched instruction forwarded to the Decode stage |
7.2 Fetch Logic Explained
Since this is a single-cycle processor, instr_mem_addr_o is simply wired to instr_mem_pc_i: the PC is the instruction address. The fetch request (instr_mem_req_o) is de-asserted asynchronously while reset_n is low, using a flip-flop with an active-low asynchronous reset. It is asserted at the first rising clock edge after reset is released and stays asserted from then on.
Design Note — Why Async Reset for req?
The instruction memory request must be deasserted the instant reset is asserted, even mid-cycle. An asynchronous negedge-reset FF achieves this without waiting for a clock edge.
7.3 RTL Implementation
// YARP Instruction Memory: Fetch Stage RTL
module yarp_instr_mem (
  input  logic        clk,
  input  logic        reset_n,
  input  logic [31:0] instr_mem_pc_i,     // PC address value
  // Memory interface: read request to IMEM
  output logic        instr_mem_req_o,
  output logic [31:0] instr_mem_addr_o,
  input  logic [31:0] mem_rd_data_i,      // instruction from memory
  // Instruction to decoder
  output logic [31:0] instr_mem_instr_o
);
  // Assert req after reset deasserts (async negedge reset)
  always_ff @(posedge clk, negedge reset_n)
    if (~reset_n) instr_mem_req_o <= 1'b0;
    else          instr_mem_req_o <= 1'b1;
  // PC passes directly to memory address port
  assign instr_mem_addr_o = instr_mem_pc_i;
  // Memory data is the instruction; forward it to the decoder
  assign instr_mem_instr_o = mem_rd_data_i;
endmodule
Chapter 8
Decode Stage — yarp_decode
8.1 Opcode Map & Type Detection
The Decode unit receives the raw 32-bit instruction from Fetch and performs three jobs simultaneously: identify the instruction type, extract register addresses and function fields, and reconstruct the sign-extended immediate value.
Figure 8.1 — Opcode to Instruction Type Mapping
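Type detection from the opcode can be sketched as a one-hot decode of instr[6:0], using the opcode values from the table in section 4.1. This is only an illustrative sketch: the module name `decode_type_sketch` and its port names are assumptions, and the real yarp_decode interface is not shown in this section.

```systemverilog
// Illustrative one-hot instruction type detector; port names are assumptions.
module decode_type_sketch (
  input  logic [31:0] instr_i,
  output logic        r_type_o,
  output logic        i_type_o,   // arithmetic-immediate, load, or JALR
  output logic        s_type_o,
  output logic        b_type_o,
  output logic        u_type_o,
  output logic        j_type_o
);
  logic [6:0] opcode;
  assign opcode = instr_i[6:0];

  always_comb begin
    // Default: no type flag set (covers unknown opcodes)
    {r_type_o, i_type_o, s_type_o, b_type_o, u_type_o, j_type_o} = '0;
    unique case (opcode)
      7'h33:               r_type_o = 1'b1;
      7'h13, 7'h03, 7'h67: i_type_o = 1'b1;
      7'h23:               s_type_o = 1'b1;
      7'h63:               b_type_o = 1'b1;
      7'h17, 7'h37:        u_type_o = 1'b1;
      7'h6F:               j_type_o = 1'b1;
      default: ;           // leave all flags at 0
    endcase
  end
endmodule
```

A real decoder would likely also distinguish loads and JALR within the I-type group, since the control unit treats them differently; this sketch stops at the six formats.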
8.2 Immediate Reconstruction
Each instruction type encodes its immediate differently: bits are scattered across the instruction word to keep rs1, rs2, and funct3 aligned. The decoder reassembles these into a clean 32-bit immediate, sign-extended from instr[31].
Why Bit Scrambling?
The RISC-V ISA intentionally scrambles immediate bits so that rs1 (bits 19:15), rs2 (bits 24:20), and funct3 (bits 14:12) always sit in the same bit positions in every instruction format that has them. This reduces the number of muxes needed in the decode unit.
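The reassembly for each format can be written as one concatenation per instruction type, following the field layouts in Chapter 4. The sketch below is not the yarp_decode source; the package name `imm_sketch_pkg` and the function names are hypothetical.

```systemverilog
// Hypothetical immediate-reconstruction helpers, one per RV32I format.
package imm_sketch_pkg;
  // I-type: imm[11:0] = instr[31:20]
  function automatic logic [31:0] imm_i(input logic [31:0] instr);
    return {{20{instr[31]}}, instr[31:20]};
  endfunction

  // S-type: imm[11:5] = instr[31:25], imm[4:0] = instr[11:7]
  function automatic logic [31:0] imm_s(input logic [31:0] instr);
    return {{20{instr[31]}}, instr[31:25], instr[11:7]};
  endfunction

  // B-type: imm[12|11|10:5|4:1] = instr[31|7|30:25|11:8], bit 0 is always 0
  function automatic logic [31:0] imm_b(input logic [31:0] instr);
    return {{19{instr[31]}}, instr[31], instr[7], instr[30:25], instr[11:8], 1'b0};
  endfunction

  // U-type: imm[31:12] = instr[31:12], low 12 bits are 0
  function automatic logic [31:0] imm_u(input logic [31:0] instr);
    return {instr[31:12], 12'b0};
  endfunction

  // J-type: imm[20|19:12|11|10:1] = instr[31|19:12|20|30:21], bit 0 is always 0
  function automatic logic [31:0] imm_j(input logic [31:0] instr);
    return {{11{instr[31]}}, instr[31], instr[19:12], instr[20], instr[30:21], 1'b0};
  endfunction
endpackage
```

In every format the sign bit comes from instr[31], which is why sign extension can start before the decoder knows which format it is looking at.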
Chapter 9
Register File
9.1 Interface & Requirements
The Register File is the central data store of the processor. For every instruction, it simultaneously provides data from up to two source registers (rs1, rs2) and accepts a write-back value into the destination register (rd), all in the same cycle.
Figure 9.1 — Register File Interface
| Signal | Dir | Width | Description |
| --- | --- | --- | --- |
| rs1_addr_i | In | 5b | Address of source register 1 (from decode) |
| rs2_addr_i | In | 5b | Address of source register 2 (from decode) |
| rd_addr_i | In | 5b | Destination register address (write port) |
| wr_en_i | In | 1b | Write enable, from control unit |
| wr_data_i | In | 32b | Data to write into rd (from ALU/memory/PC+4) |
| rs1_data_o | Out | 32b | Data read from rs1; goes to ALU operand A |
| rs2_data_o | Out | 32b | Data read from rs2; goes to ALU operand B or memory write data |
9.2 X0 Immutability & Read Latency
| Requirement | RV32I Specification | YARP Implementation |
| --- | --- | --- |
| X0 value | Always reads 0, writes ignored | Write guarded: only write if rd_addr_i ≠ 0. Read: always returns 0 for addr = 5'b0 |
| Read latency | Combinational (same cycle) | Assign statements; no clock needed for reads |
| Write timing | Registered: the write is captured on the rising clock edge that ends the cycle, so the new value is visible to the next instruction | always_ff write port clocked on posedge clk |
9.3 RTL Implementation
module yarp_regfile (
  input  logic        clk, reset_n,
  input  logic [4:0]  rs1_addr_i, rs2_addr_i,
  input  logic [4:0]  rd_addr_i,
  input  logic        wr_en_i,
  input  logic [31:0] wr_data_i,
  output logic [31:0] rs1_data_o, rs2_data_o
);
  // 32 registers, each 32 bits wide
  logic [31:0] regfile [31:0];
  // Write port (registered)
  always_ff @(posedge clk, negedge reset_n) begin
    if (~reset_n) begin
      for (int i = 0; i < 32; i++) regfile[i] <= 32'b0;
    end else begin
      // Write protection for X0: only write if rd_addr_i != 0
      if (wr_en_i && (rd_addr_i != 5'b0))
        regfile[rd_addr_i] <= wr_data_i;
    end
  end
  // Read ports (combinational, zero latency)
  // X0 always returns 0 regardless of stored value
  assign rs1_data_o = (rs1_addr_i == 5'b0) ? 32'b0 : regfile[rs1_addr_i];
  assign rs2_data_o = (rs2_addr_i == 5'b0) ? 32'b0 : regfile[rs2_addr_i];
endmodule
Four Key Design Decisions
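1. Writes to X0 are blocked by the write guard (rd_addr_i != 0). 2. Reads of X0 are forced to 0 by the read mux, regardless of the stored value. 3. Reads are purely combinational, so both operands are available in the same cycle with zero latency. 4. No read-after-write hazard arises in the single-cycle design: each write is captured at the clock edge, before the next instruction reads.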
Chapter 10
Execute Stage — ALU (yarp_execute)
10.1 ALU Operations
The ALU (Arithmetic Logic Unit) is a pure combinational block that takes two 32-bit operands and a 4-bit operation selector, and outputs a 32-bit result. In YARP, the ALU covers all integer computation defined in the RV32I base ISA.
| ALU Op | Operation | RV32I Instructions |
| --- | --- | --- |
| OP_ADD | A + B | ADD, ADDI, LW, SW, AUIPC, JAL, JALR, branches |
| OP_SUB | A − B | SUB |
| OP_SLL | A << B[4:0] | SLL, SLLI |
| OP_SRL | A >> B[4:0] (logical) | SRL, SRLI |
| OP_SRA | A >>> B[4:0] (arithmetic) | SRA, SRAI |
| OP_OR | A \| B | OR, ORI |
| OP_AND | A & B | AND, ANDI |
| OP_XOR | A ^ B | XOR, XORI |
| OP_SLTU | A < B (unsigned) → 1 or 0 | SLTU, SLTIU |
| OP_SLT | A < B (signed) → 1 or 0 | SLT, SLTI |
10.2 Signed vs Unsigned Operations
SLTU / SLTIU — Unsigned Comparison
Treats both operands as unsigned 32-bit integers. Comparison happens on the raw bit pattern (no sign extension).
SLT / SLTI — Signed Comparison
Treats both operands as 2's complement signed integers. If bit[31] = 1, the value is negative. A signed negative value is always less than any positive value.
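The operation table and the signed/unsigned distinction can be sketched as a single combinational case statement. This is an illustrative sketch rather than the yarp_execute source: the enum encoding, the package name `alu_sketch_pkg`, and the port names are assumptions (the text only states that the selector is 4 bits wide).

```systemverilog
// Illustrative ALU sketch; enum encoding and port names are assumptions.
package alu_sketch_pkg;
  typedef enum logic [3:0] {
    OP_ADD, OP_SUB, OP_SLL, OP_SRL, OP_SRA,
    OP_OR, OP_AND, OP_XOR, OP_SLTU, OP_SLT
  } alu_op_t;
endpackage

module alu_sketch
  import alu_sketch_pkg::*;
(
  input  logic [31:0] opr_a_i,
  input  logic [31:0] opr_b_i,
  input  alu_op_t     op_sel_i,
  output logic [31:0] alu_res_o
);
  always_comb begin
    unique case (op_sel_i)
      OP_ADD:  alu_res_o = opr_a_i + opr_b_i;
      OP_SUB:  alu_res_o = opr_a_i - opr_b_i;
      OP_SLL:  alu_res_o = opr_a_i << opr_b_i[4:0];
      OP_SRL:  alu_res_o = opr_a_i >> opr_b_i[4:0];             // zero fill
      OP_SRA:  alu_res_o = $signed(opr_a_i) >>> opr_b_i[4:0];   // sign fill
      OP_OR:   alu_res_o = opr_a_i | opr_b_i;
      OP_AND:  alu_res_o = opr_a_i & opr_b_i;
      OP_XOR:  alu_res_o = opr_a_i ^ opr_b_i;
      OP_SLTU: alu_res_o = {31'b0, opr_a_i < opr_b_i};                    // raw bit pattern compare
      OP_SLT:  alu_res_o = {31'b0, $signed(opr_a_i) < $signed(opr_b_i)};  // 2's complement compare
      default: alu_res_o = 32'b0;
    endcase
  end
endmodule
```

The only difference between the SLT and SLTU rows, and between SRA and SRL, is whether operand A is interpreted through `$signed`; the underlying bits are identical.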
Chapter 11
Data Memory (DMEM)
11.1 Load & Store Instructions
The Data Memory stage handles the two instruction classes that access external memory:
Load Instructions
Read data FROM memory into a register. The effective address is computed by the ALU: rs1 + sign_ext(imm). LW, LH, LHU, LB, LBU.
Store Instructions
Write data FROM a register TO memory. Address = rs1 + sign_ext(imm). SW, SH, SB. No register write-back occurs.
All other instructions (ADD, AND, JAL, etc.) do not access data memory. Their result (the ALU result, an immediate, or PC+4) is routed directly to the register write-back path.
11.2 Byte, Half-Word, Word Access
RV32I supports three access widths. Memory is always byte-addressable. Address alignment constraints apply:
| Instruction | Access Width | Bytes | Address Alignment | Sign Extend? |
| --- | --- | --- | --- | --- |
| LW / SW | Word | 4 | addr[1:0] = 2'b00 (4-byte aligned) | N/A (full 32 bits) |
| LH / SH | Half-word | 2 | addr[0] = 0 (2-byte aligned) | LH: sign-extended to 32b |
| LHU | Half-word | 2 | addr[0] = 0 | Zero-extended to 32b |
| LB / SB | Byte | 1 | Any address | LB: sign-extended to 32b |
| LBU | Byte | 1 | Any address | Zero-extended to 32b |
Alignment Rule — Word Store Example
Valid SW addresses (last 2 bits must be 00): 0x0000, 0x0004, 0x0008, 0x000C…
Valid SH addresses (last bit must be 0): 0x0000, 0x0002, 0x0004…
SB: any address is valid.
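The sign- and zero-extension rules for loads can be sketched as a function that picks the addressed byte or half-word out of a 32-bit memory word and extends it. This sketch rests on assumptions that the section does not state: that the memory returns the aligned 32-bit word containing the address, and that byte lanes are little-endian. The package name `load_sketch_pkg` and the function `load_extend` are hypothetical; the funct3 values are the standard RV32I load encodings.

```systemverilog
// Hypothetical load-data formatter; assumes an aligned 32-bit word, little-endian lanes.
package load_sketch_pkg;
  function automatic logic [31:0] load_extend(
    input logic [2:0]  funct3,    // 000 LB, 001 LH, 010 LW, 100 LBU, 101 LHU
    input logic [1:0]  addr_lsb,  // low two bits of the effective address
    input logic [31:0] mem_word   // aligned word read from data memory
  );
    logic [7:0]  b;
    logic [15:0] h;
    b = mem_word[8 * addr_lsb +: 8];       // selected byte lane
    h = mem_word[16 * addr_lsb[1] +: 16];  // selected half-word lane
    unique case (funct3)
      3'b000:  return {{24{b[7]}}, b};   // LB: sign-extend
      3'b001:  return {{16{h[15]}}, h};  // LH: sign-extend
      3'b010:  return mem_word;          // LW: full word
      3'b100:  return {24'b0, b};        // LBU: zero-extend
      3'b101:  return {16'b0, h};        // LHU: zero-extend
      default: return 32'b0;             // not a load encoding
    endcase
  endfunction
endpackage
```

For a memory word of 0x8899AAFF, LB at byte offset 0 yields 0xFFFFFFFF while LBU yields 0x000000FF: the same byte, extended two different ways.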
Chapter 12
Control Unit
12.1 Control Signals Overview
The Control Unit is the "brain" of the datapath. It takes instruction type signals from the Decoder and generates all the mux-select and enable signals that route data correctly through the processor. It is entirely combinational logic, with no state registers.
Figure 12.1 — Control Unit Inputs and Outputs
12.2 MUX Selects & Data Routing
| Control Signal | Width | Meaning |
| --- | --- | --- |
| pc_sel_o | 1b | 0 = PC+4 (sequential); 1 = Branch/Jump target from ALU |
| op1sel_o | 1b | 0 = rs1 register data; 1 = Current PC (for AUIPC, JAL) |
| op2sel_o | 1b | 0 = rs2 register data; 1 = Sign-extended immediate |
| alu_func_o | 4b | ALU operation select; matches the OP_ADD, OP_SUB… enum |
| rf_wr_en_o | 1b | 1 = instruction writes to register file; 0 = no write (S/B type) |
| rf_wr_data_o | 2b | 00 = ALU result; 01 = DMEM read data; 10 = 32-bit immediate; 11 = PC+4 |
| data_req_o | 1b | 1 = this instruction accesses data memory (Load or Store) |
| data_wr_o | 1b | 1 = write to memory (Store); 0 = read from memory (Load) |
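The routing that these select signals perform can be sketched as four muxes: two ALU operand muxes, the write-back mux, and the next-PC mux. This is a sketch under assumptions, not the YARP top level: the module name `mux_sketch` and all port names are hypothetical, and `rf_wr_data_sel_i` stands in for the control unit's 2-bit rf_wr_data_o select.

```systemverilog
// Illustrative datapath muxes driven by the control signals; port names are assumptions.
module mux_sketch (
  input  logic [31:0] pc_i,
  input  logic [31:0] rs1_data_i,
  input  logic [31:0] rs2_data_i,
  input  logic [31:0] imm_i,
  input  logic [31:0] alu_res_i,
  input  logic [31:0] dmem_rd_data_i,
  input  logic        pc_sel_i,          // from pc_sel_o
  input  logic        op1sel_i,          // from op1sel_o
  input  logic        op2sel_i,          // from op2sel_o
  input  logic [1:0]  rf_wr_data_sel_i,  // from rf_wr_data_o
  output logic [31:0] alu_opr_a_o,
  output logic [31:0] alu_opr_b_o,
  output logic [31:0] wb_data_o,
  output logic [31:0] next_pc_o
);
  logic [31:0] pc_plus4;
  assign pc_plus4 = pc_i + 32'd4;

  // ALU operand selection
  assign alu_opr_a_o = op1sel_i ? pc_i  : rs1_data_i;
  assign alu_opr_b_o = op2sel_i ? imm_i : rs2_data_i;

  // Next PC: sequential or the branch/jump target computed by the ALU
  assign next_pc_o = pc_sel_i ? alu_res_i : pc_plus4;

  // Register write-back data selection
  always_comb begin
    case (rf_wr_data_sel_i)
      2'b00:   wb_data_o = alu_res_i;
      2'b01:   wb_data_o = dmem_rd_data_i;
      2'b10:   wb_data_o = imm_i;
      default: wb_data_o = pc_plus4;  // 2'b11
    endcase
  end
endmodule
```

Under this reading, a JAL would set op1sel and op2sel to 1 so the ALU computes PC + offset, set pc_sel to 1 to take that target, and select 11 so that PC+4 is written to rd.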