[Computer Architecture] Multi-cylcle MIPS

4 분 소요

Review: Processor Performance

Program Execution time

= (#instructions)(cycles/instruction)(seconds/cycle)

= # instructions x CPI x T_C

HDL description

Review: Processor Performance

Program Execution Time

= (#instructions/program)(cycles/instruction)(seconds/cycle)

= # instructions x CPI x T_C

Single-Cycle Performance

Critical path

combinational logic에서 어떤 거는 빠르게 오고 어떤 거는 느리게 올텐데 이 중 가장 느린 경로가 critical path라고 한다.

clk cylce time을 결정하는 경로

T_C limited by critical path (lw)

clock period가 이를 수용할 수 있을 정도로 길어야 함.

Single-Cycle Performance

Single-cycle critical path:
- T_c = t_{pcq_PC} + t_mem + max(t_RFread, t_sext+ t_mux) + t_ALU+ t_mem+ t_mux+ t_RFsetup
Typically, limiting paths are:
- memory, ALU, register file
- Tc = t_{pcq_PC}+ 2_tmem + t_RFread+ t_mux+ t_ALU+ t_RFsetup
t_sext : sign extention

Example

T_C = t_{pcq_PC}+ 2_tmem + t_RFread+ t_mux+ t_ALU+ t_RFsetup

= [30 + 2(250) + 150 + 25 + 200 + 20] ps

= 925ps

Program with 100 billion instrucitons:

Execution Time = # instructions x CPI x T_C

= (100 x 10⁹) (1) (925 x 10^-12s)

= 92.5 seconds

Single cycle의 문제점

명령어 별로 걸리는 시간차이가 다르기 때문에(lw는 엄청 느리고, J 나 beq는 굉장히 빠르다) 가장 긴 instruction 시간에 맞춰서 clock time을 정해야 하기 때문에 짧은 instruction은 클락이 끝나지 않았는데도 이미 벌써 끝나서 다음 클락을 기다리기 때문에 비효율적인 상황이 발생한다.

그래서 이를 해결하기 위한 방법 두 가지를 앞으로 소개해 보도록 하겠다.

Multi-Cycle
Pipelining

Multi-cycle MIPS Processor

Single-cycle
- simple
- cycle time limited by longest instruction(lw)
- 2 adders/ALUs & 2 memories
Multicycle
- single에서는 모든 명령어의 시간이 같았다면 multi는 clock을 나누어서 빨리 끝나는 명령어와 늦게 끝나는 명령어가 있게 되었다.
- higher clock speed
- simpler instructions run faster
- reuse expensive hardware on multiple cycles (IM, DM, ALU 등)
- sequencing overhead paid many times
- 5단계로 쪼개서 진행할 거임
  - 나눴다는 얘기는 중간에 register file을 꼈다는 얘기
- 현재 어떤 cycle에 해당하냐에 따라 control signal이 다름
  - 그래서 state마다 값들이 달라야 하므로 FSM 사용->state transition diagram
Same design steps: datapath & control
faster clock이 관건
overhead: setup time, t_pcq

Q) 그럼 multi-cycle은 CPI가 5인 것인가?

A) 평균으로 구함

Multi-cycle State Elements

Replace Instruction and Data memories with a single unified memory - more realistic

Multi-cycle Datapath

위와 같이 나누어서 생각해 보자.(골고루 나누어야 효율적이기 때문)

그러면 중간 중간 데이터를 담아야할 register를 끼워 주어야 함(clock이 지나면 잊어버리기 때문에 저장해 놓는 것)

Instruction Fetch

STEP 1: Fetch instruciton

Instruction Fetch를 진행한다.
Instruction Register에 저장.
이 때, ALU는 놀고 있기 때문에 program + 4도 진행한다.

lw Register Read(or instruction decode)

STEP 2a: Read source operands from RF

opcode를 보고 이 코드를 통해 control에서 무슨 명령어인지 파악하여 각 control signal을 보내준다.

STEP 2b: Sign-extend the immediate

lw Address(or execute operation)

STEP 3: Compute the memory address

+ 연산을 진행하여 결과를 register에 저장.

lw Memory Read

STEP 4: Read Data from memory

IorD에 따라 data memory에 저장해야 하기 때문에 IorD는 1이 될 것이고 이 결과가 data memory에 저장된다.

data memory에 있는 것을 읽어서 asych하게 data register에 넣어준다.

lw Write Register

STEP 5: Write back data block to register file

Increment PC

Increment PC -> 사실상 step 1에서도 가능하다 (step1(fetch)에서 ALU는 놀고있기 떄문에)

Multicycle Datapath: sw

Write data in rt to memory

RF에 저장할 필요가 없기 때문에 4 cycle이면 된다.

Multi-cycle Datapath: R-Type

Read from rs and rt
Write ALUResult to register file
Write to rd (instead of rt)

Multicycle Datapath: beq

rs == rt?
BTA = (sign-extended immediate « 2) + (PC+4)

3번째 clock에서는 ALU를 쓰고 있기 때문에(빼기 연산) BTA를 계산하는 과정을 3번째 clock에는 넣으면 안 된다.

그러므로 두 번째 cycle에서 BTA를 만들어 놓고 3번째 clock에서 두 값의 조건을 확인하는 용도로 ALU를 사용한다.

Multicycle Processor

Finite State Machine이 사용됨

lw 명령에서는 총 5번에 해당하는 cycle이 진행되어 각 클락마다, 즉 현재 어떤 상태이냐에 따라 값이 다를 수 있기 때문에 FSM을 사용하는 것이다.

5단계로 쪼갰기 떄문에 4개의 register가 추가됨

Multicycle control

Control unit을 main controller와 ALU decoder 로 나누었음.

Control에는 Mux control과 Write enable 신호가 있다.

Main Controller FSM:

FSM: moore machine

Fetch

MUX select signals are listed only when their value matters; otherwise, they are don’t cares.
Enable signals are listed only when they are asserted; otherwise, they are 0.

Fetch

IR <- Mem[PC] // IorD = 0 , IRWrite

PC <- PC + 4 //AluSrcA, ……

Decode

A <- RF[A1]

B <- RF[A2]

control 신호가 필요없어서 안 씀

Address

lw or sw

ALUResult <- A + Imm

ALU register에 저장

lw

5 clock 걸림

Data <- Mem[ALUout] :4번째 ; data register에 저장하는 과정

RF[A3] <- Data : 5번째 ; data에 저장된 내용을 register file에 저장하는 과정

sw

memory에 write하고 다시 Fetch 과정으로 돌아감

R-Type

state에 따라 각각의 control signal이 다르기 때문에 FSM 사용하는 것.

beq

ALUout <- PC + 4 + Imm « 2 // 두 번째 cycle에서 BTA 계산

if (A == B) PC <- ALUout //세 번째 cycle

Extended Functionality: addi

하나는 A에서 하나는 immediate에서
사실상 S2, S9은 같기 때문에 위처럼 해도 되고 연결을 공유해도 됨

Extended Functionality: j

Main Controller FSM: j

Multicycle Processor

Twitter Facebook LinkedIn

Kang Chang Ryong

Review: Processor Performance

HDL description

Review: Processor Performance

Single-Cycle Performance

Critical path

Single-Cycle Performance

Example

Single cycle의 문제점

Multi-cycle MIPS Processor

Multi-cycle State Elements

Multi-cycle Datapath

Instruction Fetch

lw Register Read(or instruction decode)

lw Address(or execute operation)

lw Memory Read

lw Write Register

Increment PC

Multicycle Datapath: sw

Multi-cycle Datapath: R-Type

Multicycle Datapath: beq

Multicycle Processor

Multicycle control

Main Controller FSM:

Fetch

Fetch

Decode

Address

lw

sw

R-Type

beq

Extended Functionality: addi

Extended Functionality: j

Main Controller FSM: j

Multicycle Processor

공유하기

댓글남기기

참고

[OS] Virtual Memory

[HCI] 10. Universal Design-Multi-Sensory Systems

[HCI] 9. Dialogue Notations and Design 2

[HCI] 9. Dialogue Notations and Design 1