ARCAL121

Università degli Studi di Siena

Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche (DIISM)

Course

Architettura dei Calcolatori (Computer Architecture) 2020-2021


LAB EXERCISE: Understanding the advantage of forwarding logic

PLEASE NOTE
This can also be tested with our simulator WebRISC-V.
(WebRISC-V is a web-based graphical simulation environment for a pipelined datapath, built for the RISC-V instruction set architecture. It is suitable for teaching how assembly code is executed on a pipelined RISC-V architecture and for illustrating the architectural elements of the datapath. See here for a paper describing it.)

If you use WebRISC-V, set the following execution options:
	Jump Control Hazard Resolution → Execute Delay Slot
	Forwarding Inside Pipeline → Activated or Deactivated

EXERCISE TEXT

A computer with a RISC-V processor clocked at 2 GHz runs the following program:

Note: the register x3 is preloaded with the GP, i.e., the base pointer to the static-data area.

.text
addi x12, x0, 2
addi x10, x3, 288
loop:
beq x10, x3, esci
lw x5, 100(x10)
add x5, x5, x12
sw x5, 200(x10)
j loop
addi x10, x10, -4
esci:
addi x0, x0, 0

Assume that the RISC-V processor allows a register to be written and read in the same clock cycle, that the "delay slot" generated by jump instructions can be exploited, and that the jump decision can be made in the decode stage. Calculate the speed-up (with at least 4 digits of precision) of the execution with the pipeline propagation paths (forwarding) enabled compared with the case in which they are disabled.

Extract of RISC-V instruction table (RV64IM)

Instruction coding is given in hexadecimal as opcode+funct3+{funct7,imm}.
(** instructions available only in RV64)

Instruction coding | Instruction | Example | Meaning | Comments
33+0+00 / 3b+0+00 | add | add/addw x5,x6,x7 | x5 ← x6+x7 | Add two operands; exception possible (addw**)
13+0+imm / 1b+0+imm | add immediate | addi/addiw x5,x6,100 | x5 ← x6+100 | Add a constant; exception possible (addiw**)
03+3+imm / 03+2+imm / 03+0+imm | load dword/word/byte | ld/lw/lb x5,100(x6) | x5 ← MEM[x6+100] | Data from memory to register
PSEUDOINSTRUCTION | jump | j/b 1000 | go to 1000 | Encoded as: jal x0,offset / beq x0,x0,offset

PROCEDURE

Once the pipeline operation diagram has been drawn (F=Fetch, D=Decode, X=Execute, M=data Memory access, W=Write-back), two points must be kept in mind. First, because of the delay slot after beq and j, the instruction immediately following each branch or jump is always executed. Second, the stalls generated by the dependencies between the registers (highlighted in the figure with the labels 1, 2, 3, 4) must be respected. The number of cycles required to execute the program without forwarding is then:

CWO-F(I) = 5+I*12+6 = 11+I*12

The factor I accounts for the repetition of the central portion of the code, i.e., the body of the loop labeled "loop". In our case I = 72, since the number of repetitions is determined by the value N = 288 loaded by the second instruction of the program. This value is decreased by 4 at each iteration, so the number of iterations is 288/4 = 72.
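The iteration count can be double-checked with a short Python sketch that mimics only the loop control of the program. It is not part of the original exercise: the function name is hypothetical, and the base address in x3 is assumed to be 0, which does not affect the result because only the difference between x10 and x3 matters.

```python
def count_iterations(offset=288, step=4, gp=0):
    """Count how many times the loop body runs before beq x10, x3 is taken."""
    x3 = gp               # assumed base address; any value gives the same count
    x10 = x3 + offset     # addi x10, x3, 288
    iterations = 0
    while x10 != x3:      # beq x10, x3, esci (leave the loop when equal)
        iterations += 1   # lw / add / sw / j: one pass through the body
        x10 -= step       # addi x10, x10, -4 (delay slot of j)
    return iterations


print(count_iterations())  # 72
```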

CWO-F(72) = 875

With forwarding enabled, a few cycles can be saved in each of the following cases, which correspond to the dependencies labeled 1 to 4:

1) it is possible to propagate the result of the sum ("addi") to stage D where the comparison between the two operands is made for the branch decision ("beq");
2) it is possible to propagate the result of the memory read ("lw") to stage X to perform the sum ("add");
3) it is possible to propagate the result of the sum ("add") to stage M, in which the data memory is accessed to perform the write operation ("sw");
4) it is possible to propagate the result of the sum ("addi") to stage D (as in case 1); note that this dependency arises because the delay slot is active on the jump instruction ("j loop").
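The stall cycles behind the four cases can be estimated with a small timing model, sketched here in Python. The model is an assumption made for illustration and is not taken from the exercise: an ALU result exists at the end of X, a loaded value at the end of M, a forwarded value can be used in the cycle after it is produced, and without forwarding the consumer reads the register file in D in the same cycle in which the producer writes it in W. The function and variable names are hypothetical.

```python
# Stage positions in the five-stage pipeline.
F, D, X, M, W = range(5)


def stalls(ready, need, forwarding, distance=1):
    """Stall cycles for a consumer issued `distance` instructions after its producer.

    ready: stage at whose end the producer's result exists (X for ALU ops, M for lw)
    need:  stage in which the consumer uses the value when it can be forwarded
    """
    if forwarding:
        # The value can be used in the cycle after it is produced.
        return max(0, ready + 1 - distance - need)
    # No forwarding: the consumer reads the register file in D, in the same
    # cycle in which the producer writes it in W.
    return max(0, W - distance - D)


# (label, producer, consumer, ready stage, stage needing the value)
hazards = [
    (1, "addi", "beq", X, D),
    (2, "lw",   "add", M, X),
    (3, "add",  "sw",  X, M),
    (4, "addi", "beq", X, D),
]

for label, prod, cons, ready, need in hazards:
    print(label, prod, "->", cons,
          "no fwd:", stalls(ready, need, False),
          "fwd:", stalls(ready, need, True))
```

Under this model every adjacent dependency costs 2 stall cycles without forwarding, while with forwarding cases 1, 2 and 4 cost 1 cycle each and case 3 costs none. This is consistent with the 12 and 8 cycles per iteration that appear in the formulas for CWO-F(I) and CF(I).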

It therefore results (with forwarding enabled):

CF(I) = 4+I*8+6 = 10+I*8

That is:

CF(72) = 586
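Both totals can be cross-checked by counting executed instructions and stall cycles separately. The Python sketch below uses one possible breakdown, which is an interpretation and not the grouping used in the formulas: 4 cycles of pipeline fill, 2 instructions before the loop, 6 per iteration, and 3 at the exit (the taken beq, the lw in its delay slot, and the final addi). The stall values per dependency (2 each without forwarding; 1, 1 and 0 with forwarding) are likewise assumptions, and the function name is hypothetical.

```python
def total_cycles(iterations, stall_beq, stall_add, stall_sw):
    """Cycles = pipeline fill + instructions executed + stall cycles."""
    fill = 4        # extra cycles for the first instruction to travel from F to W
    prologue = 2    # the two addi before the loop
    body = 6        # beq, lw, add, sw, j, addi (delay slot of j)
    epilogue = 3    # taken beq, lw in its delay slot, addi x0, x0, 0
    instructions = prologue + body * iterations + epilogue
    per_iteration = stall_beq + stall_add + stall_sw
    # The final, taken beq also waits for the addi that precedes it.
    stall_total = per_iteration * iterations + stall_beq
    return fill + instructions + stall_total


I = 72
without_fwd = total_cycles(I, stall_beq=2, stall_add=2, stall_sw=2)
with_fwd = total_cycles(I, stall_beq=1, stall_add=1, stall_sw=0)
print(without_fwd, with_fwd)  # 875 586
```

For any I this breakdown reduces to 11+I*12 without forwarding and 10+I*8 with forwarding, matching the two closed forms.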

Finally, since the clock frequency is the same in both cases, the speed-up is:

S = TCPUWO-F / TCPUF = (CCPUWO-F / fC) / (CCPUF / fC) = CCPUWO-F / CCPUF = 875/586 ≈ 1.493
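The final arithmetic can be reproduced with a few lines of Python. This is only a numerical check; the variable names are illustrative, and the cycle counts and the 2 GHz clock are taken from the text above.

```python
f_clock = 2e9            # clock frequency in Hz (2 GHz, from the exercise text)
cycles_without = 875     # cycles with forwarding disabled
cycles_with = 586        # cycles with forwarding enabled

t_without = cycles_without / f_clock   # execution time in seconds
t_with = cycles_with / f_clock

# The clock frequency cancels, so the speed-up equals the ratio of cycle counts.
speedup = t_without / t_with
print(f"{speedup:.4g}")  # 1.493
```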