riscv-isa-manual

Counters draft clarification for RDCYCLE when no cycle count

Open jrmoserbaltimore opened this issue 4 years ago • 2 comments

The Counters 2.0 draft indicates six pseudo-instructions: RDCYCLE[H], RDTIME[H], and RDINSTRET[H].
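For context on the [H] variants: on RV32 each counter is 64 bits wide, and the H form returns the upper 32 bits, so software has to guard against a carry between the two reads. A minimal Python model of the usual re-read loop is sketched below. `FakeCounter` is a hypothetical stand-in for the hardware register and simply advances on every read; it is not part of any real API.

```python
def read_cycle64(read_hi, read_lo):
    """Combine two 32-bit reads into one 64-bit value, retrying on carry."""
    while True:
        hi = read_hi()
        lo = read_lo()
        if read_hi() == hi:  # upper half unchanged, so lo belongs with hi
            return (hi << 32) | lo


class FakeCounter:
    """Hypothetical 64-bit counter that advances by one on every read."""

    def __init__(self, start):
        self.value = start

    def _advance(self):
        self.value = (self.value + 1) & 0xFFFFFFFFFFFFFFFF

    def hi(self):
        self._advance()
        return self.value >> 32

    def lo(self):
        self._advance()
        return self.value & 0xFFFFFFFF


counter = FakeCounter(0xFFFFFFFE)  # two increments away from a carry
result = read_cycle64(counter.hi, counter.lo)
```

Here the first attempt straddles the carry and is discarded; the second attempt yields a consistent value.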

RDTIME[H] and RDINSTRET[H] are straightforward enough in all implementations. RDCYCLE[H] presents some issues in implementations that have no global CPU clock. Individual instructions may run at different clock rates in the ALU, or the design may use NULL Convention Logic (NCL) or another method of delay-insensitive timing.

Some creative approximations, such as a count of RAM clock cycles or of retired instructions plus retired pipeline bubbles, may hold some form of meaning, although a pipeline bubble always counts as a single bubble no matter how much time has passed. Measuring real time by executing a NOP at start and timing it doesn't work: the real time required to execute an instruction increases when the CPU is hotter and decreases when the CPU is colder or given a higher Vcore.

Perhaps an appropriate definition of a clock cycle in relation to the cycle register would be:

  • One (1) global clock cycle if present; or
  • If there is no global clock, one (1) completion of a standard operation that runs continuously to increment the cycle register

In NCL-driven implementations, the latter would perform some operation and increment the cycle register at completion. The rate of cycle increments would rise or fall with the speed at which this operation executes. The implementation would need to ensure proper operation of RDCYCLE[H].
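A rough Python model of the second case follows: the cycle register advances once per completed standard operation, so the count over a fixed real-time window depends on how long each operation takes. The function name and the picosecond latencies are illustrative assumptions, not figures from any real design.

```python
def count_standard_ops(window_ps: int, op_latency_ps: int) -> int:
    """Count back-to-back standard operations that complete within a window."""
    cycle = 0
    elapsed_ps = 0
    # cycle is incremented only when an operation completes inside the window
    while elapsed_ps + op_latency_ps <= window_ps:
        elapsed_ps += op_latency_ps
        cycle += 1
    return cycle


# Assumed latencies: the same operation is slower on a hot chip.
cool_cycles = count_standard_ops(1_000_000, 2_000)
hot_cycles = count_standard_ops(1_000_000, 3_000)
```

The same real-time window yields fewer cycles when the operation slows down, which is the rate variation described above.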

I am uncertain of the merits of strictly defining the standard operation, defining the standard operation to be identical to a certain ALU operation, or leaving the definition up to the implementation. I lean toward making it identical to a certain ALU operation, notably 32-bit integer addition.

Defining the standard operation to be identical to a certain ALU operation may avoid inflating instructions-per-clock performance measures by slowing down that operation, and as a side-effect makes such measurements meaningless.

This also has the merit of allowing the implementation to make the space-speed trade-off: if cycle is defined as counting 32-bit integer additions (result discarded) of 0xDEADBEEF and 0xCAFEF00D, an NCL implementation could increment the cycle register as a side effect of calling the adder. The implementation could repeatedly perform this addition, which will stall the instruction pipeline when executing an ADD, but the instruction pipeline takes precedence and so executes immediately when the counter addition completes.

If this constant use of the adder raises the temperature of the adder, it will slow down; hence an implementation might opt to use a separate, identical adder to increment cycle so as not to impact the performance of the main adder. Multi-hart or SMT implementations may simply rotate through all available adders to avoid heating up just one, in which case they must count normal addition operations toward cycle only when there is contention between the cycle counter loop and instruction execution.

Such a definition would make "32-bit additions per real-time second at temperature" a chip performance measure, with one or more temperatures given. Possible measurements in extensions would include:

  • RV64I: 64-bit additions per 1,000,000 32-bit additions.
  • M: MUL[W] and DIV[W] instructions per 1,000,000 32-bit additions.
  • [F,D,Q,Finx]: Floating-point ADD, MUL, DIV, and SQRT instructions per 1,000,000 32-bit additions.

These performance measurements are only speculations on how vendors, researchers, and enthusiasts would measure performance, not something to specify in the standard.
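As a sketch of the arithmetic behind a "per 1,000,000 32-bit additions" figure, assuming cycle counts 32-bit additions as proposed above, the measured operation count is scaled by the cycle delta. The function name and sample numbers are made up for illustration.

```python
def per_million_additions(op_count: int, cycle_delta: int) -> float:
    """Scale an operation count to 'operations per 1,000,000 cycle increments'."""
    if cycle_delta <= 0:
        raise ValueError("cycle_delta must be positive")
    return op_count * 1_000_000 / cycle_delta


# Made-up sample: 250 operations retired while cycle advanced by 1,000.
rate = per_million_additions(250, 1_000)
```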

Thoughts?

jrmoserbaltimore, Mar 28 '20 16:03

It’s hard to nail this down in the spec, since it requires making microarchitectural assumptions. I’m OK with putting faith in core designers to have good intuition for which clock is most useful for the IPC-measurement purpose of RDCYCLE.

From a different angle: what should RDCYCLE return in an asynchronous design? In a software simulator? (Spike just has the cycle counter and instructions-retired counter return the same value, so that it appears to model a core with an IPC of 1.)
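For reference, the IPC calculation in question is just the ratio of two counter deltas. The sketch below is illustrative Python, not Spike code; it masks the deltas to the counter width so a wrapped counter still gives a sensible result.

```python
def ipc(instret_start, instret_end, cycle_start, cycle_end, width=64):
    """Instructions per cycle from two counter snapshots, tolerating wraparound."""
    mask = (1 << width) - 1
    retired = (instret_end - instret_start) & mask
    cycles = (cycle_end - cycle_start) & mask
    if cycles == 0:
        raise ValueError("cycle counter did not advance")
    return retired / cycles


# Both counters returning the same values gives an IPC of exactly 1.
same_counters = ipc(100, 600, 100, 600)
# A core retiring one instruction every other cycle.
half_rate = ipc(0, 300, 0, 600)
# The cycle counter wraps past 2**64 between the snapshots.
wrapped = ipc(0, 4, (1 << 64) - 2, 6)
```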

On Sat, Mar 28, 2020 at 9:00 AM jrmoserbaltimore wrote:

This still leaves a problem: removing the global clock dramatically reduces CPU power consumption; constantly running an adder would dramatically increase CPU power consumption.

As real CPUs can raise and lower their clock rate, and thus the number of cycles counted per unit of real time, it may make much more sense to operate this adder loop only when there are unretired instructions in any hart, and to stop it when no instructions are executing (e.g. when waiting for an interrupt). An idle CPU would thus have zero cycles passing.
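A small Python sketch of that gating idea: cycle advances only during intervals in which some hart has unretired instructions. The interval list, the latency, and the function name are invented for illustration.

```python
def gated_cycles(intervals, op_latency_ps):
    """Sum cycle increments over busy intervals only; idle time adds nothing."""
    cycle = 0
    for duration_ps, busy in intervals:
        if busy:  # the adder loop runs only while instructions are in flight
            cycle += duration_ps // op_latency_ps
    return cycle


# (duration in ps, busy?) for a made-up busy / idle / busy timeline.
timeline = [(10_000, True), (50_000, False), (4_000, True)]
gated = gated_cycles(timeline, 2_000)
ungated = sum(duration for duration, _ in timeline) // 2_000
```

With these numbers the gated counter sees far fewer increments than a free-running loop over the same real time.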

Thoughts?

aswaterman, Mar 28 '20 20:03

To try to map rdcycle to an asynchronous design, I think we need to go back to basic principles and ask what rdcycle is used for. In general, I'd guess it is used primarily as a divisor for IPC, so pegging it to a canonical op (e.g. add) makes a bit of sense, except that even canonical ops can have data-, temperature-, and voltage-dependent timing in an asynchronous design. Trying to map what are basically analog timing numbers into the digital domain may not produce terribly useful results, and they are likely not reproducible either.

allenjbaum, Mar 30 '20 17:03