performance - Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

Monday, May 7, 2018

performance - Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

LOOP (Intel ref manual entry)
decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-family; the only difference being that that sets flags.

loop on various microarchitectures, from Agner Fog's instruction tables:

K8/K10: 7 m-ops

Bulldozer-family/Ryzen: 1 m-op (same cost as macro-fused test-and-branch, or jecxz)

P4: 4 uops (same as jecxz)

P6 (PII/PIII): 8 uops

Pentium M, Core2: 11 uops

Nehalem: 6 uops. (11 for loope / loopne). Throughput = 4c (loop) or 7c (loope/ne).

SnB-family: 7 uops. (11 for loope / loopne). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! jecxz is only 2 uops with same throughput as regular jcc

Silvermont: 7 uops

AMD Jaguar (low-power): 8 uops, 5c throughput

Via Nano3000: 2 uops

Couldn't the decoders just decode the same as lea rcx, [rcx-1] / jrcxz? That would be 3 uops. At least that would be the case with no address-size prefix, otherwise it has to use ecx and truncate RIP to EIP if the jump is taken; maybe the odd choice of address-size controlling the width of the decrement explains the many uops?

Or better, just decode it as a fused dec-and-branch that doesn't set flags? dec ecx / jnz on SnB decodes to a single uop (which does set flags).

I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.

Would it be easy for SnB-family uarch to have fast loop? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.

What's special about Bulldozer that made a fast loop easy / worth it? Or did AMD waste a bunch of transistors on making loop fast? If so, presumably someone thought it was a good idea.

If loop was fast, it would be perfect for BigInteger arbitrary-precision adc loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over dec/jnz. (And dec/jnz only macro-fuses on SnB-family).

On modern CPUs where dec/jnz is ok in an ADC loop, loop would still be nice for ADCX / ADOX loops (to preserve OF).

If loop had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.

It wouldn't stop me from getting annoyed at all the questions with bad 16bit code that uses loop for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.

Blog

Monday, May 7, 2018