performance - Dependency chain analysis

Friday, September 28, 2018

performance - Dependency chain analysis

From Agner Fog's "Optimizing Assembly" guide, Section 12.7: a loop example. One of the paragraphs discussing the example code:

[...] Analysis for Pentium M: ... 13 uops at 3 per clock = one iteration per 4.33c retirement time.

There is a dependency chain in the loop. The latencies are: 2 for
memory read, 5 for multiplication, 3 for subtraction, and 3 for memory
write, which totals 13 clock cycles. This is three times as much as
the retirement time but it is not a loop-carried dependence because
the results from each iteration are saved to memory and not reused in

the next iteration. The out-of-order execution mechanism and
pipelining makes it possible that each calculation can start before
the preceding calculation is finished. The only loop-carried
dependency chain is add eax,16 which has a latency of only 1.

## Example 12.6b.  DAXPY algorithm, 32-bit mode
[...]   ; not shown: initialize some regs before the loop
L1:
    movapd xmm1, [esi+eax]   ; X[i], X[i+1]

    mulpd  xmm1, xmm2        ; X[i] * DA, X[i+1] * DA
    movapd xmm0, [edi+eax]   ; Y[i], Y[i+1]
    subpd  xmm0, xmm1        ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
    movapd [edi+eax], xmm0   ; Store result
    add eax, 16              ; Add size of two elements to index
    cmp eax, ecx             ; Compare with n*8
    jl L1                    ; Loop back

I cannot understand why the dependency chain doesn't increase a whole throughput. I know that it is only important to find the worst bottleneck. The worst bottleneck identified before considering dependency chains was fused-domain uop throughput, at 4.33 cycles per iteration. I cannot understand why the dependency chain isn't a bigger bottleneck than that.

I see that the author explains that it is connected with out-of-order execution and pipelining but I cannot see it. I mean, though, only multiplication causes latency 5 cycles so only this value is greater than 4 cycle.

I also cannot understand why the author doesn't care about the dependency here:
add eax, 16 -> cmp eax, ecx -> jl L1
After all, addition must be executed before cmp and cmp must be executed before jl.

PS: later paragraphs identify the biggest bottleneck for Pentium M as decode, limiting it to one iteration per 6c, because 128b vector ops decode to two uops each. See Agner Fog's guide for the rest of the analysis, and analysis + tuning for Core2, FMA4 Bulldozer, and Sandybridge.

Answer

the mul isn't part of a loop-carried dependency chain, so there can be mulpd insns from multiple iterations in flight at once. The latency of a single instruction isn't the issue here at all, it's the dependency chain. Each iteration has a separate 13c dependency chain of load, mulpd, subpd, store. Out-of-order execution is what allows uops from multiple iterations to be in flight at once.

The cmp / jl in each iteration depend on the add from that iteration, but the add in the next iteration doesn't depend on the cmp. Speculative execution and branch prediction mean that control dependencies (conditional branches and indirect jumps/calls) are not part of data dependency chains. This is why instructions from one iteration can start running before the jl from the preceding iteration retires.

By comparison, cmov is a data dependency instead of a control dependency, so branchless loops tend to have loop-carried dependency chains. This tends to be slower than branching if the branch predicts well.

Each loop iteration has a separate cmp/jl dependency chain, just like the FP dependency chain.

I cannot understand why dependency chain doesn't increase a whole throughput.

I have no idea what this sentence means. I think I was able to figure out all your other mixed up words and phrasing. (e.g. "chain dependency" instead of "dependency chain".) Have a look at my edits to your question; some of them might help your understanding, too.

Blog

Friday, September 28, 2018