c - prefetching data at L1 and L2

Sunday, March 18, 2018

c - prefetching data at L1 and L2

In Agner Fog's manual Optimizing software in C++ in section 9.10 "Cahce contentions in large data structures" he describes a problem transposing a matrix when the matrix width is equal to something called the critical stride. In his test the cost for for a matrix in L1 is 40% greater when the width is equal to the critical stride. If the matrix is is even larger and only fits in L2 the cost is 600%! This is summed up nicely in Table 9.1 in his text. This is essential the same thing observed at

Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

Later he writes:

The reason why this effect is so much stronger for
level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot
prefetch more than one line at a time.

So my questions are related to prefetching data.

From his comment I infer that L1 can prefetch more than one cache line at a time. How many can it prefetch?

From what I understand trying to write code to prefetch the data (e.g. with _mm_prefetch) is rarely ever helpful. The only example I have read of is Prefetching Examples? and it's only a O(10%) improvement (on some machines). Agner later explains this:

The reason is that modern processors prefetch data automatically thanks to
out-of-order execution and advanced prediction mechanisms. Modern microprocessors are
able to automatically prefetch data for regular access patterns containing multiple streams

with different strides. Therefore, you don't have to prefetch data explicitly if data access can
be arranged in regular patterns with fixed strides.

So how does the CPU decide which data to prefetch and are there ways to help the CPU make better choices for the prefetching (e.g. "regular patterns with fixed strides")?

Edit: Based on a comment by Leeor let me add to my questions and make it more interesting. Why does the critical stride have so much more of an effect on L2 compared to L1?

Edit: I tried to reproduce Agner Fog's table using the code at Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
I ran this with MSVC2013 64-bit release mode on a Xeon E5 1620 (Ivy Bridge) which has L1 32KB 8-way, L2 256 KB 8-way, and L3 10MB 20-way. The max matrix size for L1 is about 90x90, 256x256 for L3, and 1619 for L3.

Matrix Size  Average Time
64x64        0.004251 0.004472 0.004412 (three times)
65x65        0.004422 0.004442 0.004632 (three times)
128x128      0.0409
129x129      0.0169
256x256      0.219   //max L2 matrix size
257x257      0.0692
512x512      2.701
513x513      0.649

1024x1024    12.8
1025x1025    10.1

I'm not seeing any performance loss in L1 however L2 clearly has the critical stride problem and maybe L3. I'm not sure yet why L1 does not show a problem. It's possible there is some other source of background (overhead) which is dominating the L1 times.

Blog

Sunday, March 18, 2018