You are definitely getting what I call a cache resonance. This is similar to aliasing, but not exactly the same. Let me explain.
Caches are hardware data structures that extract one part of the address and use it as an index into a table, not unlike an array in software. (In fact, we call them arrays in hardware.) The cache array contains entries of cache line data plus a tag - sometimes one such entry per index in the array (direct mapped), sometimes several (N-way set associative). A second part of the address is extracted and compared to the tag stored in the array. Together, the index and tag uniquely identify a cache line's memory address. Finally, the rest of the address bits identify which bytes within the cache line are addressed, along with the size of the access.
Usually the index and tag are simple bitfields. So a memory address looks like
...Tag... | ...Index... | Offset_within_Cache_Line
(Sometimes the index and tag are hashes, e.g. a few XORs of other bits into the mid-range bits that are the index. Much more rarely, sometimes the index, and more rarely the tag, are things like taking cache line address modulo a prime number. These more complicated index calculations are attempts to combat the problem of resonance, which I explain here. All suffer some form of resonance, but the simplest bitfield extraction schemes suffer resonance on common access patterns, as you have found.)
So, typical values... there are many different models of "Opteron Dual Core", and I do not see anything here that specifies which one you have. Choosing one at random, I will use the most recent manual I see on AMD's website: the BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h Models 00h-0Fh, March 12, 2012.
(Family 15h = Bulldozer family, the most recent high-end processor - the BKDG mentions dual core, although I don't know a product number that exactly matches what you describe. But, anyway, the same idea of resonance applies to all processors; it is just that parameters like cache size and associativity may vary a bit.)
From p.33:
The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1
data cache with two 128-bit ports. This is a write-through cache that
supports up to two 128-bit loads per cycle. It is divided into 16
banks, each 16 bytes wide. [...] Only one load can be performed from a
given bank of the L1 cache in a single cycle.
To sum up:
64 byte cache line => 6 offset bits within cache line
16KB/4-way => the resonance is 4KB.
I.e. address bits 0-5 are the cache line offset.
16KB / 64B cache lines => 2^14/2^6 = 2^8=256 cache lines in the cache.
(Bugfix: I originally miscalculated this as 128; I have fixed all the dependent numbers.)
4-way associative => 256/4 = 64 indexes in the cache array. I (Intel) call these "sets".
i.e. you can consider the cache to be an array of 64 entries or sets, each entry containing 4 cache lines and their tags. (It's more complicated than this, but that's okay.)
(By the way, the terms "set" and "way" have varying definitions.)
there are 6 index bits, bits 6-11 in the simplest scheme.
This means that any cache lines that have exactly the same values in the index bits, bits 6-11, will map to the same set of the cache.
Now look at your program.
C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];
Loop k is the innermost loop. The base type is double, 8 bytes. If dimension=2048, i.e. 2K, then successive elements of B[dimension*k+j]
accessed by the loop will be 2048 * 8 = 16K bytes apart. They will all map to the same set of the L1 cache - they will all have the same index in the cache. This means that, instead of 256 cache lines being available for use, there will effectively only be 4 - the "4-way associativity" of the cache.
I.e. you will probably get a cache miss every 4 iterations around this loop. Not good.
(Actually, things are a little more complicated. But the above is a good first understanding. The addresses of the entries of B mentioned above are virtual addresses, so the physical addresses might be slightly different. Moreover, Bulldozer has a way-predicted cache, probably using virtual address bits so that it doesn't have to wait for a virtual-to-physical address translation. But, in any case: your code has a "resonance" of 16K. The L1 data cache has a resonance of 16K. Not good.)
If you change the dimension just a little bit, e.g. to 2048+1, then the addresses of array B will be spread across all of the sets of the cache. And you will get significantly fewer cache misses.
It is a fairly common optimization to pad your arrays, e.g. to change 2048 to 2049, to avoid this sort of resonance. But "cache blocking" is an even more important optimization: http://suif.stanford.edu/papers/lam-asplos91.pdf
In addition to the cache line resonance, there are other things going on here. For example, the L1 cache has 16 banks, each 16 bytes wide. With dimension = 2048, successive B accesses in the inner loop will always go to the same bank. So they can't go in parallel - and if the A access happens to go to the same bank, you will lose.
Looking at it, though, I don't think this is as big a problem as the cache resonance.
And, yes, possibly, there may be aliasing going on. E.g. the STLF (Store To Load Forwarding) buffers may be comparing only a small bitfield of the address, and getting false matches.
(Actually, if you think about it, resonance in the cache is like aliasing, both related to the use of bitfields. Resonance is caused by multiple cache lines mapping to the same set rather than being spread around. Aliasing is caused by matching based on incomplete address bits.)
Overall, my recommendation for tuning:
Try cache blocking without any further analysis. I say this because cache blocking is easy, and it is very likely that this is all you would need to do.
After that, use VTune or OProf. Or Cachegrind. Or ...
Better yet, use a well tuned library routine to do matrix multiply.