performance - CPU bound vs Cache bound - Can instructions be executed without cache/memory access? Can memory access be as fast as instruction execution?

Friday, August 24, 2018

performance - CPU bound vs Cache bound - Can instructions be executed without cache/memory access? Can memory access be as fast as instruction execution?

I was looking up the difference between CPU bound and IO bound programs. That was when I came across answers that explain that there are other variants like Memory Bound, Cache bound, etc.

I understand how Memory Bound (Multiplication of 2 large matrices in Main Memory) and IO Bound (grep) differ from each other and from CPU bound/Cache bound.

However, the difference between CPU Bound programs and IO Bound programs doesn't seem as clear. Here is what I gathered :

Cache bound - Speed of cache access is an important factor in deciding the speed at which the program gets executed. For example, if the most visited part of a program is a small chunk of code inside a loop small enough to be contained within the cache, then the program may be cache bound.

CPU bound - The speed at which CPU executes instructions is an important factor in deciding the speed at which the program gets executed.

But how can processes be CPU bound? I mean, instructions need to be fetched before execution (from cache/ Main Memory) every time, so, no matter how fast the CPU is, it will have to wait for the cache to finish data transfer and thus will at least be Cache Bound or Memory bound, since memory access is slower than instruction execution.

So is CPU bound the same as cache bound?

Answer

CPU architecture is very much like plumbing, just without the smell. When one of the pipes gets clogged, some others will overflow, while others will remain empty - both cases are bad utilization, but you need to find the jam to release everything.
Similarly, with a CPU you have multiple systems that need to work in unison to make the program progress. Each of these machines has an upper limit on the bandwidth it can work, and when it's reached - it will become a limitation, making the other systems underutilized or even stalled.

The main memory for example depends on the number of channels and the type of DRAM (and of course frequency), but let's say it commonly peaks at 25G/s in client CPUs. that means that any workload that tries to consume data beyond this rate, will become blocked by the memory BW (i.e. memory bound), and the rest of the systems will be underutilized.

Cache BW depends on the cache level (and the processor micro-architecture, and of course frequency of that cache domain), but you can find out where it peaks in the optimization guides.

According to 2.1.3 here, Intel Skylake for example provides 2 32B loads + 1 store per cycle from the L1 (though the actual utilization they quote is a little lower, probably due to collisions or writeback interference), L2 is effectively about 1/2 line per cycle and L3 a little less than 1/3. This means that if your data set is contained in one of these levels, you can reach that peak BW before being capped by that cache.

On the other hand, let's say you don't reach the peak cache bandwidth, instead consuming data from the L1 at a lower rate, but each element of data requires many complicated mathematical operations. In that case, you may be bounded by your execution bandwidth - more so if these operations are limited to only part of the execution ports (as is the case with some esoteric operations).

There are useful tools to determine what you're bounded by - look up TopDown analysis for example

Blog

Friday, August 24, 2018