I first noticed in 2009 that GCC (at least on my projects and on my machines) have the tendency to generate noticeably faster code if I optimize for size (-Os
) instead of speed (-O2
or -O3
), and I have been wondering ever since why.
I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.
const int LOOP_BOUND = 200000000;
__attribute__((noinline))
static int add(const int& x, const int& y) {
return x + y;
}
__attribute__((noinline))
static int work(int xval, int yval) {
int sum(0);
for (int i=0; i int x(xval+sum);
int y(yval+sum);
int z = add(x, y);
sum += z;
}
return sum;
}
int main(int , char* argv[]) {
int result = work(*argv[1], *argv[2]);
return result;
}
If I compile it with -Os
, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2
or -O3
. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).
(Update: I have moved all assembly code to GitHub: They made the post bloated and apparently add very little value to the questions as the fno-align-*
flags have the same effect.)
Here is the generated assembly with -Os
and -O2
.
Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2
and merged all its differences into the assembly for -Os
except the .p2align
lines, result here. This code still runs in 0.38s and the only difference is the .p2align
stuff.
If I guess correctly, these are paddings for stack alignment. According to Why does GCC pad functions with NOPs? it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.
Is it the padding that is the culprit in this case? Why and how?
The noise it makes pretty much makes timing micro-optimizations impossible.
How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?
UPDATE:
Following Pascal Cuoq's answer I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops
to gcc, all .p2align
are gone from the assembly and the generated executable runs in 0.38s. According to the gcc documentation:
-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:
-falign-functions -falign-jumps -falign-loops
-falign-labels -freorder-blocks -freorder-blocks-and-partition
-fprefetch-loop-arrays
So, it pretty much seems like a (mis)alignment issue.
I am still skeptical about -march=native
as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)
UPDATE 2:
We can take -Os
out of the picture. The following times are obtained by compiling with
-O2 -fno-omit-frame-pointer
0.37s-O2 -fno-align-functions -fno-align-loops
0.37s-S -O2
then manually moving the assembly ofadd()
afterwork()
0.37s-O2
0.44s
It looks like to me the distance of add()
from the call site matters a lot. I have tried perf
, but the output of perf stat
and perf report
makes very little sense to me. However, I could only get one consistent result out of it:
-O2
:
602,312,864 stalled-cycles-frontend # 0.00% frontend cycles idle
3,318 cache-misses
0.432703993 seconds time elapsed
[...]
81.23% a.out a.out [.] work(int, int)
18.50% a.out a.out [.] add(int const&, int const&) [clone .isra.0]
[...]
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
¦ return x + y;
100.00 ¦ lea (%rdi,%rsi,1),%eax
¦ }
¦ ? retq
[...]
¦ int z = add(x, y);
1.93 ¦ ? callq add(int const&, int const&) [clone .isra.0]
¦ sum += z;
79.79 ¦ add %eax,%ebx
For fno-align-*
:
604,072,552 stalled-cycles-frontend # 0.00% frontend cycles idle
9,508 cache-misses
0.375681928 seconds time elapsed
[...]
82.58% a.out a.out [.] work(int, int)
16.83% a.out a.out [.] add(int const&, int const&) [clone .isra.0]
[...]
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
¦ return x + y;
51.59 ¦ lea (%rdi,%rsi,1),%eax
¦ }
[...]
¦ __attribute__((noinline))
¦ static int work(int xval, int yval) {
¦ int sum(0);
¦ for (int i=0; i
¦ int x(xval+sum);
8.20 ¦ lea 0x0(%r13,%rbx,1),%edi
¦ int y(yval+sum);
¦ int z = add(x, y);
35.34 ¦ ? callq add(int const&, int const&) [clone .isra.0]
¦ sum += z;
39.48 ¦ add %eax,%ebx
¦ }
For -fno-omit-frame-pointer
:
404,625,639 stalled-cycles-frontend # 0.00% frontend cycles idle
10,514 cache-misses
0.375445137 seconds time elapsed
[...]
75.35% a.out a.out [.] add(int const&, int const&) [clone .isra.0] ¦
24.46% a.out a.out [.] work(int, int)
[...]
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
18.67 ¦ push %rbp
¦ return x + y;
18.49 ¦ lea (%rdi,%rsi,1),%eax
¦ const int LOOP_BOUND = 200000000;
¦
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
¦ mov %rsp,%rbp
¦ return x + y;
¦ }
12.71 ¦ pop %rbp
¦ ? retq
[...]
¦ int z = add(x, y);
¦ ? callq add(int const&, int const&) [clone .isra.0]
¦ sum += z;
29.83 ¦ add %eax,%ebx
It looks like we are stalling on the call to add()
in the slow case.
I have examined everything that perf -e
can spit out on my machine; not just the stats that are given above.
For the same executable, the stalled-cycles-frontend
shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend
for different executables doesn't make sense to me.)
I included the cache misses as it came up as the first comment. I examined all the cache misses that can be measured on my machine by perf
, not just the ones given above. The cache misses are very very noisy and show little to no correlation with the execution times.
Answer
My colleague helped me find a plausible answer to my question. He noticed the importance of the 256 byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).
Short answer:
Is it the padding that is the culprit in this case? Why and how?
It all boils down to alignment. Alignments can have a significant impact on the performance, that is why we have the -falign-*
flags in the first place.
I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 byte by default but try to align it to 16 byte if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3
does the appropriate alignment and the generated code does not show this weird behavior.
Of course, if an inappropriate alignment is done, it makes things worse. An unnecessary / bad alignment just eats up bytes for no reason and potentially increases cache misses, etc.
The noise it makes pretty much makes timing micro-optimizations
impossible.
How can I make sure that such accidental lucky / unlucky alignments
are not interfering when I do micro-optimizations (unrelated to stack
alignment) on C or C++ source codes?
Simply by telling gcc to do the right alignment:
g++ -O2 -falign-functions=16 -falign-loops=16
Long answer:
The code will run slower if:
an
XX
byte boundary cutsadd()
in the middle (XX
being machine dependent).if the call to
add()
has to jump over anXX
byte boundary and the target is not aligned.if
add()
is not aligned.if the loop is not aligned.
The first 2 are beautifully visible on the codes and results that Marat Dukhan kindly posted. In this case, gcc-4.8.1 -Os
(executes in 0.994 secs):
00000000004004fd <_ZL3addRKiS0_.isra.0>:
4004fd: 8d 04 37 lea eax,[rdi+rsi*1]
400500: c3
a 256 byte boundary cuts add()
right in the middle and neither add()
nor the loop is aligned. Surprise, surprise, this is the slowest case!
In case gcc-4.7.3 -Os
(executes in 0.822 secs), the 256 byte boundary only cuts into a cold section (but neither the loop, nor add()
is cut):
00000000004004fa <_ZL3addRKiS0_.isra.0>:
4004fa: 8d 04 37 lea eax,[rdi+rsi*1]
4004fd: c3 ret
[...]
40051a: e8 db ff ff ff call 4004fa <_ZL3addRKiS0_.isra.0>
Nothing is aligned, and the call to add()
has to jump over the 256 byte boundary. This code is the second slowest.
In case gcc-4.6.4 -Os
(executes in 0.709 secs), although nothing is aligned, the call to add()
doesn't have to jump over the 256 byte boundary and the target is exactly 32 byte away:
4004f2: e8 db ff ff ff call 4004d2 <_ZL3addRKiS0_.isra.0>
4004f7: 01 c3 add ebx,eax
4004f9: ff cd dec ebp
4004fb: 75 ec jne 4004e9 <_ZL4workii+0x13>
This is the fastest of all three. Why the 256 byte boundary is speacial on his machine, I will leave it up to him to figure it out. I don't have such a processor.
Now, on my machine I don't get this 256 byte boundary effect. Only the function and the loop alignment kicks in on my machine. If I pass g++ -O2 -falign-functions=16 -falign-loops=16
then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer
flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32
or any multiples of 16, the code is not sensitive to that either.
I first noticed in 2009 that gcc (at least on my projects and on my
machines) have the tendency to generate noticeably faster code if I
optimize for size (-Os) instead of speed (-O2 or -O3) and I have been
wondering ever since why.
A likely explanation is that I had hotspots which were sensitive to the alignment, just like the one in this example. By messing with the flags (passing -Os
instead of -O2
), those hotspots were aligned in a lucky way by accident and the code became faster. It had nothing to do with optimizing for size: These were by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.
Oh, and one more thing. How can such hotspots arise, like the one shown in the example? How can the inlining of such a tiny function like add()
fail?
Consider this:
// add.cpp
int add(const int& x, const int& y) {
return x + y;
}
and in a separate file:
// main.cpp
int add(const int& x, const int& y);
const int LOOP_BOUND = 200000000;
__attribute__((noinline))
static int work(int xval, int yval) {
int sum(0);
for (int i=0; i int x(xval+sum);
int y(yval+sum);
int z = add(x, y);
sum += z;
}
return sum;
}
int main(int , char* argv[]) {
int result = work(*argv[1], *argv[2]);
return result;
}
and compiled as: g++ -O2 add.cpp main.cpp
.
gcc won't inline add()
!
That's all, it's that easy to unintendedly create hotspots like the one in the OP. Of course it is partly my fault: gcc is an excellent compiler. If compile the above as: g++ -O2 -flto add.cpp main.cpp
, that is, if I perform link time optimization, the code runs in 0.19s!
(Inlining is artificially disabled in the OP, hence, the code in the OP was 2x slower).
No comments:
Post a Comment