Friday, March 30, 2018

c++ - Why is my program slow when looping over exactly 8192 elements?

The following tests have been done with Visual C++ compiler as it is used by the default Qt Creator install (I guess with no optimization flag). When using GCC, there is no big difference between Mystical's version and my "optimized" code. So the conclusion is that compiler optimizations take care off micro optimization better than humans (me at last). I leave the rest of my answer for reference.




It's not efficient to process images this way. It's better to use single dimension arrays. Processing all pixels is the done in one loop. Random access to points could be done using:


pointer + (x + y*width)*(sizeOfOnePixel)

In this particular case, it's better to compute and cache the sum of three pixels groups horizontally because they are used three times each.


I've done some tests and I think it's worth sharing. Each result is an average of five tests.


Original code by user1615209:


8193: 4392 ms
8192: 9570 ms

Mystical's version:


8193: 2393 ms
8192: 2190 ms

Two pass using a 1D array: first pass for horizontal sums, second for vertical sum and average.
Two pass addressing with three pointers and only increments like this:


imgPointer1 = &avg1[0][0];
imgPointer2 = &avg1[0][SIZE];
imgPointer3 = &avg1[0][SIZE+SIZE];
for(i=SIZE;i resPointer[i]=(*(imgPointer1++)+*(imgPointer2++)+*(imgPointer3++))/9;
}
8193: 938 ms
8192: 974 ms

Two pass using a 1D array and addressing like this:


for(i=SIZE;i    resPointer[i]=(hsumPointer[i-SIZE]+hsumPointer[i]+hsumPointer[i+SIZE])/9;
}
8193: 932 ms
8192: 925 ms

One pass caching horizontal sums just one row ahead so they stay in cache:


// Horizontal sums for the first two lines
for(i=1;i hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}
// Rest of the computation
for(;i // Compute horizontal sum for next line
hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
// Final result
resPointer[i-SIZE]=(hsumPointer[i-SIZE-SIZE]+hsumPointer[i-SIZE]+hsumPointer[i])/9;
}
8193: 599 ms
8192: 652 ms

Conclusion:



  • No benefits of using several pointers and just increments (I thought it would have been faster)

  • Caching horizontal sums is better than computing them several time.

  • Two pass is not three times faster, two times only.

  • It's possible to achieve 3.6 times faster using both a single pass and caching an intermediary result


I'm sure it's possible to do much better.


NOTE
Please, note that I wrote this answer to target general performance issues rather than the cache problem explained in Mystical's excellent answer. At the beginning it was just pseudo code. I was asked to do tests in the comments... Here is a completely refactored version with tests.

No comments:

Post a Comment

plot explanation - Why did Peaches' mom hang on the tree? - Movies & TV

In the middle of the movie Ice Age: Continental Drift Peaches' mom asked Peaches to go to sleep. Then, she hung on the tree. This parti...