Wednesday, July 25, 2018

c++ - Why is my hand-tuned, SSE-enabled code so slow?

Long story short: I'm developing a computing-intensive image processing application in C++. It needs to calculate many variants of image warps on small blocks of pixels extracted from larger images. The program doesn't run as fast as I would like. Profiling (OProfile) showed the warping/interpolation function consumes more than 70% of the CPU time so it seemed obvious to try and optimize it.



I was using the OpenCV image processing library for the task until now:



// some parameters for the image warps (position, stretch, skew)

struct WarpParams;

void Image::get(const WarpParams ¶ms)
{
// fills matrices mapX_ and mapY_ with x and y coordinates of points to be
// inteprolated.
updateCoordMaps(params);
// perform interpolation to obtain pixels at point locations
// P(mapX_[i], mapY_[i]) based on the original data and put the
// result in pixels_. Use bicubic inteprolation.

cv::remap(image_->data(), pixels_, mapX_, mapY_, CV_INTER_CUBIC);
}


I wrote my own interpolation function and put it in a test harness to ensure correctness while I experiment and to benchmark it in relation to the old one.



My function ran very slow which was to be expected. Generally, the idea is to:




  1. Iterate over the mapX_, mapY_ coordinate maps, extract (real-valued) coordinates of the next pixel to be interpolated;


  2. Retrieve a 4x4 neighborough of pixel values (integer coordinates) from the original image surrounding the interpolated pixel;

  3. Calculate the coefficients of the convolution kernel for each of these 16 pixels;

  4. Calculate the value of the interpolated pixel as a linear combination of the 16 pixel values and the kernel coefficients.



The old function timed at 25us on my Wolfdale Core2 Duo. The new one took 587us (!). I eagerly put my wizard hat on and started hacking the code. I managed to remove all branches, omit some duplicating calculations and transform 3 nested loops into just one over the coordinate maps. This is what I came up with:



void Image::getConvolve(const WarpParams ¶ms)
{
__declspec(align(16)) static float kernelX[4], kernelY[4];


// grab pointers to coordinate map matrices and the original image
const float
*const mapX = mapX_.ptr(),
*const mapY = mapY_.ptr(),
*const img = image_->data().ptr();
// grab pointer to the output image
float *const subset = pixels_.ptr(),
x, y, xint, yint;


const ptrdiff_t imgw = image_->width();
ptrdiff_t imgoffs;

__m128 v_px, v_kernX, v_kernY, v_val;

// iterate over the coordinate matrices as linear buffers
for (size_t idx = 0; idx < pxCount; ++idx)
{
// retrieve coordinates of next pixel from precalculated maps,
// break up each into fractional and integer part

x = modf(mapX[idx], &xint);
y = modf(mapY[idx], &yint);
// obtain offset of the top left pixel from the required 4x4
// neighborhood of the current pixel in the image's
// buffer (sadly, the position will be unaligned)
imgoffs = (((ptrdiff_t)yint - 1) * imgw) + (ptrdiff_t)xint - 1;

// calculate all 4 convolution kernel values for every row and
// every column
tap4Kernel(x, kernelX);

tap4Kernel(y, kernelY);

// load the kernel values for the columns, these don't change
v_kernX = _mm_load_ps(kernelX);

// process a row of the 4x4 neighborhood
// get set of convolution kernel values for the current row
v_kernY = _mm_set_ps1(kernelY[0]);
v_px = _mm_loadu_ps(img + imgoffs); // load the pixel values
// calculate the linear combination of the pixels with kernelX

v_px = _mm_mul_ps(v_px, v_kernX);
v_px = _mm_mul_ps(v_px, v_kernY); // and kernel Y
v_val = v_px; // add result to the final value
imgoffs += imgw;
// offset points now to next row of the 4x4 neighborhood

v_kernY = _mm_set_ps1(kernelY[1]);
v_px = _mm_loadu_ps(img + imgoffs);
v_px = _mm_mul_ps(v_px, v_kernX);
v_px = _mm_mul_ps(v_px, v_kernY);

v_val = _mm_add_ps(v_val, v_px);
imgoffs += imgw;

/*... same for kernelY[2] and kernelY[3]... */

// store resulting interpolated pixel value in the subset's
// pixel matrix
subset[idx] = horizSum(v_val);
}
}


// Calculate all 4 values of the 4-tap convolution kernel for 4 neighbors
// of a pixel and store them in an array. Ugly but fast.
// The "arg" parameter is the fractional part of a pixel's coordinate, i.e.
// a number in the range <0,1)
void Image::tap4Kernel(const float arg, float *out)
{
// chaining intrinsics was slower, so this is done in separate steps
// load the argument into 4 cells of a XMM register
__m128

v_arg = _mm_set_ps1(arg),
v_coeff = _mm_set_ps(2.0f, 1.0f, 0.0f, -1.0f);

// subtract vector of [-1, 0, 1, 2] to obtain coorinates of 4 neighbors
// for kernel calculation
v_arg = _mm_sub_ps(v_arg, v_coeff);
// clear sign bits, this is equivalent to fabs() on all 4
v_coeff = _mm_set_ps1(-0.f);
v_arg = _mm_andnot_ps(v_coeff, v_arg);


// calculate values of abs(argument)^3 and ^2
__m128
v_arg2 = _mm_mul_ps(v_arg, v_arg),
v_arg3 = _mm_mul_ps(v_arg2, v_arg),
v_val, v_temp;

// calculate the 4 kernel values as
// arg^3 * A + arg^2 * B + arg * C + D, using
// (A,B,C,D) = (-0.5, 2.5, -4, 2) for the outside pixels and
// (1.5, -2.5, 0, 1) for inside

v_coeff = _mm_set_ps(-0.5f, 1.5f, 1.5f, -0.5f);
v_val = _mm_mul_ps(v_coeff, v_arg3);
v_coeff = _mm_set_ps(2.5f, -2.5f, -2.5f, 2.5f);
v_temp = _mm_mul_ps(v_coeff, v_arg2);
v_val = _mm_add_ps(v_val, v_temp);
v_coeff = _mm_set_ps(-4.0f, 0.0f, 0.0f, -4.0f),
v_temp = _mm_mul_ps(v_coeff, v_arg);
v_val = _mm_add_ps(v_val, v_temp);
v_coeff = _mm_set_ps(2.0f, 1.0f, 1.0f, 2.0f);
v_val = _mm_add_ps(v_val, v_coeff);


_mm_store_ps(out, v_val);
}


I was pleased to have managed to get the run time on this to below 40us, even before introducing SSE to the main loop, which I saved for last. I was expecting at least a 3-fold speedup but it ran only barely faster at 36us, slower than the old get() which I was trying to improve upon. Even worse was the fact that when I changed the benchmark loop to do more runs, the old function had the same mean run time, while mine stretched to over 127us, meaning it takes even longer for some extreme warp parameter values (makes sense because more warps means I need to reach for widely-dispersed pixel values from the original image to calculate the result).



I figured the reason must be the unaligned loads, but that can't be helped (I need to reach for unpredictable pixel values). I couldn't see anything more I could do in the optimizing department so I decided to look at the cv::remap() function to see how they do it. Imagine my surprise at finding it contain a mess of nested loops and plenty of branches. They also do a lot of argument verification which I didn't need to bother with. As far as I can tell (no comments in the code), SSE (with unaligned loads as well!) is only used for extracting the values from the coordinate maps and rounding them into integers, then a function is called that does the actual interpolating with regular float arithmetics.



My question is, why did I fail so miserably (why is my code so slow and why is theirs faster even though it looks like a mess) and what can I do to improve my code?




I'm not pasting the OpenCV code here because this is already too long, you can check it out at pastebin.



I tested and compiled my code in Release mode under VC++2010. The OpenCV used was a precompiled binary bundle of v2.3.1.



EDIT: The pixel values are floats in the range 0..1. Profiling showed the tap4Kernel() function wasn't relevant, most time is spent inside getConvolve().



EDIT2: I pasted the disassembly of the generated code to pastebin. This is compiled on an old Banias Celeron processor (has SSE2), but looks more or less the same.



EDIT3: After reading What Every Programmer Should Know About Memory I realized I was incorrectly assuming the OpenCV function implements more or less the same algorithm as I did, which must not be the case. For every pixel I interpolate, I need to retrieve its 4x4 neighborhood, whose pixels are non-sequentially placed inside the image buffer. I am misusing the CPU caches, and OpenCV probably doesn't. VTune profiling would seem to agree as my function has 5,800,000 memory accesses, while OpenCV does only 400,000. Their function is a mess and could probably be further optimized but it still manages to have an edge over me, probably due to some smarter approach to memory and cache usage.




UPDATE: I managed to improve the way pixel values are loaded into XMM registers. I allocate a buffer in the object which holds 16-element cells for every pixel of the image. At image load, I fill this cell buffer with pre-arranged sequences of 4x4-neighborhoods for every pixel. Not very space-efficient (image takes 16x the space), but that way, the loads are always aligned (no more _mm_loadu_ps()), and I am avoiding having to do scattered reads of pixels from the image buffer, since the required pixels are stored sequentially. To my surprise, there was hardly any improvement at all. I heard unaligned loads could be 10x slower, but clearly this is not the problem here. But by commenting out parts of the code, I found out that the modf() calls are responsible for 75% of the runtime! I'll focus on eliminating those and post an answer.

No comments:

Post a Comment

plot explanation - Why did Peaches&#39; mom hang on the tree? - Movies &amp; TV

In the middle of the movie Ice Age: Continental Drift Peaches' mom asked Peaches to go to sleep. Then, she hung on the tree. This parti...