Thursday, January 3, 2019

x86 - Why is string comparison so fast in python?



I became curious to understand the internals of how string comparison works in python when I was solving the following example algorithm problem:





Given two strings, return the length of the longest common prefix




Solution 1: charByChar



My intuition told me that the optimal solution would be to start with one cursor at the beginning of both words and iterate forward until the prefixes no longer match. Something like



def charByChar(smaller, bigger):
assert len(smaller) <= len(bigger)

for p in range(len(smaller)):
if smaller[p] != bigger[p]:
return p
return len(smaller)


To simplify the code, the function assumes that the length of the first string, smaller, is always smaller than or equal to the length of the second string, bigger.



Solution 2: binarySearch




Another method is to bisect the two strings to create two prefix substrings. If the prefixes are equal, we know that the common prefix point is at least as long as the midpoint. Otherwise the common prefix point is at least no bigger than the midpoint. We can then recurse to find the prefix length.



Aka binary search.



def binarySearch(smaller, bigger):
assert len(smaller) <= len(bigger)
lo = 0
hi = len(smaller)

# binary search for prefix

while lo < hi:
# +1 for even lengths
mid = ((hi - lo + 1) // 2) + lo

if smaller[:mid] == bigger[:mid]:
# prefixes equal
lo = mid
else:
# prefixes not equal
hi = mid - 1


return lo


At first I assumed that that binarySearch would be slower because string comparison would compare all characters several times rather than just the prefix characters as in charByChar.



Surpisingly, the binarySearch turned out to be much faster after some preliminary benchmarking.



Figure A




lcp_fixed_suffix



Above shows how performance is affected as prefix length is increased. Suffix length remains constant at 50 characters.



This graph shows two things:




  1. As expected, both algorithms perform linearly worse as prefix length increases.

  2. Performance of charByChar degrades at a much faster rate.




Why is binarySearch so much better? I think it is because





  1. The string comparison in binarySearch is presumably optimized by the interpreter / CPU behind the scenes.

  2. charByChar actually creates new strings for each character accessed and this produces significant overhead.





To validate this I benchmarked the performance of comparing and slicing a string, labelled cmp and slice respectively below.



Figure B



cmp



This graph show two important things:




  1. As expected, comparing and slicing increase linearly with length.


  2. The cost of comparing and slicing increase very slowly with length relative to algorithm performance, Figure A. Note both figures go up to strings of length 1 Billion characters. Therefore, the cost of comparing 1 character 1 Billion times is much much greater than comparing 1 Billion characters once. But this still doesn't answer why ...



Cpython



In an effort to find out how the cpython interpreter optimizes string comparison I generated the byte code for the following function.



In [9]: def slice_cmp(a, b): return a[0] == b[0]

In [10]: dis.dis(slice_cmp)

0 LOAD_FAST 0 (a)
2 LOAD_CONST 1 (0)
4 BINARY_SUBSCR
6 LOAD_FAST 1 (b)
8 LOAD_CONST 1 (0)
10 BINARY_SUBSCR
12 COMPARE_OP 2 (==)
14 RETURN_VALUE



I poked around the cpython code and found the following two pieces of code but I'm not sure this is where string comparison occurs.



The question





  • Where in the cpython does string comparison occur?

  • Is there a CPU optimization? Is there a special x86 instruction which does string comparison? How can I see what assembly instructions are generated by cpython? You may assume I am using python3 latest, Intel Core i5, OS X 10.11.6.

  • Why is comparing a long string so much faster than comparing each of it's characters?








Bonus question: When is charByChar more performant?



If the prefix is sufficiently small in comparison to the length rest of the string, at some point the cost of creating substrings in charByChar becomes less than the cost of comparing the substrings in binarySearch.



To describe this relationship I delved into runtime analysis.




Runtime analysis



To simplify the below equations, let's assume that smaller and bigger are the same size and I will refer to them as s1 and s2.



charByChar



charByChar(s1, s2) = costOfOneChar * prefixLen


Where the




costOfOneChar = cmp(1) + slice(s1Len, 1) + slice(s2Len, 1)


Where cmp(1) is the cost of comparing two strings of length 1 char.



slice is the cost of accessing a char, the equivalent of charAt(i). Python has immutable strings and accessing a char actually creates a new string of length 1. slice(string_len, slice_len) is the cost of slicing a string of length string_len to a slice of size slice_len.



So





charByChar(s1, s2) = O((cmp(1) + slice(s1Len, 1)) * prefixLen)



binarySearch



binarySearch(s1, s2) = costOfHalfOfEachString * log_2(s1Len)



log_2 is the number of times to divide the strings in half until reaching a string of length 1. Where



costOfHalfOfEachString = slice(s1Len, s1Len / 2) + slice(s2Len, s1Len / 2) + cmp(s1Len / 2)


So the big-O of binarySearch will grow according to




binarySearch(s1, s2) = O((slice(s2Len, s1Len) + cmp(s1Len)) * log_2(s1Len))




Based on our previous analysis of the cost of



If we assume that costOfHalfOfEachString is approximately the costOfComparingOneChar then we can refer to them both as x.



charByChar(s1, s2) = O(x * prefixLen)
binarySearch(s1, s2) = O(x * log_2(s1Len))



If we equate them



O(charByChar(s1, s2)) = O(binarySearch(s1, s2))
x * prefixLen = x * log_2(s1Len)
prefixLen = log_2(s1Len)
2 ** prefixLen = s1Len


So O(charByChar(s1, s2)) > O(binarySearch(s1, s2) when





2 ** prefixLen = s1Len



So plugging in the above formula I regenerated tests for Figure A but with strings of total length 2 ** prefixLen expecting the performance of the two algorithms to be roughly equal.



img



However, clearly charByChar performs much better. With a bit of trial and error, the performance of the two algorithms are roughly equal when s1Len = 200 * prefixLen




img



Why is the relationship 200x?


Answer



TL:DR: a slice compare is some Python overhead + a highly-optimized memcmp (unless there's UTF-8 processing?). Ideally, use slice compares to find the first mismatch to within less than 128 bytes or something, then loop a char at a time.



Or if it's an option and the problem is important, make a modified copy of an asm-optimized memcmp that returns the position of the first difference, instead of equal/not-equal; it will run as fast as a single == of the whole strings. Python has ways to call native C / asm functions in libraries.



It's a frustrating limitation that the CPU can do this blazingly fast, but Python doesn't (AFAIK) give you access to an optimized compare loop that tells you the mismatch position instead of just equal / greater / less.







It's totally normal that interpreter overhead dominates the cost of the real work in a simple Python loop, with CPython. Building an algorithm out of optimized building blocks is worth it even if it means doing more total work. This is why NumPy is good, but looping over a matrix element-by-element is terrible. The speed difference might be something like a factor of 20 to 100, for CPython vs. a compiled C (asm) loop for comparing one byte at a time (made up numbers, but probably right to within an order of magnitude).



Comparing blocks of memory for equality is probably one of the biggest mismatches between Python loops vs. operating on a whole list / slice. It's a common problem with highly-optimized solutions (e.g. most libc implementations (including OS X) have a manually-vectorized hand-coded asm memcmp that uses SIMD to compare 16 or 32 bytes in parallel, and runs much faster than a byte-at-a-time loop in C or assembly). So there's another factor of 16 to 32 (if memory bandwidth isn't a bottleneck) multiplying the factor of 20 to 100 speed difference between Python and C loops. Or depending on how optimized your memcmp is, maybe "only" 6 or 8 bytes per cycle.



With data hot in L2 or L1d cache for medium-sized buffers, it's reasonable to expect 16 or 32 bytes per cycle for memcmp on a Haswell or later CPU. (i3/i5/i7 naming started with Nehalem; i5 alone is not sufficient to tell us much about your CPU.)



I'm not sure if either or both of your comparisons are having to process UTF-8 and check for equivalency classes or different ways to encode the same character. The worst case is if your Python char-at-a-time loop has to check for potentially-multi-byte characters but your slice compare can just use memcmp.







Writing an efficient version in Python:



We're just totally fighting against the language to get efficiency: your problem is almost exactly the same as the C standard library function memcmp, except you want the position of the first difference instead of a - / 0 / + result telling you which string is greater. The search loop is identical, it's just a difference in what the function does after finding the result.



Your binary search is not the best way to use a fast compare building block. A slice compare still has O(n) cost, not O(1), just with a much smaller constant factor. You can and should avoid re-comparing the starts of the buffers repeatedly by using slices to compare large chunks until you find a mismatch, then go back over that last chunk with a smaller chunk size.



# I don't actually know Python; consider this pseudo-code

# or leave an edit if I got this wrong :P
chunksize = min(8192, len(smaller))
# possibly round chunksize down to the next lowest power of 2?
start = 0
while start+chunksize < len(smaller):
if smaller[start:start+chunksize] == bigger[start:start+chunksize]:
start += chunksize
else:
if chunksize <= 128:
return char_at_a_time(smaller[start:start+chunksize], bigger[start:start+chunksize])

else:
chunksize /= 8 # from the same start

# TODO: verify this logic for corner cases like string length not a power of 2
# and/or a difference only in the last character: make sure it does check to the end


I chose 8192 because your CPU has a 32kiB L1d cache, so the total cache footprint of two 8k slices is 16k, half your L1d. When the loop finds a mismatch, it will re-scan the last 8kiB in 1k chunks, and these compares will loop over data that's still hot in L1d. (Note that if == found a mismatch, it probably only touched data up to that point, not the whole 8k. But HW prefetch will keep going somewhat beyond that.)



A factor of 8 should be a good balance between using large slices to localize quickly vs. not needing many passes over the same data. This is a tunable parameter of course, along with chunk size. The bigger the mismatch between Python and asm, the smaller this factor should be to reduce Python loop iterations.)




Hopefully 8k is big enough to hide the Python loop / slice overhead; hardware prefetching should still be working during the Python overhead between memcmp calls from the interpreter so we don't need the granularity to be huge. But for really big strings, if 8k doesn't saturate memory bandwidth then maybe make it 64k (your L2 cache is 256kiB; i5 does tell us that much).



How exactly is memcmp so fast:




I am running this on Intel Core i5 but I would imagine I would get the same results on most modern CPUs.




Even in C, Why is memcmp so much faster than a for loop check? memcmp is faster than a byte-at-a-time compare loop, because even C compilers aren't great at (or totally incapable of) auto-vectorizing search loops.




Even without hardware SIMD support, an optimized memcmp could check 4 or 8 bytes at a time (word size / register width) even on a simple CPU without 16-byte or 32-byte SIMD.



But most modern CPUs, and all x86-64, have SIMD instructions. SSE2 is baseline for x86-64, and available as an extension in 32-bit mode.



An SSE2 or AVX2 memcmp can use pcmpeqb / pmovmskb to compare 16 or 32 bytes in parallel. (I'm not going to go into detail about how to write memcmp in x86 asm or with C intrinsics. Google that separately, and/or look up those asm instructions in an x86 instruction-set reference. like http://felixcloutier.com/x86/index.html. See also the x86 tag wiki for asm and performance links. e.g. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? has some info about single-core memory bandwidth limitations.)



I found an old version from 2005 of Apple's x86-64 memcmp (in AT&T syntax assembly language) on their opensource web site. It could definitely be better; for large buffers it should align one source pointer and only use movdqu for the other one, allowing movdqu then pcmpeqb with a memory operand instead of 2x movdqu, even if the strings are misaligned relative to each other. xorl $0xFFFF,%eax / jnz is also not optimal on CPUs where cmp/jcc can macro fuse but xor / jcc can't.



Unrolling to check a whole 64-byte cache line at once would also hide loop overhead. (This is the same idea of a large chunk and then looping back over it when you find a hit). Glibc's AVX2-movbe version does this with vpand to combine compare results in the main large-buffer loop, with the final combine being a vptest that also sets flags from the result. (Smaller code-size but no fewer uops than vpand/vpmovmskb/cmp/jcc; but no downside and maybe lower latency to reduce branch-mispredict penalties on loop exit). Glibc does dynamic CPU dispatching at dynamic link time; it picks this version on CPUs that support it.




Hopefully Apple's memcmp is better these days; I don't see source for it at all in the most recent Libc directory, though. Hopefully they dispatch at runtime to an AVX2 version for Haswell and later CPUs.



The LLoopOverChunks loop in the version I linked would only run at 1 iteration (16 bytes from each input) per ~2.5 cycles on Haswell; 10 fused-domain uops. But that's still much faster than 1 byte per cycle for a naive C loop, or much much worse than that for a Python loop.



Glibc's L(loop_4x_vec): loop is 18 fused-domain uops, and can thus run at just slightly less than 32 bytes (from each input) per clock cycle, when data is hot in L1d cache. Otherwise it will bottleneck on L2 bandwidth. It could have been 17 uops if they hadn't used an extra instruction inside the loop decrementing a separate loop counter, and calculated an end-pointer outside the loop.






Finding instructions / hot spots in the Python interpreter's own code





How could I drill down to find the C instructions and CPU instructions that my code calls?




On Linux you could run perf record python ... then perf report -Mintel to see which functions the CPU was spending the most time in, and which instructions in those functions were the hottest. You'll get results something like I posted here: Why is float() faster than int()?. (Drill down into any function to see the actual machine instructions that ran, shown as assembly language because perf has a disassembler built in.)



For a more nuanced view that samples the call-graph on each event, see linux perf: how to interpret and find hotspots.



(When you're looking to actually optimize a program, you want to know which function calls are expensive so you can try to avoid them in the first place. Profiling for just "self" time will find hot spots, but you won't always know which different callers caused a given loop to run most of the iterations. See Mike Dunlavey's answer on that perf question.)




But for this specific case, profiling the interpreter running a slice-compare version over big strings should hopefully find the memcmp loop where I think it will be spending most of its time. (Or for the char-at-a-time version, find the interpreter code that's "hot".)



Then you can directly see what asm instructions are in the loop. From the function names, assuming your binary has any symbols, you can probably find the source. Or if you have a version of Python built with debug info , you can get to the source directly from profile info. (Not a debug build with optimization disabled, just with full symbols).


No comments:

Post a Comment

plot explanation - Why did Peaches&#39; mom hang on the tree? - Movies &amp; TV

In the middle of the movie Ice Age: Continental Drift Peaches' mom asked Peaches to go to sleep. Then, she hung on the tree. This parti...