Saturday, March 31, 2018

c - Fastest way to count bits





Possible Duplicate:
How to count the number of set bits in a 32-bit integer?







Give a unsigned char type value,count the total bits in it.What's the fastest way?
I wrote three function as below,what's the best way,and can someone come up with a faster one?(I just want the extremely fast one)



const int tbl[] =
{
#define B2(n) n, n+1, n+1, n+2
#define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
#define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)

B6(0), B6(1), B6(1), B6(2)
};

char naivecount (unsigned char val)
{
char cnt = 0;
while (val)
{
cnt += (val & 1);
val = val >> 1;

}
return cnt;
}

inline tableLookUp(int val)
{
assert(val >= 0 && val <= 255);
return tbl[val];
}


int asmCount(int val)
{
int res = 0;
asm volatile("xor %0, %0\n\t"
"begin:\n\t"
"cmp $0x0, %1\n\t"
"jle end\n\t"
"movl %1, %%ecx\n\t"
"and $0x1, %%ecx\n\t"
"addl %%ecx, %0\n\t"

"shrl %1\n\t"
"jmp begin\n\t"
"end:"
: "=r"(res)
: "r" (val));
return res;
}


EDIT:




I have test all the method,the fastest one is to use the popcntl instruction.In platform without the instruction,I will use table look-up.


Answer



If you want to code it by hand, try this:



#include 

int popcnt8(uint8_t x) {

x = (x & 0x55) + (x >> 1 & 0x55);

x = (x & 0x33) + (x >> 2 & 0x33);
x = (x & 0x0f) + (x >> 4 & 0x0f);

return x;
}


on x86, this compiles to (AT&T-syntax):



popcnt8:

movl %edi, %eax
shrb %dil
andl $85, %eax
andl $85, %edi
addl %eax, %edi
movl %edi, %eax
shrb $2, %dil
andl $51, %eax
andl $51, %edi
addl %eax, %edi

movl %edi, %eax
shrb $4, %dil
andl $15, %eax
addl %edi, %eax
movzbl %al, %eax
ret


Compare this to what gcc generates with the intrinsic:




#include 

int popcnt8_intrin(uint8_t x) { return __builtin_popcount(x); }


On x86 with SSE 4.2:



popcnt8_intrin:
movzbl %dil, %eax
popcntl %eax, %eax

ret


which is not optimal; clang generates:



popcnt8_intrin:
popcntl %edi,%eax
ret



reducing the calculation to one (!) instruction.



On x86 without SSE 4.2:



popcnt8_intrin:
subq $8, %rsp
movzbl %dil, %edi
call __popcountdi2
addq $8, %rsp
ret



gcc essentially calls its library here. Not quite optimal. clang does a little better:



popcnt8_intrin:                         # @popcnt8_intrin
movl %edi, %eax
shrl %eax
andl $85, %eax
subl %eax, %edi
movl %edi, %eax

andl $858993459, %eax # imm = 0x33333333
shrl $2, %edi
andl $858993459, %edi # imm = 0x33333333
addl %eax, %edi
movl %edi, %eax
shrl $4, %eax
addl %edi, %eax
andl $252645135, %eax # imm = 0xF0F0F0F
imull $16843009, %eax, %eax # imm = 0x1010101
shrl $24, %eax

ret


clang calculates popcnt for a whole 32 bit number. This is not optimal imho.


No comments:

Post a Comment

plot explanation - Why did Peaches&#39; mom hang on the tree? - Movies &amp; TV

In the middle of the movie Ice Age: Continental Drift Peaches' mom asked Peaches to go to sleep. Then, she hung on the tree. This parti...