Carmacks Sqrt

Posted on: 2016-12-08

I was watching How I program C - YouTube video fairly recently. I mainly program in C++, but I generally like C, and wondered what techniques people are currently using. The video has some good ideas, and I believe in the authors aim for simplicity. One idea which was interesting was by developing in C, it's much easier to write tools to parse your code to do interesting things. Parsing C++ is pretty hard. Overall though I think C++ has too many advantages - from both the language and the tooling.

Anyway at the very end of the video he talks about Carmacks sqrt function. A clever way to calculate sqrt...

static Float _ApproximateSqrt(Float f)
{
	union
	{
		Float32 f;
		Int32 i;
	} convert;

	convert.f = f;
	convert.i = 0x5f3759df - (convert.i >> 1);

	const Float32 x = f * 0.5f;
	Float32 y = convert.f;

	y = y * (1.5f - (x * y * y));
	y = y * (1.5f - (x * y * y));
	return f * y;
}

This can be faster than using the standard libraries sqrtf. For a variety of reasons I thought that was very unlikely to be true anymore, so I thought I'd check it out against a variety of solutions.

My original test function was

float sum = 0;

for (Int i = 0; i < size; i++)
{
	sum += sqrtf(src[i] + sum );
}

I'll explain why I added the + sum later. Anyway the results are

Test	Time
Without Sse Sqrt	18048
Without Sse Carmack Sqrt	59626
With Sse Sqrt	18043
With Sse Carmack Sqrt	57422

Tests were all performed on i7 4770.

When I first compiled I didn't have + sum - I actually had much faster times, that was because VS 2015 update 3 vectorized the loop. Later I tried to repeat the result - but it refused to do so. Looking at the assembler it produced raised some questions...

00399B34  sqrtss      xmm0,dword ptr [esi+ecx*4]
00399B39  inc         ecx
00399B3A  addss       xmm0,dword ptr [esp+10h]  
00399B40  movss       dword ptr [esp+10h],xmm0  
00399B46  movss       dword ptr [esp+14h],xmm0  
00399B4C  cmp         ecx,400h  
00399B52  jl          Re4::Maths32::AutoTest+0D4h (399B34h)

Eh? What are those two movss all about? The compiler must believe that the src pointer being able to alias over the variables on the stack. The thing I find most surprising is that at some point it decided they didn't alias. Anyway by informing the compiler that there is no aliasing with...

Float32* RE4_RESTRICT src = srcArr.begin();

Produces the fastest code with it auto vectorizing (the simpler form without + sum this is) to...

00C394F0  movups      xmm0,xmmword ptr [edi+ecx*4]  
00C394F4  add         ecx,4  
00C394F7  sqrtps      xmm0,xmm0  
00C394FA  addps       xmm1,xmm0  
00C394FD  cmp         ecx,400h  
00C39503  jl          Re4::Maths32::AutoTest+0D0h (0C394F0h)  
00C39505  mov         ecx,dword ptr [Re4::Os::SystemTimer::g_singleton+8 (0EC7644h)]  
00C3950B  movaps      xmm2,xmm1  
00C3950E  movhlps     xmm2,xmm1  
00C39511  addps       xmm2,xmm1  
00C39514  mov         eax,dword ptr [ecx]  
00C39516  movaps      xmm0,xmm2  
00C39519  mov         eax,dword ptr [eax+20h]  
00C3951C  shufps      xmm0,xmm2,0F5h  
00C39520  addss       xmm2,xmm0

Clever stuff. That clocks in at 1707 cycles(!).

So in conclusion Carmacks clever sqrt isn't a good idea on modern x86 processors. It's 1/3 of the speed of just using sqrtf(!). Moreover if you're careful your code might be auto vectorizable and in that scenario the compiler generates code that is 33x faster(!).

Update 1: Turns out although this function is attributed to Carmack he didn't write it. There's lots more information about it on the wikipedia page.