I was watching How I program C - YouTube video fairly recently. I mainly program in C++, but I generally like C, and wondered what techniques people are currently using. The video has some good ideas, and I believe in the authors aim for simplicity. One idea which was interesting was by developing in C, it's much easier to write tools to parse your code to do interesting things. Parsing C++ is pretty hard. Overall though I think C++ has too many advantages - from both the language and the tooling.
Anyway at the very end of the video he talks about Carmacks sqrt function. A clever way to calculate sqrt...
static Float _ApproximateSqrt(Float f) { union { Float32 f; Int32 i; } convert; convert.f = f; convert.i = 0x5f3759df - (convert.i >> 1); const Float32 x = f * 0.5f; Float32 y = convert.f; y = y * (1.5f - (x * y * y)); y = y * (1.5f - (x * y * y)); return f * y; }
This can be faster than using the standard libraries sqrtf. For a variety of reasons I thought that was very unlikely to be true anymore, so I thought I'd check it out against a variety of solutions.
My original test function was
float sum = 0; for (Int i = 0; i < size; i++) { sum += sqrtf(src[i] + sum ); }
I'll explain why I added the + sum later. Anyway the results are
Test | Time |
---|---|
Without Sse Sqrt | 18048 |
Without Sse Carmack Sqrt | 59626 |
With Sse Sqrt | 18043 |
With Sse Carmack Sqrt | 57422 |
Tests were all performed on i7 4770.
When I first compiled I didn't have + sum - I actually had much faster times, that was because VS 2015 update 3 vectorized the loop. Later I tried to repeat the result - but it refused to do so. Looking at the assembler it produced raised some questions...
00399B34 sqrtss xmm0,dword ptr [esi+ecx*4] 00399B39 inc ecx 00399B3A addss xmm0,dword ptr [esp+10h] 00399B40 movss dword ptr [esp+10h],xmm0 00399B46 movss dword ptr [esp+14h],xmm0 00399B4C cmp ecx,400h 00399B52 jl Re4::Maths32::AutoTest+0D4h (399B34h)
Eh? What are those two movss all about? The compiler must believe that the src pointer being able to alias over the variables on the stack. The thing I find most surprising is that at some point it decided they didn't alias. Anyway by informing the compiler that there is no aliasing with...
Float32* RE4_RESTRICT src = srcArr.begin();
Produces the fastest code with it auto vectorizing (the simpler form without + sum this is) to...
00C394F0 movups xmm0,xmmword ptr [edi+ecx*4] 00C394F4 add ecx,4 00C394F7 sqrtps xmm0,xmm0 00C394FA addps xmm1,xmm0 00C394FD cmp ecx,400h 00C39503 jl Re4::Maths32::AutoTest+0D0h (0C394F0h) 00C39505 mov ecx,dword ptr [Re4::Os::SystemTimer::g_singleton+8 (0EC7644h)] 00C3950B movaps xmm2,xmm1 00C3950E movhlps xmm2,xmm1 00C39511 addps xmm2,xmm1 00C39514 mov eax,dword ptr [ecx] 00C39516 movaps xmm0,xmm2 00C39519 mov eax,dword ptr [eax+20h] 00C3951C shufps xmm0,xmm2,0F5h 00C39520 addss xmm2,xmm0
Clever stuff. That clocks in at 1707 cycles(!).
So in conclusion Carmacks clever sqrt isn't a good idea on modern x86 processors. It's 1/3 of the speed of just using sqrtf(!). Moreover if you're careful your code might be auto vectorizable and in that scenario the compiler generates code that is 33x faster(!).
Update 1: Turns out although this function is attributed to Carmack he didn't write it. There's lots more information about it on the wikipedia page.