
Neon/Sse Optimization of Transcendental Functions

Posted on: 2016-02-14

This was a bit of a detour. A while back I hit a weird bug on clang/iOS where the linker couldn't find the sincosf function in the libs, so to work around it I had to generate sin and cos separately on that platform. That meant forcing a call to sinf twice. Yuk.

It also always seemed inefficient to compute sin and cos for individual floats - surely SIMD could do better, since it can calculate four sins/coses in a single go. So in my pointless quest for > 10x speed increases I thought I'd give it a go.

I looked around to see what other SIMD libraries were doing and eventually ended up looking at SLEEF (http://shibatch.sourceforge.net/). It's actually quite a nicely put together SIMD library - pretty clean, although having the types embedded in the method names seemed a little overkill.

So I rewrote Sin/SinCos (for Cos I just used Sin, since cos(x) = sin(x + pi/2)) in my SIMD library, which currently targets Fpu/SSE/Arm Neon. The implementations may be a little suboptimal because I wrote the functions in the cross-platform SIMD lib. I managed to simplify/optimize the original functions by just using a shift to handle negation, and I didn't need the INF test that's incorporated into the original functions either.
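For illustration, here's roughly what that Cos-via-Sin wrapper looks like. This is a sketch under assumptions, not the library's actual code: it assumes an F4Add helper in the same style as the Sin method shown further down.

// A minimal sketch, not the library's actual code: assumes an F4Add
// helper in the same style as the Sin method shown further down.
/* static */VecF4 VecF4TranscendentalUtil::Cos(VecF4Prm in)
{
    const float kHalfPi = 1.57079632679489661923f; // pi/2
    // cos(x) = sin(x + pi/2)
    return Sin(F4Add(in, VecF4Init::Broadcast(kHalfPi)));
}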

So, what are the results?

Timings (lower is better):

Func      x86 Fpu   SSE    x86 Speedup   Arm Fpu   Arm Neon   Arm Speedup
Sin       6622      526    12x           1379      112        12.3x
SinCos    10439     1049   10x           1630      125        13x
Sin/Cos   -         -      -             -         220        7.4x

Nice. It always makes me smile when I get a > 10x speed up. Neon usually doesn't bring speedups as high as SSE does, but on this it was significantly faster. Tests were performed on an i7-4770 and an iPad 2.

The Sin/Cos line is where I call Sin and then Cos separately, as opposed to using the combined SinCos function. You'd expect SinCos to be faster, but on Sse it's actually about the same performance. On Neon it does make a big difference, so in general it makes sense to use SinCos, as you'd expect.
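For illustration, the two call patterns look something like this; the out-parameter signature for SinCos is my assumption here, not necessarily the library's exact API:

// Combined: one range reduction feeds both results (signature assumed).
VecF4 s, c;
VecF4TranscendentalUtil::SinCos(angles, s, c);

// Separate calls: the reduction work happens twice.
VecF4 s2 = VecF4TranscendentalUtil::Sin(angles);
VecF4 c2 = VecF4TranscendentalUtil::Cos(angles);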

For your amusement, here is the Sin method:

#define RE4_FLOAT32_1_PI     0.318309886183790671538f  // 1/pi
#define RE4_FLOAT32_2_PI     0.636619772367581343076f  // 2/pi

// pi/4 split into four parts of decreasing magnitude, so that q*pi can
// be subtracted in extended precision during range reduction.
#define RE4_FLOAT32_PI4_A 0.78515625f
#define RE4_FLOAT32_PI4_B 0.00024187564849853515625f
#define RE4_FLOAT32_PI4_C 3.7747668102383613586e-08f
#define RE4_FLOAT32_PI4_D 1.2816720341285448015e-12f

/* static */VecF4 VecF4TranscendentalUtil::Sin(VecF4Prm in)
{
    VecF4 d = in;
    // q = round(x / pi): which half-turn each lane falls in.
    VecI4 q = F4RoundToI4(F4Mul(d, VecF4Init::Broadcast(RE4_FLOAT32_1_PI)));
    VecF4 u = I4ToF4(q);

    // Range reduction: d = x - q*pi, with pi applied as four parts
    // (4 * PI4_A..D) so the subtraction keeps extra precision.
    d = F4MulAdd(-RE4_FLOAT32_PI4_A * 4, u, d);
    d = F4MulAdd(-RE4_FLOAT32_PI4_B * 4, u, d);
    d = F4MulAdd(-RE4_FLOAT32_PI4_C * 4, u, d);
    d = F4MulAdd(-RE4_FLOAT32_PI4_D * 4, u, d);
    
    VecF4 s = F4Mul(d, d);
    // sin(x) = (-1)^q * sin(d): XOR the low bit of q into the sign bit
    // to negate d when q is odd - the shift trick mentioned above.
    d = I4ReinterpretF4(I4Xor(F4ReinterpretI4(d), I4ShiftLeft<31>(q)));

    // Polynomial approximation of sin on the reduced range, in s = d*d.
    u = F4MulAdd(2.6083159809786593541503e-06f, s, 
        VecF4Init::Broadcast(-0.0001981069071916863322258f));
    u = F4MulAdd(u, s, VecF4Init::Broadcast(0.00833307858556509017944336f));
    u = F4MulAdd(u, s, VecF4Init::Broadcast(-0.166666597127914428710938f));

    // u = d + d * s * poly(s)  ~=  sin(d), with the sign already applied.
    u = F4MulAdd(s, F4Mul(u, d), d);

    return u;
}
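The negation line is worth a closer look. After reduction, sin(x) = (-1)^q * sin(d), so d's sign just needs flipping when q is odd. Here's a standalone scalar equivalent of that XOR/shift trick (my illustration, not library code):

#include <cstdint>
#include <cstring>

// Scalar version of the trick: shift the low bit of q into the sign-bit
// position and XOR it into the float's bits. This flips the sign exactly
// when q is odd and leaves it alone when q is even.
float FlipSignIfOdd(float d, int32_t q)
{
    uint32_t bits;
    std::memcpy(&bits, &d, sizeof bits);       // reinterpret float as bits
    bits ^= static_cast<uint32_t>(q) << 31;    // sign bit ^= (q & 1)
    std::memcpy(&d, &bits, sizeof d);
    return d;
}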

This code is using an update on the SIMD library style. It's not super pretty (operator overloading would be nice!), but it's this way because I know all compilers can optimize this style well. The prefix of each function names the type being worked on: I4, B4, F4 for integer, bool and float. There are also I2, B2 and F2 variants, and in the future, if I try out Avx, there will be I8, B8 and F8 types and functions.

You might hope that the function prefixes wouldn't be needed - that function overloading would work it out based on the argument types. That doesn't work in my system, though, because in SSE the bool type is actually the same type as the float type underneath.
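To make that concrete, here's a standalone illustration (my own typedefs, not the library's): on SSE a lane-mask "bool" vector and a float vector are both just __m128, so overloads on the two would collapse into one.

#include <xmmintrin.h>
#include <type_traits>

typedef __m128 VecF4;   // four floats
typedef __m128 VecB4;   // four lane masks (all-ones / all-zeros) - same type!

static_assert(std::is_same<VecF4, VecB4>::value,
              "on SSE the bool and float vectors share one representation");

// An overload set like F(VecF4) / F(VecB4) is therefore a single function,
// and the compiler can't pick by type - hence the F4/B4 prefixes.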