sse - atan2 approximation with 11bits in mantissa on x86(with SSE2) and ARM(with vfpv4 NEON) -


i'm trying implement fast atan2(float) 11 bits accuracy in mantissa. atan2 implementation used image processing. might better implemented simd instruction(the impl targeting x86(with sse2) & arm(with vpfv4 neon)).

for now, use chebyshev polynomial approximation(https://jp.mathworks.com/help/fixedpoint/examples/calculate-fixed-point-arctangent.html).

i'm willing implement vectorized code manually. use sse2(or later) & neon wrapper library. plan or tried these implementation option

  • vectorized polynomial approximation
    • chebyshev polynomial(now implemented)
  • scalar table ( + linear interpolation)
  • vectorized table (is possible? i'm not familiar neon, don' know instructions vgather in neon)
  • vectorized cordic method (this may slow since require 12+ rotation iterations 11bits accuracy)

else idea?

the first thing want check whether compiler able vectorize atan2f (y,x) when applied array of floats. requires @ least high optimization level such -o3 , possibly specifying relaxed "fast math" mode, in errno handling, denormals , special inputs such infinities , nans largely ignored. approach, accuracy may in excess of required, may hard beat tuned library implementation respect performance.

the next thing try write simple scalar implementation sufficient accuracy, , have compiler vectorize it. typically means avoiding simple branches can converted branchless code through if-conversion. example of such code fast_atan2f() shown below. intel compiler, invoked icl /o3 /fp:precise /qvec_report=2 my_atan2f.c, vectorized successfully: my_atan2f.c(67): (col. 9) remark: loop vectorized. double checking generated code through disassembly shows fast_atan2f() has been inlined , vectorized using sse instructions of *ps flavor.

if else fails, translate code of fast_atan2() platform-specific simd intrinsics hand, should not hard do. not have sufficient experience though.

i have used simple algorithm in code, simple argument reduction produce reduced argument in [0,1], followed minimax polynomial approximation, , final step mapping result full circle. core approximation generated remez algorithm , evaluated using 2nd-order horner scheme. fused-multiply add (fma) can used available. special cases involving infinities, nans, or 0/0 not handled, sake of performance.

#include <stdio.h> #include <stdlib.h> #include <math.h>  /* maximum relative error 3.6e-5 */ float fast_atan2f (float y, float x) {     float a, r, s, t, c, q, ax, ay, mx, mn;     ax = fabsf (x);     ay = fabsf (y);     mx = fmaxf (ay, ax);     mn = fminf (ay, ax);     = mn / mx;     /* minimax polynomial approximation atan(a) on [0,1] */     s = * a;     c = s * a;     q = s * s;     r =  0.024840285f * q + 0.18681418f;     t = -0.094097948f * q - 0.33213072f;     r = r * s + t;     r = r * c + a;     /* map full circle */     if (ay > ax) r = 1.57079637f - r;     if (x <   0) r = 3.14159274f - r;     if (y <   0) r = -r;     return r; }  /* fixes via: greg rose, kiss: bit simple. http://eprint.iacr.org/2011/007 */ static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789; #define znew (z=36969*(z&0xffff)+(z>>16)) #define wnew (w=18000*(w&0xffff)+(w>>16)) #define mwc  ((znew<<16)+wnew) #define shr3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */ #define cong (jcong=69069*jcong+13579)                     /* 2^32 */ #define kiss ((mwc^cong)+shr3)  float rand_float(void) {     volatile union {         float f;         unsigned int i;     } cvt;     {         cvt.i = kiss;     } while (isnan(cvt.f) || isinf (cvt.f) || (fabsf (cvt.f) < powf (2.0f, -126)));     return cvt.f; }  int main (void) {     const int n = 10000;     const int m = 100000;     float ref, relerr, maxrelerr = 0.0f;     float arga[n], argb[n], res[n];     int i, j;      printf ("testing atan2() %d test vectors\n", n*m);      (j = 0; j < m; j++) {         (i = 0; < n; i++) {             arga[i] = rand_float();             argb[i] = rand_float();         }          // loop should vectorized         (i = 0; < n; i++) {             res[i] = fast_atan2f (argb[i], arga[i]);         }          (i = 0; < n; i++) {             ref = (float) atan2 ((double)argb[i], (double)arga[i]);             relerr = (res[i] - ref) / ref;             if ((fabsf (relerr) > maxrelerr) &&                  (fabsf (ref) >= powf (2.0f, -126))) { // result not denormal                 maxrelerr = fabsf (relerr);             }         }     };      printf ("max rel err = % 15.8e\n\n", maxrelerr);      printf ("atan2(1,0)  = % 15.8e\n", fast_atan2f(1,0));     printf ("atan2(-1,0) = % 15.8e\n", fast_atan2f(-1,0));     printf ("atan2(0,1)  = % 15.8e\n", fast_atan2f(0,1));     printf ("atan2(0,-1) = % 15.8e\n", fast_atan2f(0,-1));     return exit_success; } 

the output of program above should similar following:

testing atan2() 1000000000 test vectors max rel err =  3.53486939e-005  atan2(1,0)  =  1.57079637e+000 atan2(-1,0) = -1.57079637e+000 atan2(0,1)  =  0.00000000e+000 atan2(0,-1) =  3.14159274e+000 

Comments

Popular posts from this blog

ZeroMQ on Windows, with Qt Creator -

unity3d - Unity SceneManager.LoadScene quits application -

python - Error while using APScheduler: 'NoneType' object has no attribute 'now' -