sse - atan2 approximation with 11bits in mantissa on x86(with SSE2) and ARM(with vfpv4 NEON) -
i'm trying implement fast atan2(float) 11 bits accuracy in mantissa. atan2 implementation used image processing. might better implemented simd instruction(the impl targeting x86(with sse2) & arm(with vpfv4 neon)).
for now, use chebyshev polynomial approximation(https://jp.mathworks.com/help/fixedpoint/examples/calculate-fixed-point-arctangent.html).
i'm willing implement vectorized code manually. use sse2(or later) & neon wrapper library. plan or tried these implementation option
- vectorized polynomial approximation
- chebyshev polynomial(now implemented)
- scalar table ( + linear interpolation)
- vectorized table (is possible? i'm not familiar neon, don' know instructions vgather in neon)
- vectorized cordic method (this may slow since require 12+ rotation iterations 11bits accuracy)
else idea?
the first thing want check whether compiler able vectorize atan2f (y,x)
when applied array of float
s. requires @ least high optimization level such -o3
, possibly specifying relaxed "fast math" mode, in errno
handling, denormals , special inputs such infinities , nans largely ignored. approach, accuracy may in excess of required, may hard beat tuned library implementation respect performance.
the next thing try write simple scalar implementation sufficient accuracy, , have compiler vectorize it. typically means avoiding simple branches can converted branchless code through if-conversion. example of such code fast_atan2f()
shown below. intel compiler, invoked icl /o3 /fp:precise /qvec_report=2 my_atan2f.c
, vectorized successfully: my_atan2f.c(67): (col. 9) remark: loop vectorized.
double checking generated code through disassembly shows fast_atan2f()
has been inlined , vectorized using sse instructions of *ps
flavor.
if else fails, translate code of fast_atan2()
platform-specific simd intrinsics hand, should not hard do. not have sufficient experience though.
i have used simple algorithm in code, simple argument reduction produce reduced argument in [0,1], followed minimax polynomial approximation, , final step mapping result full circle. core approximation generated remez algorithm , evaluated using 2nd-order horner scheme. fused-multiply add (fma) can used available. special cases involving infinities, nans, or 0/0
not handled, sake of performance.
#include <stdio.h> #include <stdlib.h> #include <math.h> /* maximum relative error 3.6e-5 */ float fast_atan2f (float y, float x) { float a, r, s, t, c, q, ax, ay, mx, mn; ax = fabsf (x); ay = fabsf (y); mx = fmaxf (ay, ax); mn = fminf (ay, ax); = mn / mx; /* minimax polynomial approximation atan(a) on [0,1] */ s = * a; c = s * a; q = s * s; r = 0.024840285f * q + 0.18681418f; t = -0.094097948f * q - 0.33213072f; r = r * s + t; r = r * c + a; /* map full circle */ if (ay > ax) r = 1.57079637f - r; if (x < 0) r = 3.14159274f - r; if (y < 0) r = -r; return r; } /* fixes via: greg rose, kiss: bit simple. http://eprint.iacr.org/2011/007 */ static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789; #define znew (z=36969*(z&0xffff)+(z>>16)) #define wnew (w=18000*(w&0xffff)+(w>>16)) #define mwc ((znew<<16)+wnew) #define shr3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */ #define cong (jcong=69069*jcong+13579) /* 2^32 */ #define kiss ((mwc^cong)+shr3) float rand_float(void) { volatile union { float f; unsigned int i; } cvt; { cvt.i = kiss; } while (isnan(cvt.f) || isinf (cvt.f) || (fabsf (cvt.f) < powf (2.0f, -126))); return cvt.f; } int main (void) { const int n = 10000; const int m = 100000; float ref, relerr, maxrelerr = 0.0f; float arga[n], argb[n], res[n]; int i, j; printf ("testing atan2() %d test vectors\n", n*m); (j = 0; j < m; j++) { (i = 0; < n; i++) { arga[i] = rand_float(); argb[i] = rand_float(); } // loop should vectorized (i = 0; < n; i++) { res[i] = fast_atan2f (argb[i], arga[i]); } (i = 0; < n; i++) { ref = (float) atan2 ((double)argb[i], (double)arga[i]); relerr = (res[i] - ref) / ref; if ((fabsf (relerr) > maxrelerr) && (fabsf (ref) >= powf (2.0f, -126))) { // result not denormal maxrelerr = fabsf (relerr); } } }; printf ("max rel err = % 15.8e\n\n", maxrelerr); printf ("atan2(1,0) = % 15.8e\n", fast_atan2f(1,0)); printf ("atan2(-1,0) = % 15.8e\n", fast_atan2f(-1,0)); printf ("atan2(0,1) = % 15.8e\n", fast_atan2f(0,1)); printf ("atan2(0,-1) = % 15.8e\n", fast_atan2f(0,-1)); return exit_success; }
the output of program above should similar following:
testing atan2() 1000000000 test vectors max rel err = 3.53486939e-005 atan2(1,0) = 1.57079637e+000 atan2(-1,0) = -1.57079637e+000 atan2(0,1) = 0.00000000e+000 atan2(0,-1) = 3.14159274e+000
Comments
Post a Comment