c++ - Does SSE FP unit detect 0.0 operands? -



Following up on a previous question, I thought I could optimize my algorithm by removing calculations whenever a coefficient m_a or m_b is 1.0 or 0.0. While trying to optimize the algorithm I got some curious results that I can't explain.

First analyzer run, 100k samples. The parameter values are read from a file (!):

b0=1.0 b1=-1.480838022915731 b2=1.0

a0=1.0 a1=-1.784147570544337 a2=0.854309980957510

Second analyzer run, the same 100k samples. The parameter values are read from a file (!):

b0=1.0 b1=-1.480838022915731 b2=1.0

a0=1.0 a1=-1.784147570544337 a2=0.0 <--- a2 is different!

In the figures, the numbers on the left side (grey background) represent the required CPU cycles. It is clearly visible that the second run, with parameter a2=0.0, is a lot faster.

I checked the difference between debug and release code. The release code is faster (as expected), but both debug and release code show the same unusual behaviour when the parameter a2 is modified.

Then I checked the asm code and noticed that SSE instructions are used, which is valid because I compiled with /arch:SSE2. So I disabled SSE: the resulting code doesn't use SSE anymore, and its performance no longer depends on the value of a2 (as expected).

Therefore I came to the conclusion that there is some kind of performance benefit when SSE is used and the SSE engine detects that a2 is 0.0, omitting the now-obsolete multiplication and subtraction. I have never heard of such a feature and tried to find information about it, without success.

So, does anyone have an explanation for these performance results?

For completeness, here is the relevant asm code of the release version:

    00f43ec0 mov edx,dword ptr [ebx]
    00f43ec2 movss xmm0,dword ptr [eax+edi*4]
    00f43ec7 cmp edx,dword ptr [ebx+4]
    00f43eca je $ln419+193h (0f43f9dh)
    00f43ed0 mov esi,dword ptr [ebx+4]
    00f43ed3 lea eax,[edx+68h]
    00f43ed6 lea ecx,[eax-68h]
    00f43ed9 cvtps2pd xmm0,xmm0
    00f43edc cmp ecx,esi
    00f43ede je $ln419+180h (0f43f8ah)
    00f43ee4 movss xmm1,dword ptr [eax+4]
    00f43ee9 mov ecx,dword ptr [eax]
    00f43eeb mov edx,dword ptr [eax-24h]
    00f43eee movss xmm3,dword ptr [edx+4]
    00f43ef3 cvtps2pd xmm1,xmm1
    00f43ef6 mulsd xmm1,xmm0
    00f43efa movss xmm0,dword ptr [ecx]
    00f43efe cvtps2pd xmm4,xmm0
    00f43f01 cvtps2pd xmm3,xmm3
    00f43f04 mulsd xmm3,xmm4
    00f43f08 xorps xmm2,xmm2
    00f43f0b cvtpd2ps xmm2,xmm1
    00f43f0f movss xmm1,dword ptr [ecx+4]
    00f43f14 cvtps2pd xmm4,xmm1
    00f43f17 cvtps2pd xmm2,xmm2
    00f43f1a subsd xmm2,xmm3
    00f43f1e movss xmm3,dword ptr [edx+8]
    00f43f23 mov edx,dword ptr [eax-48h]
    00f43f26 cvtps2pd xmm3,xmm3
    00f43f29 mulsd xmm3,xmm4
    00f43f2d subsd xmm2,xmm3
    00f43f31 movss xmm3,dword ptr [edx+4]
    00f43f36 cvtps2pd xmm4,xmm0
    00f43f39 cvtps2pd xmm3,xmm3
    00f43f3c mulsd xmm3,xmm4
    00f43f40 movss xmm4,dword ptr [edx]
    00f43f44 cvtps2pd xmm4,xmm4
    00f43f47 cvtpd2ps xmm2,xmm2
    00f43f4b xorps xmm5,xmm5
    00f43f4e cvtss2sd xmm5,xmm2
    00f43f52 mulsd xmm4,xmm5
    00f43f56 addsd xmm3,xmm4
    00f43f5a movss xmm4,dword ptr [edx+8]
    00f43f5f cvtps2pd xmm1,xmm1
    00f43f62 movss dword ptr [ecx+4],xmm0
    00f43f67 mov edx,dword ptr [eax]
    00f43f69 cvtps2pd xmm4,xmm4
    00f43f6c mulsd xmm4,xmm1
    00f43f70 addsd xmm3,xmm4
    00f43f74 xorps xmm1,xmm1
    00f43f77 cvtpd2ps xmm1,xmm3
    00f43f7b movss dword ptr [edx],xmm2
    00f43f7f movaps xmm0,xmm1
    00f43f82 add eax,70h
    00f43f85 jmp $ln419+0cch (0f43ed6h)
    00f43f8a movss xmm1,dword ptr [ebx+10h]
    00f43f8f cvtps2pd xmm1,xmm1
    00f43f92 mulsd xmm1,xmm0
    00f43f96 xorps xmm0,xmm0
    00f43f99 cvtpd2ps xmm0,xmm1
    00f43f9d mov eax,dword ptr [ebp-4ch]
    00f43fa0 movss dword ptr [eax+edi*4],xmm0
    00f43fa5 mov ecx,dword ptr [ebp-38h]
    00f43fa8 mov eax,dword ptr [ebp-3ch]
    00f43fab sub ecx,eax
    00f43fad inc edi
    00f43fae sar ecx,2
    00f43fb1 cmp edi,ecx
    00f43fb3 jb $ln419+0b6h (0f43ec0h)

Edit: replaced the debug asm code with the release code.

There are no early-outs for FP multiplication on SSE. It's a pipelined operation with short latency, and adding early-outs would complicate instruction retirement while providing zero performance benefit. The only instructions that commonly have data-dependent execution characteristics on modern processors are divide and square root (ignoring subnormals, which affect a wider array of instructions). This is extensively documented by both Intel and AMD, and independently by Agner Fog.

So why do you see a change in performance? The most likely explanation is that you are encountering stalls due to subnormal inputs or results; this is very common with DSP filters and delays, like the one you have. Without seeing your code and input data it's impossible to be sure that this is what's happening, but it's by far the most likely explanation. If so, you can fix the problem by setting the DAZ and FTZ bits in MXCSR.

Intel documentation: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (consult the latency tables in the appendix, and note that there is a single fixed value for mulss and mulsd).

AMD 16h instruction latencies (Excel spreadsheet): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/amd64_16h_instrlatency_1.1.xlsx

Agner Fog's instruction latency tables for both Intel and AMD: http://www.agner.org/optimize/instruction_tables.pdf

c++ performance sse computer-architecture
