c++ - Does SSE FP unit detect 0.0 operands? -



Following up on a previous question, I thought I could optimize my algorithm by removing calculations whenever a coefficient m_a or m_b is 1.0 or 0.0. While trying to optimize the algorithm I got some curious results that I can't explain.

First analyzer run, 100k samples. The parameter values are read from a file (!):

b0=1.0 b1=-1.480838022915731 b2=1.0

a0=1.0 a1=-1.784147570544337 a2=0.854309980957510

Second analyzer run, the same 100k samples. The parameter values are read from a file (!):

b0=1.0 b1=-1.480838022915731 b2=1.0

a0=1.0 a1=-1.784147570544337 a2=0.0 <--- a2 is different!

In the figures, the numbers on the left side (grey background) represent the required CPU cycles. It is clearly visible that the second run, with parameter a2=0.0, is a lot faster.

I checked the difference between debug and release code. The release code is faster (as expected), but both debug and release code show the same unusual behaviour when the parameter a2 is modified.

Then I checked the asm code and noticed that SSE instructions are used, which is valid because I compiled with /arch:SSE2. So I disabled SSE: the resulting code doesn't use SSE anymore, and its performance no longer depends on the value of a2 (as expected).

Therefore I came to the conclusion that there is some kind of performance benefit when SSE is used and the SSE engine detects that a2 is 0.0, omitting the now-obsolete multiplication and subtraction. I have never heard of such a feature and tried to find information about it, without success.

So, does anyone have an explanation for these performance results?

For completeness, here is the relevant asm code of the release version:

    00f43ec0 mov edx,dword ptr [ebx]
    00f43ec2 movss xmm0,dword ptr [eax+edi*4]
    00f43ec7 cmp edx,dword ptr [ebx+4]
    00f43eca je $ln419+193h (0f43f9dh)
    00f43ed0 mov esi,dword ptr [ebx+4]
    00f43ed3 lea eax,[edx+68h]
    00f43ed6 lea ecx,[eax-68h]
    00f43ed9 cvtps2pd xmm0,xmm0
    00f43edc cmp ecx,esi
    00f43ede je $ln419+180h (0f43f8ah)
    00f43ee4 movss xmm1,dword ptr [eax+4]
    00f43ee9 mov ecx,dword ptr [eax]
    00f43eeb mov edx,dword ptr [eax-24h]
    00f43eee movss xmm3,dword ptr [edx+4]
    00f43ef3 cvtps2pd xmm1,xmm1
    00f43ef6 mulsd xmm1,xmm0
    00f43efa movss xmm0,dword ptr [ecx]
    00f43efe cvtps2pd xmm4,xmm0
    00f43f01 cvtps2pd xmm3,xmm3
    00f43f04 mulsd xmm3,xmm4
    00f43f08 xorps xmm2,xmm2
    00f43f0b cvtpd2ps xmm2,xmm1
    00f43f0f movss xmm1,dword ptr [ecx+4]
    00f43f14 cvtps2pd xmm4,xmm1
    00f43f17 cvtps2pd xmm2,xmm2
    00f43f1a subsd xmm2,xmm3
    00f43f1e movss xmm3,dword ptr [edx+8]
    00f43f23 mov edx,dword ptr [eax-48h]
    00f43f26 cvtps2pd xmm3,xmm3
    00f43f29 mulsd xmm3,xmm4
    00f43f2d subsd xmm2,xmm3
    00f43f31 movss xmm3,dword ptr [edx+4]
    00f43f36 cvtps2pd xmm4,xmm0
    00f43f39 cvtps2pd xmm3,xmm3
    00f43f3c mulsd xmm3,xmm4
    00f43f40 movss xmm4,dword ptr [edx]
    00f43f44 cvtps2pd xmm4,xmm4
    00f43f47 cvtpd2ps xmm2,xmm2
    00f43f4b xorps xmm5,xmm5
    00f43f4e cvtss2sd xmm5,xmm2
    00f43f52 mulsd xmm4,xmm5
    00f43f56 addsd xmm3,xmm4
    00f43f5a movss xmm4,dword ptr [edx+8]
    00f43f5f cvtps2pd xmm1,xmm1
    00f43f62 movss dword ptr [ecx+4],xmm0
    00f43f67 mov edx,dword ptr [eax]
    00f43f69 cvtps2pd xmm4,xmm4
    00f43f6c mulsd xmm4,xmm1
    00f43f70 addsd xmm3,xmm4
    00f43f74 xorps xmm1,xmm1
    00f43f77 cvtpd2ps xmm1,xmm3
    00f43f7b movss dword ptr [edx],xmm2
    00f43f7f movaps xmm0,xmm1
    00f43f82 add eax,70h
    00f43f85 jmp $ln419+0cch (0f43ed6h)
    00f43f8a movss xmm1,dword ptr [ebx+10h]
    00f43f8f cvtps2pd xmm1,xmm1
    00f43f92 mulsd xmm1,xmm0
    00f43f96 xorps xmm0,xmm0
    00f43f99 cvtpd2ps xmm0,xmm1
    00f43f9d mov eax,dword ptr [ebp-4ch]
    00f43fa0 movss dword ptr [eax+edi*4],xmm0
    00f43fa5 mov ecx,dword ptr [ebp-38h]
    00f43fa8 mov eax,dword ptr [ebp-3ch]
    00f43fab sub ecx,eax
    00f43fad inc edi
    00f43fae sar ecx,2
    00f43fb1 cmp edi,ecx
    00f43fb3 jb $ln419+0b6h (0f43ec0h)

Edit: replaced the debug asm code with the release code.

There are no early-outs for FP multiplication on SSE. It's a pipelined operation with short latency, and adding early-outs would complicate instruction retirement while providing zero performance benefit. The only instructions that commonly have data-dependent execution characteristics on modern processors are divide and square root (ignoring subnormals, which affect a wider array of instructions). This is extensively documented by both Intel and AMD, and independently by Agner Fog.

So why do you see a change in performance? The most likely explanation is that you are encountering stalls due to subnormal inputs or results; this is very common with DSP filters and delays, like the one you have. Without seeing your code and input data it's impossible to be sure that this is what's happening, but it's by far the most likely explanation. If so, you can fix the problem by setting the DAZ and FTZ bits in MXCSR.

Intel documentation: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (consult the latency tables in the appendix, and note that there is a single fixed value for mulss and mulsd).

AMD 16h instruction latencies (Excel spreadsheet): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/amd64_16h_instrlatency_1.1.xlsx

Agner Fog's instruction latency tables for both Intel and AMD: http://www.agner.org/optimize/instruction_tables.pdf

c++ performance sse computer-architecture
