Optimized app ?

mikey
Joined: 5 Jul 15
Posts: 10
Credit: 4,065,836
RAC: 54
Message 583 - Posted: 25 Oct 2015, 23:57:48 UTC - in response to Message 581.  

AVX2 version.

Download:
http://optos.sesef.pl/denis or https://dl.dropboxusercontent.com/u/1452459/denis/denis1.6.1_avx2.zip

You should get up to ~30% speedup, depending on CPU type; on new Intel Skylake, even more.


My laptop trashed all the units using this one; I reverted to the old one and it is fine again.

Cartoonman
Joined: 22 Oct 15
Posts: 3
Credit: 719,262
RAC: 0
Message 584 - Posted: 26 Oct 2015, 2:32:13 UTC - in response to Message 583.  

^ Same result, on Windows 7 with an i5-2400.

Ross*
Joined: 8 Jun 15
Posts: 1
Credit: 8,103,612
RAC: 0
Message 585 - Posted: 26 Oct 2015, 5:27:55 UTC - in response to Message 581.  

Hi,
They run fine on 5930s and 5960s.
Ross*

sesef
Joined: 22 Apr 15
Posts: 4
Credit: 17,166,398
RAC: 0
Message 586 - Posted: 26 Oct 2015, 6:31:42 UTC - in response to Message 583.  

AVX2 is available only on 4th-generation or newer i3/i5/i7/Xeon processors, that is, only i3/i5/i7 4xxx and later or Xeon v3 and later.
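
For anyone unsure which build their CPU can run, here is a minimal sketch of a check. It assumes a Linux host where /proc/cpuinfo lists the CPU feature flags (avx2, avx, sse4_1). The function names cpu_flags and best_build are made up for illustration and are not part of any DENIS or BOINC tool.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def best_build(flags):
    """Pick the most advanced instruction set from this thread that the CPU reports."""
    for name in ("avx2", "avx", "sse4_1"):
        if name in flags:
            return name
    return "baseline"


if __name__ == "__main__":
    # Assumption: Linux; on other systems the file is missing and we report "baseline".
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    print(best_build(cpu_flags(text)))
```

A CPU that reports avx but not avx2 (as 2nd and 3rd generation Core CPUs do) would get "avx" here, so the AVX2 build is not an option for it.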

Chilean
Joined: 9 Apr 15
Posts: 11
Credit: 3,149,460
RAC: 0
Message 587 - Posted: 26 Oct 2015, 12:04:11 UTC - in response to Message 586.  
Last modified: 26 Oct 2015, 12:06:23 UTC

Is an AVX "1" version worth it at all? I have a 3rd-generation CPU on my main laptop, which only supports AVX, not AVX2.
I'll try this app on my 4th-generation laptop, though.

Jim1348
Joined: 28 Apr 15
Posts: 28
Credit: 1,426,883
RAC: 1,393
Message 590 - Posted: 26 Oct 2015, 17:13:19 UTC
Last modified: 26 Oct 2015, 17:18:58 UTC

The AVX2 version looks very good to me. For the 600 series, I am getting about 2 minutes 32 seconds running on four cores of an i7-4771 (another core supports a GPU on Folding, and the other three cores are largely free). This is on Win7 64-bit. The core temps are a little higher, which is usual for AVX2 work, averaging about 70 C now.

Nosferatu*
Joined: 10 May 15
Posts: 1
Credit: 8,898,939
RAC: 0
Message 607 - Posted: 30 Oct 2015, 9:28:27 UTC

Is there a 32-bit Windows app available? I have 3 machines that are unable to run 64-bit Windows.

Chilean
Joined: 9 Apr 15
Posts: 11
Credit: 3,149,460
RAC: 0
Message 609 - Posted: 30 Oct 2015, 17:29:31 UTC - in response to Message 607.  

Or AVX "1" ? :)

[AF>FAH-Addict.net]toTOW
Joined: 11 Apr 15
Posts: 24
Credit: 3,651,888
RAC: 107
Message 610 - Posted: 31 Oct 2015, 15:07:15 UTC

I'm not convinced by the AVX2 application ... on my i7 4710HQ, it's actually slower than the previous application :(

The only case where it's faster is if I run it on only 4 cores instead of 8 threads. But in this case, the speed improvement is not enough to compensate for the loss of the 4 other processes. Doing 4 WUs every 9 min 30 s ends up producing less than doing 8 WUs in 12 minutes (that is with the previous application; with the new one it varies between 12 and 15 minutes).
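
To make the comparison concrete, the figures in this post can be turned into work units per hour. This is only a back-of-the-envelope sketch: the helper name wus_per_hour is invented here, and it assumes each quoted time is how long one batch of concurrently running work units takes.

```python
def wus_per_hour(concurrent_wus, minutes_per_batch):
    """Work units finished per hour when `concurrent_wus` run side by side."""
    return concurrent_wus * 60.0 / minutes_per_batch


# Figures quoted in the post above (i7 4710HQ).
scenarios = {
    "AVX2 app, 4 WUs at once, 9.5 min": wus_per_hour(4, 9.5),
    "previous app, 8 WUs at once, 12 min": wus_per_hour(8, 12.0),
    "AVX2 app, 8 WUs at once, 15 min (slowest case)": wus_per_hour(8, 15.0),
}

for label, rate in scenarios.items():
    print(f"{label}: {rate:.1f} WU/h")
```

On these numbers, 4 work units at a time comes to about 25 WU/h, against 40 WU/h for the previous application on 8 threads and 32 WU/h for the slowest 8-thread AVX2 case.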

kain
Joined: 16 Apr 15
Posts: 20
Credit: 1,592,862
RAC: 5,391
Message 611 - Posted: 31 Oct 2015, 15:10:24 UTC - in response to Message 610.  

It's normal behaviour.

[AF>FAH-Addict.net]toTOW
Joined: 11 Apr 15
Posts: 24
Credit: 3,651,888
RAC: 107
Message 613 - Posted: 31 Oct 2015, 16:21:58 UTC

So this kind of optimization shouldn't be used on HT processors?

It might be complicated to include such logic in the assignment process or in the code :(

Curious
Joined: 18 Oct 15
Posts: 3
Credit: 210,007
RAC: 0
Message 615 - Posted: 1 Nov 2015, 1:54:15 UTC - in response to Message 613.  
Last modified: 1 Nov 2015, 2:04:08 UTC

Yes, on the Asteroids@Home message boards they briefly explain why AVX isn't recommended for CPUs with Hyper-Threading; see here. Moreover, on the PrimeGrid message boards it is pointed out that simulating double the number of cores causes the chip to produce more heat. I've noticed that AVX2 is significantly slower than SSE4.1 on my HT Intel (Haswell) CPU too.

EDIT
URL added

Jim1348
Joined: 28 Apr 15
Posts: 28
Credit: 1,426,883
RAC: 1,393
Message 619 - Posted: 1 Nov 2015, 15:52:59 UTC
Last modified: 1 Nov 2015, 15:58:42 UTC

My tests show the following on an i7-4771 CPU, Win7 64-bit:
Comparison of Sesef's DENIS optimizations on the "3XP 1800" series work units:

With Sesef 1.6.1 AVX2 optimization:
DENIS running on 8 virtual cores - 9 minutes 42 seconds (CPU temp - 63 C average)
DENIS running on 4 virtual cores (other 4 cores free) - 7 minutes 7 seconds (CPU temp - 55 C average)

With Sesef 1.5.5 SSE3 optimization:
DENIS running on 8 virtual cores - 10 minutes 43 seconds (CPU temp - 60 C average)
DENIS running on 4 virtual cores (other 4 cores free) - 8 minutes 9 seconds (CPU temp - 54 C average)

So in each case, the AVX2 optimization is faster than the SSE3 optimization. I doubt that Sesef would have released it otherwise.
However, the temps can build up, especially if you have a GPU card. That could cause throttling of the CPU in some cases, thus lowering its speed.
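
The timings above can be reduced to two ratios: how much faster AVX2 is per work unit, and how much extra throughput 8 threads give over 4. This is a rough sketch that assumes each quoted time is the run time of one work unit with that many work units running at once; the names runs, seconds and throughput are illustrative only.

```python
# Run times from the post above, as (minutes, seconds) per work unit.
runs = {
    ("AVX2", 8): (9, 42),
    ("AVX2", 4): (7, 7),
    ("SSE3", 8): (10, 43),
    ("SSE3", 4): (8, 9),
}


def seconds(minutes, secs):
    return minutes * 60 + secs


def throughput(build, threads):
    """Work units per hour with `threads` work units running at once."""
    return threads * 3600 / seconds(*runs[(build, threads)])


for threads in (8, 4):
    gain = seconds(*runs[("SSE3", threads)]) / seconds(*runs[("AVX2", threads)])
    print(f"{threads} threads: AVX2 is {100 * (gain - 1):.1f}% faster per work unit")

for build in ("AVX2", "SSE3"):
    ratio = throughput(build, 8) / throughput(build, 4)
    print(f"{build}: 8 threads give {ratio:.2f}x the throughput of 4 threads")
```

Under that assumption the AVX2 build is roughly 10% to 15% faster per work unit, and running on all 8 threads yields about 1.5x the throughput of 4 threads for either build.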

Curious
Joined: 18 Oct 15
Posts: 3
Credit: 210,007
RAC: 0
Message 620 - Posted: 1 Nov 2015, 17:22:28 UTC - in response to Message 619.  
Last modified: 1 Nov 2015, 17:26:00 UTC

I assume you don't know about Crunch3r's SSE4.1 app version. This is the one I'm referring to (not sesef's SSE3 one): it's two times faster than AVX2 on my CPU when running one WU at a time without any other application (neither distributed computing nor anything else), so no throttling at all. I know it's a really simple scenario, but it lets you understand things easily.
Sure enough, on other CPUs it will perform differently than on my Haswell CPU with Hyper-Threading and a factory power limit (which I removed through Intel Extreme Tuning Utility, though), but I wrote clearly that I was referring to my particular case.

EDIT
Corrected typos

Trotador
Joined: 9 Apr 15
Posts: 11
Credit: 13,598,756
RAC: 11,149
Message 621 - Posted: 1 Nov 2015, 17:47:49 UTC

I would like to test the AVX2 application, but it seems to be Windows-only.

Is it possible to make a Linux version?

Jim1348
Joined: 28 Apr 15
Posts: 28
Credit: 1,426,883
RAC: 1,393
Message 622 - Posted: 1 Nov 2015, 18:18:29 UTC - in response to Message 620.  

I tried Crunch3r's app when it first came out; it was slightly faster than Sesef's, but as far as I recall not better than the AVX2 version. I hope that is simple enough.

Curious
Joined: 18 Oct 15
Posts: 3
Credit: 210,007
RAC: 0
Message 623 - Posted: 1 Nov 2015, 19:02:43 UTC - in response to Message 622.  

It varies from CPU to CPU: on my system (a laptop PC) it works as I said. On your system (which I assume is a desktop PC) it works as you said. If you don't believe me, I don't care, but surfing the web one finds that some people see better results with SSEx than AVXx in some projects (not only DENIS). The only way to be sure is to try it ourselves.

mm67
Joined: 12 Jul 15
Posts: 7
Credit: 43,028,399
RAC: 0
Message 624 - Posted: 1 Nov 2015, 19:15:27 UTC

On an i7-4770K the AVX2 version is clearly faster:

AVX2 : http://i826.photobucket.com/albums/zz182/mm_67/sse41_zpsqpu1afzg.jpg

SSE4.1 : http://i826.photobucket.com/albums/zz182/mm_67/avx2_zpsi4aydh50.jpg

rjs5
Joined: 3 Nov 15
Posts: 22
Credit: 1,136,105
RAC: 371
Message 632 - Posted: 3 Nov 2015, 5:12:47 UTC - in response to Message 624.  

Denis performance seems to be strongly related to the compiler version and the way the compiler handles the math library.

I built Denis in an Ubuntu 14.04 VirtualBox VM with two compilers: gcc v4.8.4 and Intel icc v16. I then ran the current Denis 64-bit application and the recompiled gcc and icc versions.

The user times were 8.872s, 7.888s and 2.996s respectively, on the VM running on my i7-5930K CPU.

The perf tool reports show that the gcc build spends the majority of its time in the power and exponential functions. The icc version seems to be about 2x to 3x faster than the gcc versions when using the same (I think) standard libm libraries. (That said, the perf listing further down shows the icc build's exp/pow/log time under denis.icc itself rather than libm-2.19.so, so icc may be linking in its own math routines.)

The gcc version seems to spend about 75% of its execution time in the libm power and exponential functions. The icc version seems to be able to eliminate most of that time.


time ./CRLP2011EPI_105_x86_64-pc-linux-gnu in

real 0m 10.958s
user 0m 8.872s
sys 0m 0.016s



rjs@rjs-VirtualBox:~/boinc/denis/denis-boinc-baseapp$ time ./denis.icc in

real 0m 5.583s
user 0m 2.996s
sys 0m 0.028s

rjs@rjs-VirtualBox:~/boinc/denis/denis-boinc-baseapp$ time ./denis.g++ in

real 0m 10.465s
user 0m 7.888s
sys 0m 0.020s
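
The three `time` reports above can be reduced to ratios with a few lines of script. This is just a sketch: parse_time and the labels in reports are names invented for the example, and the regular expression only targets the "0m 8.872s" layout shown in this post.

```python
import re

# Matches lines such as "user 0m 8.872s" from the shell's `time` output.
TIME_LINE = re.compile(r"^(real|user|sys)\s+(\d+)m\s*([\d.]+)s$")


def parse_time(text):
    """Return a dict of real/user/sys times in seconds for one `time` report."""
    result = {}
    for line in text.splitlines():
        match = TIME_LINE.match(line.strip())
        if match:
            kind, minutes, secs = match.groups()
            result[kind] = int(minutes) * 60 + float(secs)
    return result


reports = {
    "project binary": "real 0m 10.958s\nuser 0m 8.872s\nsys 0m 0.016s",
    "g++ 4.8.4 build": "real 0m 10.465s\nuser 0m 7.888s\nsys 0m 0.020s",
    "icc 16 build": "real 0m 5.583s\nuser 0m 2.996s\nsys 0m 0.028s",
}

user = {name: parse_time(text)["user"] for name, text in reports.items()}
for name, secs in user.items():
    print(f"{name}: {secs:.3f}s user, {secs / user['icc 16 build']:.2f}x the icc time")
```

On user time, the project binary takes about 2.96x and the g++ build about 2.63x as long as the icc build.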



export BDIR=/home/rjs/boinc/source/boinc
OPT="-g -O3"
# Build once with each compiler; the outputs are named denis.icc and denis.g++
for CC in icc g++; do
  $CC $OPT app.cpp -I$BDIR -I$BDIR/api -I$BDIR/lib -lboinc -lboinc_api -o denis.$CC
done


rjs@rjs-VirtualBox:~/boinc/denis/denis-boinc-baseapp$ icc -v
icc version 16.0.0 (gcc version 4.8.0 compatibility)
rjs@rjs-VirtualBox:~/boinc/denis/denis-boinc-baseapp$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.4-2ubuntu1~14.04'
gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)


Perf from the icc run "denis.icc in"

41.38% denis.icc denis.icc [.] _Z12computeRatesdPdS_S_S_
37.36% denis.icc denis.icc [.] __libm_exp_e7
9.07% denis.icc denis.icc [.] __libm_pow_e7
4.91% denis.icc denis.icc [.] main
3.71% denis.icc denis.icc [.] __libm_log_e7
1.99% denis.icc denis.icc [.] exp
0.28% denis.icc denis.icc [.] boinc_time_to_checkpoint@plt


Perf from the g++ run "denis.g++ in"

31.43% denis.g++ libm-2.19.so [.] __ieee754_pow_sse2
30.57% denis.g++ libm-2.19.so [.] __ieee754_exp_sse2
15.96% denis.g++ libm-2.19.so [.] __exp1
9.97% denis.g++ denis.g++ [.] _Z12computeRatesdPdS_S_S_
3.62% denis.g++ libm-2.19.so [.] __ieee754_log_sse2
3.21% denis.g++ libm-2.19.so [.] __GI___exp
2.18% denis.g++ libm-2.19.so [.] __pow
1.34% denis.g++ denis.g++ [.] _Z10solveModeliPdS_S_S_6CONFIGRS_i
0.35% denis.g++ libm-2.19.so [.] @plt
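
A quick way to see where the g++ build's time goes is to total the perf percentages per shared object. The sketch below does that for the lines listed above; overhead_by_object is a made-up helper name, and it assumes the column order shown here (overhead, command, shared object, "[.]", symbol).

```python
from collections import defaultdict

# The g++ perf lines from the post, copied verbatim.
PERF_GPP = """\
31.43% denis.g++ libm-2.19.so [.] __ieee754_pow_sse2
30.57% denis.g++ libm-2.19.so [.] __ieee754_exp_sse2
15.96% denis.g++ libm-2.19.so [.] __exp1
9.97% denis.g++ denis.g++ [.] _Z12computeRatesdPdS_S_S_
3.62% denis.g++ libm-2.19.so [.] __ieee754_log_sse2
3.21% denis.g++ libm-2.19.so [.] __GI___exp
2.18% denis.g++ libm-2.19.so [.] __pow
1.34% denis.g++ denis.g++ [.] _Z10solveModeliPdS_S_S_6CONFIGRS_i
0.35% denis.g++ libm-2.19.so [.] @plt
"""


def overhead_by_object(report):
    """Sum the overhead percentages per shared object (third column)."""
    totals = defaultdict(float)
    for line in report.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0].endswith("%"):
            totals[fields[2]] += float(fields[0].rstrip("%"))
    return dict(totals)


totals = overhead_by_object(PERF_GPP)
for shared_object, percent in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{shared_object}: {percent:.2f}%")
```

For the symbols listed, this attributes about 87% of the samples to libm-2.19.so and about 11% to the application's own code.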


ldd denis.icc
linux-vdso.so.1 => (0x00007fff7ffd6000)
libboinc.so.7 => /usr/lib/libboinc.so.7 (0x00007f4c7a288000)
libboinc_api.so.7 => /usr/lib/libboinc_api.so.7 (0x00007f4c7a068000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4c79d62000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f4c79a5e000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f4c79848000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4c79483000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4c7927f000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4c79061000)
/lib64/ld-linux-x86-64.so.2 (0x00007f4c7a4fd000)



ldd denis.g++
linux-vdso.so.1 => (0x00007ffc6f3bd000)
libboinc.so.7 => /usr/lib/libboinc.so.7 (0x00007fd8b6b09000)
libboinc_api.so.7 => /usr/lib/libboinc_api.so.7 (0x00007fd8b68e9000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fd8b65e5000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fd8b62df000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fd8b60c9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fd8b5d04000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fd8b5ae6000)
/lib64/ld-linux-x86-64.so.2 (0x00007fd8b6d7e000)

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 155
Credit: 644,645
RAC: 379
Message 636 - Posted: 4 Nov 2015, 10:10:05 UTC - in response to Message 632.  
Last modified: 4 Nov 2015, 10:10:27 UTC

The perf tool reports show that GCC spends a majority of time in the power and exponential functions. The icc version seems to be about 2x to 3x faster than the gcc versions when using the same (I think) standard libm libraries.


Some questions:
1) Is Intel icc free? Can the project use this compiler?
2) Do you plan to release your app version, like Sesef/Crunch3r?
3) Is it an SSE3, SSE4.1 or AVX app?
4) Do you get the same results on Windows?
5) Is this app compatible with AMD CPUs?