
AvX2+ benefits ?

Message boards : Number crunching : AvX2+ benefits ?

[AF>EDLS]zOU

Joined: 15 May 15
Posts: 9
Credit: 1,008,149
RAC: 0
Message 2362 - Posted: 10 Apr 2024, 20:13:17 UTC
Last modified: 10 Apr 2024, 20:17:54 UTC

I just looked at the task run times on my computers, and my i3-12100F is twice as fast as my i7-8700K.

i3: https://denis.usj.es/denisathome/show_host_detail.php?hostid=238733
i7: https://denis.usj.es/denisathome/show_host_detail.php?hostid=238736

I know the i7 is only 8th gen and the i3 is 12th gen, but that's still a big difference. Is that the AVX2+ benefit?
Jean-David Beyer

Joined: 6 Mar 23
Posts: 40
Credit: 2,078,354
RAC: 0
Message 2364 - Posted: 10 Apr 2024, 20:49:36 UTC - in response to Message 2362.  

I just looked at the task run times on my computers, and my i3-12100F is twice as fast as my i7-8700K.


Well, my Linux machine has
Instruction Set Extensions Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2, Intel® AVX-512

Computer 224473

CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16

Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.9 (Ootpa) [4.18.0-513.24.1.el8_9.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	125.07 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	480.51 GB
Measured floating point speed 	5.92 billion ops/sec
Measured integer speed 	22.71 billion ops/sec
Average upload rate 	169.81 KB/sec
Average download rate 	9629.98 KB/sec
Average turnaround time 	1.23 days


Here are the three most recent results from that machine. Comparing them with my other (Windows 10) machine does not make sense, as it has a much slower, smaller processor.

CPU type 	GenuineIntel
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]
Number of processors 	8


41519441 	19689063 	9 Apr 2024, 13:43:10 UTC 	10 Apr 2024, 12:10:18 UTC 	Completed and validated 	3,681.32 	3,663.96 	42.98 	Human ventricular cell models optimization v0.02
x86_64-pc-linux-gnu
41519518 	19689102 	9 Apr 2024, 13:43:10 UTC 	9 Apr 2024, 19:16:09 UTC 	Completed and validated 	3,771.43 	3,731.72 	41.53 	Human ventricular cell models optimization v0.02
x86_64-pc-linux-gnu
41519469 	19689077 	9 Apr 2024, 13:43:10 UTC 	10 Apr 2024, 10:06:43 UTC 	Completed and validated 	3,660.38 	3,642.97 	42.75 	Human ventricular cell models optimization v0.02
x86_64-pc-linux-gnu

[AF>EDLS]zOU

Joined: 15 May 15
Posts: 9
Credit: 1,008,149
RAC: 0
Message 2369 - Posted: 11 Apr 2024, 4:56:32 UTC - in response to Message 2364.  
Last modified: 11 Apr 2024, 4:57:04 UTC

Your Xeon has a lot more cache ;-)

Some i3 results:
41590933	19724764	9 Apr 2024, 16:58:50 UTC	11 Apr 2024, 4:28:41 UTC	Completed and validated	2,194.71	1,727.97	39.74	Human ventricular cell models optimization v0.02
windows_x86_64
41590930	19724762	9 Apr 2024, 16:58:50 UTC	11 Apr 2024, 4:28:30 UTC	Completed and validated	2,193.70	1,725.91	40.44	Human ventricular cell models optimization v0.02
windows_x86_64
41590808	19724701	9 Apr 2024, 16:58:50 UTC	11 Apr 2024, 4:29:39 UTC	Completed and validated	2,205.79	1,740.73	39.70	Human ventricular cell models optimization v0.02
windows_x86_64
41590782	19724688	9 Apr 2024, 16:58:50 UTC	11 Apr 2024, 4:53:37 UTC	Completed and validated	2,192.18	1,723.09	41.85	Human ventricular cell models optimization v0.02


Some i7 results:

41516859	19687772	9 Apr 2024, 13:34:57 UTC	11 Apr 2024, 3:56:19 UTC	Completed and validated	4,116.32	3,035.09	40.26	Human ventricular cell models optimization v0.02
windows_x86_64
41516994	19687840	9 Apr 2024, 13:34:57 UTC	11 Apr 2024, 3:27:43 UTC	Completed and validated	4,071.97	3,009.78	38.83	Human ventricular cell models optimization v0.02
windows_x86_64
41516733	19687709	9 Apr 2024, 13:34:43 UTC	11 Apr 2024, 1:40:54 UTC	Completed and validated	4,056.41	3,018.53	42.10	Human ventricular cell models optimization v0.02
windows_x86_64
41516814	19687750	9 Apr 2024, 13:34:43 UTC	11 Apr 2024, 2:15:42 UTC	Completed and validated	4,104.62	3,013.83	41.60	Human ventricular cell models optimization v0.02
windows_x86_64
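
To put a number on "twice as fast", here is a small sketch that averages the figures from the two task lists above. It assumes the two numeric columns after the status are run time and CPU time in seconds (the usual BOINC results layout, not stated in the thread); the variable and function names are made up for illustration.

```python
from statistics import mean

# (run time, CPU time) pairs in seconds, copied from the task lists above.
# Assumption: the first number after the status is run time, the second CPU time.
i3_12100f = [(2194.71, 1727.97), (2193.70, 1725.91),
             (2205.79, 1740.73), (2192.18, 1723.09)]
i7_8700k = [(4116.32, 3035.09), (4071.97, 3009.78),
            (4056.41, 3018.53), (4104.62, 3013.83)]


def mean_times(tasks):
    """Return (mean run time, mean CPU time) for a list of (run, cpu) pairs."""
    return mean(t[0] for t in tasks), mean(t[1] for t in tasks)


i3_run, i3_cpu = mean_times(i3_12100f)
i7_run, i7_cpu = mean_times(i7_8700k)

# Ratios above 1 mean the i7 takes longer per task than the i3.
run_ratio = i7_run / i3_run
cpu_ratio = i7_cpu / i3_cpu

print(f"run time ratio i7/i3: {run_ratio:.2f}")
print(f"CPU time ratio i7/i3: {cpu_ratio:.2f}")
```

On these eight tasks the i7 takes a bit under twice as long per task as the i3, and the gap is smaller in CPU time than in run time. Four tasks per host is a small sample, so treat the ratios as rough.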
Ian&Steve C.

Joined: 19 Jun 24
Posts: 1
Credit: 172,811
RAC: 0
Message 2440 - Posted: 30 Jun 2024, 15:31:32 UTC - in response to Message 2369.  
Last modified: 30 Jun 2024, 16:20:41 UTC

don't think it's a matter of cache or optimization.

even a raspberry pi 5 will complete tasks in about an hour.

best I can tell based on looking at various configs, performance scaling here seems to correlate with both IPC and clock speed.
rjs5

Joined: 3 Nov 15
Posts: 23
Credit: 2,254,547
RAC: 0
Message 2441 - Posted: 3 Jul 2024, 20:02:12 UTC - in response to Message 2362.  

I just looked at the task run times on my computers, and my i3-12100F is twice as fast as my i7-8700K.

i3: https://denis.usj.es/denisathome/show_host_detail.php?hostid=238733
i7: https://denis.usj.es/denisathome/show_host_detail.php?hostid=238736

I know the i7 is only 8th gen and the i3 is 12th gen, but that's still a big difference. Is that the AVX2+ benefit?



The biggest benefit of AVX2 code is parallelism in memory moves and math operations. Compilers can take advantage of it ONLY if they can determine that it is safe to do so. Depending on what your CPU supports:
x87 (387) = 80 bits = 1 floating point operation at a time
SSE2 = 128 bits = 2 64-bit floating point operations simultaneously
AVX = 256 bits = 4 64-bit floating point operations simultaneously
AVX2 = 256 bits = 4 64-bit floating point operations simultaneously (same floating point width as AVX; AVX2 widens integer operations to 256 bits)
AVX512 = 512 bits = 8 64-bit floating point operations simultaneously
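
As a rough illustration of what those widths mean for 64-bit doubles, the sketch below counts lanes per register and how many packed instructions a fully vectorized loop would need. It is keyed by the architectural register width (128-bit XMM, 256-bit YMM, 512-bit ZMM); the function names are invented for this example, and it ignores everything except instruction count.

```python
import math

DOUBLE_BITS = 64  # size of one double-precision value

# Architectural vector register widths in bits.
REGISTER_BITS = {"XMM": 128, "YMM": 256, "ZMM": 512}


def lanes(register_bits):
    """Number of 64-bit doubles that fit in one vector register."""
    return register_bits // DOUBLE_BITS


def packed_ops_needed(n_values, register_bits):
    """Packed instructions needed to process n_values doubles, one op each."""
    return math.ceil(n_values / lanes(register_bits))


n = 1000
print(f"scalar: {n} instructions")
for name, bits in REGISTER_BITS.items():
    print(f"{name}: {lanes(bits)} lanes, {packed_ops_needed(n, bits)} instructions")
```

This is only the best case: it applies when the calculations are independent and the compiler (or the programmer) actually emits packed instructions.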

If the code can be structured so multiple independent FP calculations can be performed together in the same cycle, then you can ALIGN the data so the compiler knows that it can generate code to perform multiple 64-bit operations simultaneously. You have to be careful how you define structures and arrange the data. You can tell the compiler to generate AVX2 code, but if the DATA cannot be (or is not) DEFINED properly, it generates non-parallel code.

The Denis binary currently does not take advantage of any of this SIMD parallelism: it executes a SINGLE floating point operation per instruction. Its execution speed is closely related to CPU speed and CACHE sizes.

Crunchers typically run multiple WUs at once, but people optimizing code usually tune a single running instance. A build that is fastest as a single instance often turns out slower when you run many WUs simultaneously.

I started multiple Denis WU on my i9-9980XE CPU (which supports AVX512) running Fedora Linux. I used "perf top" to quickly look into the operation of 6 Denis WU executing simultaneously.

perf top

38.37% HuVeMOp_0.02_x86_64-pc-linux-gnu [.] __ieee754_exp_avx
17.42% HuVeMOp_0.02_x86_64-pc-linux-gnu [.] ORd_SS_biphasic::computeRates(double, double*, double*, double*)
16.53% HuVeMOp_0.02_x86_64-pc-linux-gnu [.] __ieee754_pow_sse2
13.86% HuVeMOp_0.02_x86_64-pc-linux-gnu [.] __exp1
4.49% HuVeMOp_0.02_x86_64-pc-linux-gnu [.] __exp
2.38% HuVeMOp_0.02_x86_64-pc-linux-gnu [.] __ieee754_log_avx
0.90% HuVeMOp_0.02_x86_64-pc-linux-gnu [.] 0x0000000000000550

Denis spends 38.37% of its time in the function __ieee754_exp_avx, which is the AVX version of exp.
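
To get a feel for what that share could be worth, here is a back-of-the-envelope Amdahl's law estimate. The speedup factors are purely hypothetical (nobody has measured a vectorized exp for Denis here), and the sketch assumes the rest of the program is unchanged.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: whole-program speedup when `fraction` of the run time
    is made `factor` times faster and the rest stays the same."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


EXP_SHARE = 0.3837  # share of time in __ieee754_exp_avx from the profile above

# Hypothetical speedups of the exp routine alone, not measurements.
for factor in (2, 4, 8):
    print(f"exp {factor}x faster -> whole task {overall_speedup(EXP_SHARE, factor):.2f}x faster")
```

Even an infinitely fast exp would cap the gain at 1 / (1 - 0.3837), roughly 1.6x, so the other hot functions in the profile would need attention as well.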


Digging down into __ieee754_exp_avx, you see that most of that 38% is spent in a few SCALAR double-precision operations. Look at instructions like "vaddsd": the "v" marks the VEX (AVX) encoding, the "s" means scalar, so only 1 64-bit operation happens per instruction, and the "d" means double precision. The binary is only using half of each 128-bit register.

 0.00 │ vmovapd %xmm5,%xmm3
 1.61 │ vaddsd %xmm1,%xmm3,%xmm3
 5.33 │ vaddsd %xmm2,%xmm3,%xmm3
 0.01 │ vmovapd %xmm3,%xmm1
 0.00 │ vmovapd %xmm7,%xmm3
 3.47 │ vaddsd %xmm1,%xmm3,%xmm3
 6.67 │ vsubsd %xmm3,%xmm7,%xmm7
 6.26 │ vaddsd %xmm7,%xmm1,%xmm1
 7.23 │ vmulsd t256+0x8,%xmm1,%xmm1
10.64 │ vaddsd %xmm3,%xmm1,%xmm1

 2.47 │ vucomisd %xmm1,%xmm3
 1.66 │ jp 3b0
 1.77 │ jne 3b0
 0.37 │ vmovsd 0x8(%rsp),%xmm0
 0.14 │ vmulsd %xmm3,%xmm0,%xmm0
 0.00 │1a5: test %bpl,%bpl

I have not looked at the Denis algorithms, structure or source code, but Linux "perf top" and its associated tools are very good at pointing out most of the optimization a piece of code needs. If you wire in any special code for optimization, COMMENT it: that code will stay unchanged and may become a bottleneck later.
