AVX2+ benefits?


[AF>EDLS]zOU (Joined: 15 May 15 · Posts: 9 · Credit: 1,008,149 · RAC: 0)
Message 2362 - Posted: 10 Apr 2024, 20:13:17 UTC - Last modified: 10 Apr 2024, 20:17:54 UTC

I just looked at the task run times on my computers, and my i3-12100F is twice as fast as my i7-8700K.

i3: https://denis.usj.es/denisathome/show_host_detail.php?hostid=238733
i7: https://denis.usj.es/denisathome/show_host_detail.php?hostid=238736

I know the i7 is only 8th gen and the i3 is 12th gen, but that's still a big difference. Is that the AVX2+ benefit?

Jean-David Beyer (Joined: 6 Mar 23 · Posts: 78 · Credit: 2,443,839 · RAC: 0)
Message 2364 - Posted: 10 Apr 2024, 20:49:36 UTC - in response to Message 2362.

[AF>EDLS]zOU wrote: "I just looked at the task run times on my computers, and my i3-12100F is twice as fast as my i7-8700K."

Well, my Linux machine has these instruction set extensions: Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2, Intel® AVX-512.

Computer 224473
  CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
  Number of processors: 16
  Operating System: Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.9 (Ootpa) [4.18.0-513.24.1.el8_9.x86_64|libc 2.28]
  BOINC version: 7.20.2
  Memory: 125.07 GB
  Cache: 16896 KB
  Swap space: 15.62 GB
  Total disk space: 488.04 GB
  Free Disk Space: 480.51 GB
  Measured floating point speed: 5.92 billion ops/sec
  Measured integer speed: 22.71 billion ops/sec
  Average upload rate: 169.81 KB/sec
  Average download rate: 9629.98 KB/sec
  Average turnaround time: 1.23 days

Here are the three most recent results, but comparing with my other (Windows 10) machine does not make sense, as it has a much slower, smaller processor.
  CPU type: GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]
  Number of processors: 8

Task      Work unit  Sent                      Time reported              Status                   Run time (sec)  CPU time (sec)  Credit  Application
41519441  19689063   9 Apr 2024, 13:43:10 UTC  10 Apr 2024, 12:10:18 UTC  Completed and validated  3,681.32        3,663.96        42.98   Human ventricular cell models optimization v0.02 x86_64-pc-linux-gnu
41519518  19689102   9 Apr 2024, 13:43:10 UTC  9 Apr 2024, 19:16:09 UTC   Completed and validated  3,771.43        3,731.72        41.53   Human ventricular cell models optimization v0.02 x86_64-pc-linux-gnu
41519469  19689077   9 Apr 2024, 13:43:10 UTC  10 Apr 2024, 10:06:43 UTC  Completed and validated  3,660.38        3,642.97        42.75   Human ventricular cell models optimization v0.02 x86_64-pc-linux-gnu

[AF>EDLS]zOU (Joined: 15 May 15 · Posts: 9 · Credit: 1,008,149 · RAC: 0)
Message 2369 - Posted: 11 Apr 2024, 4:56:32 UTC - in response to Message 2364. Last modified: 11 Apr 2024, 4:57:04 UTC

Your Xeon has a lot more cache ;-)

Some i3 results:

Task      Work unit  Sent                      Time reported             Status                   Run time (sec)  CPU time (sec)  Credit  Application
41590933  19724764   9 Apr 2024, 16:58:50 UTC  11 Apr 2024, 4:28:41 UTC  Completed and validated  2,194.71        1,727.97        39.74   Human ventricular cell models optimization v0.02 windows_x86_64
41590930  19724762   9 Apr 2024, 16:58:50 UTC  11 Apr 2024, 4:28:30 UTC  Completed and validated  2,193.70        1,725.91        40.44   Human ventricular cell models optimization v0.02 windows_x86_64
41590808  19724701   9 Apr 2024, 16:58:50 UTC  11 Apr 2024, 4:29:39 UTC  Completed and validated  2,205.79        1,740.73        39.70   Human ventricular cell models optimization v0.02 windows_x86_64
41590782  19724688   9 Apr 2024, 16:58:50 UTC  11 Apr 2024, 4:53:37 UTC  Completed and validated  2,192.18        1,723.09        41.85   Human ventricular cell models optimization v0.02

Some i7 results:

Task      Work unit  Sent                      Time reported             Status                   Run time (sec)  CPU time (sec)  Credit  Application
41516859  19687772   9 Apr 2024, 13:34:57 UTC  11 Apr 2024, 3:56:19 UTC  Completed and validated  4,116.32        3,035.09        40.26   Human ventricular cell models optimization v0.02 windows_x86_64
41516994  19687840   9 Apr 2024, 13:34:57 UTC  11 Apr 2024, 3:27:43 UTC  Completed and validated  4,071.97        3,009.78        38.83   Human ventricular cell models optimization v0.02 windows_x86_64
41516733  19687709   9 Apr 2024, 13:34:43 UTC  11 Apr 2024, 1:40:54 UTC  Completed and validated  4,056.41        3,018.53        42.10   Human ventricular cell models optimization v0.02 windows_x86_64
41516814  19687750   9 Apr 2024, 13:34:43 UTC  11 Apr 2024, 2:15:42 UTC  Completed and validated  4,104.62        3,013.83        41.60   Human ventricular cell models optimization v0.02 windows_x86_64
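
For anyone who wants to put a number on "twice as fast", here is a small sketch that averages the results posted above. It assumes the three numeric columns are run time, CPU time and credit, in that order (the usual BOINC task list layout); the variable names are just for illustration.

```python
from statistics import mean

# Values copied from the posted results.
# ASSUMPTION: first numeric column = run time (s), second = CPU time (s).
i3_run = [2194.71, 2193.70, 2205.79, 2192.18]
i3_cpu = [1727.97, 1725.91, 1740.73, 1723.09]
i7_run = [4116.32, 4071.97, 4056.41, 4104.62]
i7_cpu = [3035.09, 3009.78, 3018.53, 3013.83]


def ratio(slow, fast):
    """How many times longer the slow host takes, on average."""
    return mean(slow) / mean(fast)


run_ratio = ratio(i7_run, i3_run)
cpu_ratio = ratio(i7_cpu, i3_cpu)

print(f"run time: the i7 takes {run_ratio:.2f}x as long as the i3")
print(f"CPU time: the i7 takes {cpu_ratio:.2f}x as long as the i3")
```

Four tasks per host is a tiny sample, so treat the ratios as a rough indication only.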

Ian&Steve C. (Joined: 19 Jun 24 · Posts: 24 · Credit: 530,370 · RAC: 0)
Message 2440 - Posted: 30 Jun 2024, 15:31:32 UTC - in response to Message 2369. Last modified: 30 Jun 2024, 16:20:41 UTC

I don't think it's a matter of cache or optimization. Even a Raspberry Pi 5 will complete tasks in about an hour.

As best I can tell from looking at various configs, performance scaling here seems to correlate with both IPC and clock speed.
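
If run time scales with IPC and clock speed, a rough model is time ≈ work / (IPC × clock). The sketch below turns that around to estimate the IPC ratio implied by two run times. The run times are the means of the i7 and i3 results posted earlier in the thread; the clock values are placeholders and not measurements, so substitute the sustained clocks of the actual hosts. The function name is made up for this example.

```python
def implied_ipc_ratio(time_slow, time_fast, clock_slow_ghz, clock_fast_ghz):
    """IPC_fast / IPC_slow under the model: time = work / (IPC * clock)."""
    return (time_slow / time_fast) * (clock_slow_ghz / clock_fast_ghz)


# Mean run times (seconds) computed from the results posted in this thread.
MEAN_RUN_I7 = 4087.33
MEAN_RUN_I3 = 2196.60

# HYPOTHETICAL clock pairs (i7 GHz, i3 GHz); placeholders, not measured values.
scenarios = [(4.0, 4.0), (4.4, 4.0), (4.0, 4.4)]

for clock_i7, clock_i3 in scenarios:
    r = implied_ipc_ratio(MEAN_RUN_I7, MEAN_RUN_I3, clock_i7, clock_i3)
    print(f"i7 @ {clock_i7} GHz, i3 @ {clock_i3} GHz -> implied IPC ratio {r:.2f}")
```

The model ignores how many tasks each host runs at once, memory speed and hyper-threading, so it is only a first-order sanity check.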

rjs5 (Joined: 3 Nov 15 · Posts: 23 · Credit: 2,320,931 · RAC: 0)
Message 2441 - Posted: 3 Jul 2024, 20:02:12 UTC - in response to Message 2362.

[AF>EDLS]zOU wrote: "I just looked at the task run times on my computers, and my i3-12100F is twice as fast as my i7-8700K. i3: https://denis.usj.es/denisathome/show_host_detail.php?hostid=238733 i7: https://denis.usj.es/denisathome/show_host_detail.php?hostid=238736 I know the i7 is only 8th gen and the i3 is 12th gen, but that's still a big difference. Is that the AVX2+ benefit?"

The biggest benefit of AVX2 code is the parallelism of memory moves and math operations. The compiler can take advantage of those benefits ONLY if it can determine that it is safe to do so.

If your CPU supports:
  387 = 80 bits = 1 floating-point operation
  SSE2 = 128 bits = 2 64-bit floating-point operations simultaneously
  AVX = 256 bits = 4 64-bit floating-point operations simultaneously
  AVX2 = 256 bits = 4 64-bit floating-point operations simultaneously (AVX2 mainly extends the 256-bit registers to integer operations)
  AVX512 = 512 bits = 8 64-bit floating-point operations simultaneously

If the code can be structured so that multiple independent FP calculations can be performed together in the same cycle, then you can ALIGN the data so the compiler knows it can generate code that performs multiple 64-bit operations simultaneously. You have to be careful how you define structures and arrange the data. You can tell the compiler to generate AVX2 code, but if the DATA cannot be (or is not) DEFINED properly, it generates non-parallel code.

The Denis binary currently does not take advantage of any of this CPU parallelism. Denis executes a SINGLE operation at a time. Its execution speed is closely related to CPU speed and CACHE sizes.

Crunchers typically run multiple WUs, but optimizers usually work on optimizing a single image. A fast single image often yields a version that runs slower when you run many WUs.

I started multiple Denis WUs on my i9-9980XE CPU (which supports AVX512) running Fedora Linux.
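
To make the point about defining and arranging data a little more concrete, here is a small sketch using Python's ctypes to compare two layouts. The record with four double fields is hypothetical and has nothing to do with the actual Denis source. It counts how many values of one field a single contiguous 256-bit (32-byte) load would cover in an array of structures versus a separate array per field.

```python
import ctypes


class CellAoS(ctypes.Structure):
    # HYPOTHETICAL record, four doubles per cell (not taken from Denis).
    _fields_ = [
        ("v", ctypes.c_double),
        ("m", ctypes.c_double),
        ("h", ctypes.c_double),
        ("j", ctypes.c_double),
    ]


def values_per_load(stride_bytes, load_bytes=32, value_bytes=8):
    """Count values of ONE field covered by one contiguous load."""
    count = 0
    offset = 0
    while offset + value_bytes <= load_bytes:
        count += 1
        offset += stride_bytes
    return count


# Array of structures: consecutive "v" values are one whole record apart.
stride_aos = ctypes.sizeof(CellAoS)
# Structure of arrays: "v" lives in its own contiguous array of doubles.
stride_soa = ctypes.sizeof(ctypes.c_double)

print("AoS stride:", stride_aos, "bytes ->", values_per_load(stride_aos), "value per 256-bit load")
print("SoA stride:", stride_soa, "bytes ->", values_per_load(stride_soa), "values per 256-bit load")
```

With the fields interleaved, a 256-bit load picks up only one useful "v"; with one array per field it picks up four, which is the kind of layout a compiler can vectorize more easily.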
I used "perf top" to quickly look into the operation of 6 Denis WUs executing simultaneously.

perf top
  38.37%  HuVeMOp_0.02_x86_64-pc-linux-gnu  [.] __ieee754_exp_avx
  17.42%  HuVeMOp_0.02_x86_64-pc-linux-gnu  [.] ORd_SS_biphasic::computeRates(double, double, double, double*)
  16.53%  HuVeMOp_0.02_x86_64-pc-linux-gnu  [.] __ieee754_pow_sse2
  13.86%  HuVeMOp_0.02_x86_64-pc-linux-gnu  [.] __exp1
   4.49%  HuVeMOp_0.02_x86_64-pc-linux-gnu  [.] __exp
   2.38%  HuVeMOp_0.02_x86_64-pc-linux-gnu  [.] __ieee754_log_avx
   0.90%  HuVeMOp_0.02_x86_64-pc-linux-gnu  [.] 0x0000000000000550

Denis spends 38.37% of its time in the function __ieee754_exp_avx, which is the AVX version of EXP.

Digging down into __ieee754_exp_avx, you see that the 38% is mainly spent in 4 SCALAR double-precision operations. Look at instructions like "vaddsd": the "v" means it uses the AVX (VEX) encoding, but the "s" means scalar, so only 1 64-bit operation is happening. The binary is only using half of the 128-bit registers.

   0.00 │      vmovapd  %xmm5,%xmm3
   1.61 │      vaddsd   %xmm1,%xmm3,%xmm3
   5.33 │      vaddsd   %xmm2,%xmm3,%xmm3
   0.01 │      vmovapd  %xmm3,%xmm1
   0.00 │      vmovapd  %xmm7,%xmm3
   3.47 │      vaddsd   %xmm1,%xmm3,%xmm3
   6.67 │      vsubsd   %xmm3,%xmm7,%xmm7
   6.26 │      vaddsd   %xmm7,%xmm1,%xmm1
   7.23 │      vmulsd   t256+0x8,%xmm1,%xmm1
  10.64 │      vaddsd   %xmm3,%xmm1,%xmm1
   2.47 │      vucomisd %xmm1,%xmm3
   1.66 │      jp       3b0
   1.77 │      jne      3b0
   0.37 │      vmovsd   0x8(%rsp),%xmm0
   0.14 │      vmulsd   %xmm3,%xmm0,%xmm0
   0.00 │1a5:  test     %bpl,%bpl

I have not looked at the Denis algorithms, structure or source code, but the Linux "perf top" and associated tools are very nice for doing most of the profiling needed to optimize code.

If you wire in any special code for optimization .... COMMENT it ... the code will tend to stay unchanged and may become a bottleneck later.
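
As a reading aid for the listing above, here is a small sketch that decodes the two-letter suffix of the SSE/AVX floating-point mnemonics it contains. It is only a heuristic for the instructions shown here, not a general x86 decoder, and the function name is made up for this example.

```python
# Suffix convention for SSE/AVX floating-point mnemonics:
# first letter: s = scalar (one value), p = packed (whole register)
# second letter: s = single precision (32-bit), d = double precision (64-bit)
SUFFIXES = {
    "ss": ("scalar", "single"),
    "sd": ("scalar", "double"),
    "ps": ("packed", "single"),
    "pd": ("packed", "double"),
}


def classify(mnemonic):
    """Return (kind, precision, uses_vex) or None if the suffix is not recognised.

    Heuristic only: it looks at the last two letters and a leading "v".
    """
    suffix = mnemonic[-2:]
    if suffix not in SUFFIXES:
        return None
    kind, precision = SUFFIXES[suffix]
    return kind, precision, mnemonic.startswith("v")


# Mnemonics taken from the disassembly in the post.
for name in ["vmovapd", "vaddsd", "vsubsd", "vmulsd", "vucomisd", "vmovsd", "jp", "jne", "test"]:
    print(f"{name:9s} -> {classify(name)}")
```

All of the arithmetic in the hot loop comes out as scalar double; only the register-to-register moves (vmovapd) are packed.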

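The perf top percentages also allow a rough upper bound on what vectorizing the math library calls could buy, using Amdahl's law. This is only a sketch: it assumes the exp/pow/log time could be sped up by the full lane count, which nothing in the thread establishes, and it leaves computeRates and everything else untouched.

```python
# Percentages copied from the perf top listing in the post above.
profile = {
    "__ieee754_exp_avx": 38.37,
    "ORd_SS_biphasic::computeRates": 17.42,
    "__ieee754_pow_sse2": 16.53,
    "__exp1": 13.86,
    "__exp": 4.49,
    "__ieee754_log_avx": 2.38,
}

# ASSUMPTION: these symbols are the math library part that could be vectorized.
MATH_SYMBOLS = ["__ieee754_exp_avx", "__ieee754_pow_sse2", "__exp1", "__exp", "__ieee754_log_avx"]


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


math_fraction = sum(profile[name] for name in MATH_SYMBOLS) / 100.0
print(f"time in math functions: {math_fraction:.2%}")

# Lanes of 64-bit doubles in 128-, 256- and 512-bit registers.
for lanes in (2, 4, 8):
    print(f"{lanes} lanes -> at most {amdahl_speedup(math_fraction, lanes):.2f}x overall")
```

Real gains would be smaller, since vector math routines rarely scale perfectly with lane count and the rest of the code would also need restructuring.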