Very long wus
Author | Message |
---|---|
Send message Joined: 9 Apr 15 Posts: 171 Credit: 1,552,856 RAC: 1 |
On my i5-5200 mobile, after 2h the WUs are at 7%. Is that normal? |
Send message Joined: 5 Oct 15 Posts: 17 Credit: 1,335,501 RAC: 0 |
You're not alone, at least. On my i5-5300 (mobile), currently being heavily used, the WU known as GD_jcarro_20160714202312000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1513.xml_0 is at 4.55% after 1 hour 30 minutes. This laptop has also crunched through 4 other WUs today, and averaged about 2 hours 30 minutes for each. Those were all "SteadyState6XX" units. |
Send message Joined: 9 Apr 15 Posts: 171 Credit: 1,552,856 RAC: 1 |
And there seems to be no checkpoint... :-( |
Send message Joined: 9 Apr 15 Posts: 171 Credit: 1,552,856 RAC: 1 |
"GD_jcarro_20160714202312000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1513.xml_0 is at 4.55% after 1 hour 30 minutes." Same here with the 8000 version. |
Send message Joined: 28 Apr 15 Posts: 18 Credit: 2,472,888 RAC: 0 |
On my 4930K, I've got a bunch of WUs, some in their 10th hour and only a third completed. Steady state 8000. |
Send message Joined: 3 Nov 15 Posts: 23 Credit: 2,254,547 RAC: 1 |
"GD_jcarro_20160714202312000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1513.xml_0 is at 4.55% after 1 hour 30 minutes." 8-) Once they are convinced that the algorithm is stable, they should be able to double or triple the performance. The Windows code I looked at uses a ton of x87 FP math (ugh). They should at least be explicitly enabling the SSE2 compiler options. The jobs spend 60% of their execution time in the "exp" function, and the bulk of that time goes to the "F2XM1" instruction, which replaces the contents of "st0" with 2^st0 - 1 ... microcode cycles like crazy. I doubt that DENIS will make the source code available this time. |
Send message Joined: 9 Apr 15 Posts: 171 Credit: 1,552,856 RAC: 1 |
"I doubt that DENIS will make the source code available this time." Why? |
Send message Joined: 28 Apr 15 Posts: 18 Credit: 2,472,888 RAC: 0 |
Okay, mine are averaging 16 hours each at this point and haven't finished. There's another thread showing somebody getting 'CPU time exceeded' with 24 hours of work. Admins, do I cancel or what? At this point, 183 CPU hours of work are up for grabs. |
Send message Joined: 3 Nov 15 Posts: 23 Credit: 2,254,547 RAC: 1 |
"I doubt that DENIS will make the source code available this time." Just my opinion ... Having multiple copies of the applications floating around adds workload that the project team has no capacity for. They are running very thin on resources already. When an "optimized" application generates a result, how do the project members using the answer ... know if the results are "correct" or not? If the computed "answer" is what they expected ... is it really correct? If the computed "answer" is not what they expected ... is it really wrong? |
Send message Joined: 9 Apr 15 Posts: 171 Credit: 1,552,856 RAC: 1 |
At the end of the calculation, the WU restarted from 0% with this message: 16/07/2016 16:17:08 | DENIS@Home | Task GD_jcarro_20160714202301000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1210.xml_1 exited with zero status but no 'finished' file. I killed this WU. Maybe the problem is this: - Unhandled Exception Record - |
Send message Joined: 9 Apr 15 Posts: 13 Credit: 9,833,556 RAC: 14 |
I'm not liking these stupidly long running tasks. Project Admin please do something to make them shorter ASAP. |
Send message Joined: 3 Nov 15 Posts: 23 Credit: 2,254,547 RAC: 1 |
"Okay, mine are averaging 16 hours each at this point and haven't finished. There's another thread showing somebody getting 'CPU time exceeded' with 24 hours of work." I am running on a Haswell i7-5930K CPU @ 3.50GHz (computer # 52542) and I am seeing the results fall into 3 general time buckets so far: 5,500, 11,000 and 24,000 seconds. I installed Windows updates and rebooted while they were running, and the application "restart" times looked rather funny. State: All (50) · In progress (27) · Validation pending (9) · Validation inconclusive (7) · Valid (4) · Invalid (3) · Error (0) http://denis.usj.es/denisathome/results.php?hostid=52542 |
Send message Joined: 25 Apr 15 Posts: 2 Credit: 57,962 RAC: 0 |
These WUs are too long and have no checkpointing! I'm stopping them all! |
Send message Joined: 9 Apr 15 Posts: 171 Credit: 1,552,856 RAC: 1 |
"When an 'optimized' application generates a result, how do the project members using the answer ... know if the results are 'correct' or not." Last time, the "combination" of admins + volunteers worked well. I don't know whether Sesef and the others have to rewrite a large part of the code to optimize it, or whether they only have to change small things without interfering with the scientific part of the project. But I do know that open source lets people spot problems, bugs, etc. in the code when the admin team is small and short on resources. |
Send message Joined: 10 Apr 15 Posts: 2 Credit: 167,274 RAC: 0 |
Hm, my WUs were also supposed to run for more than a day. But checkpointing was shown in the properties. I shut down my computer after 4.5 h of runtime. When I restarted it, all WUs jumped to 100% and were uploaded. Let's see if they will be validated. Marcus |
Send message Joined: 28 Apr 15 Posts: 18 Credit: 2,472,888 RAC: 0 |
I went ahead and let a lot of them just run to their natural completion. Got home, and the ones that finished just restarted from the beginning. They are checkpointing, they now have a much longer ETA. The longest is 628 days. I'm going to have to cancel a LOT of work before it gets back to the server. Between the ones that just finished and rolled over and the ones I'm canceling because they're in the same batch and probably about to roll over, I've wasted 12 days, 16 hours and 14 minutes of CPU time. Admins - if you are going to release long unstable work units, the very least you can do is communicate with your volunteers. This is insulting. |
Send message Joined: 30 Apr 15 Posts: 6 Credit: 551,507 RAC: 0 |
Yeah the WUs are just looping when they reach 100%. I'm cancelling mine. |
Send message Joined: 12 Jun 15 Posts: 5 Credit: 446,005 RAC: 0 |
Like others who've reported in this thread, I got some long-running 8000-series WUs, and I have now aborted them. I have stopped accepting new work from DENIS on all but 1 computer until the problems with these WUs are fixed. On the computer that is still accepting work from DENIS, I will abort any new 8000-series WUs even before they start running, but will run DENIS WUs that are not 8000-series. Before I left my computers for a break of about 9h, the 8000-series WUs were showing over 80% complete after more than 20h run time each. When I returned, they were still running and were showing only 23% and 0.07% complete, so I aborted them. Aborted after 34h elapsed time (31h CPU time) showing as only 23% complete: GD_jcarro_20160714202309000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1441.xml_1 O/S: Windows 7-x64, CPU: Q9650 @ 3.6GHz. Aborted after 25h elapsed time (23h CPU time) showing as only 0.07% complete: GD_jcarro_20160714202301000000_ThirdSimulations_SteadyState8000Schmidt98_conf_123.xml_0 O/S: Windows 7-x64, CPU: i7-3770K @ 4.0GHz, HT on. It would seem that these WUs did what boboviz reported above in Post #920: "exited with zero status but no 'finished' file", and restarted from 0%. In my experience on World Community Grid, this error happens when something prevents a WU from sending a heartbeat message to the supervising BOINC client. Across 1,525,973 completed WCG WUs, I can't recall ever seeing it with a WU that had hung internally (e.g. in an infinite loop), but it does happen if the O/S (Windows) pauses execution of a WU for too long due to some high-priority system event like waiting for slow disc activity. It's time that we heard from the scientists on this problem, but I guess they are taking a break over the weekend. But so soon after putting a new version of software and/or a new batch of WUs out here in DC-land, the scientists or techs should log in via the Net a few times over the weekend from wherever they are to check their project. Please, guys! |
Send message Joined: 5 Oct 15 Posts: 17 Credit: 1,335,501 RAC: 0 |
My laptop had an 8000-series unit run for 22 hours, and it was at 45% complete. Out of curiosity, I restarted BOINC. Sure enough, progress went back to 0% even though runtime was cumulative. This laptop can be shut down multiple times in a 24-hour period, making 8000 units a non-starter. However, another machine cranked through two 8000 units at 32 hours each. They hit 100% and reported successfully. They are now awaiting validation. (Fingers crossed!) I would still like to contribute to DENIS. Is it possible to have any of the following? 1) Select the type of work you would like via profile? (e.g. default = short, home = short, medium, long) 2) Optimizations? 3) Bonus points for working a long WU? |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
From reading the stderr.txt files in the slots folders of some of my in-progress tasks... It seems that "resuming from checkpoint" is actually "restarting from scratch", on my 8000-series tasks. Devs: Is that a bug? I'm setting No New Tasks for now, as it is nearly impossible for me to guarantee nearly 24 hours of continuous runtime for the tasks, in my heavily-hyperthreaded PC. I need "resuming from checkpoint" to be reliable. |
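
For anyone wondering what "resuming from checkpoint" should look like when it works, here is a minimal Python sketch. It is not DENIS or BOINC code: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state layout and the toy "simulation" are all made up for illustration. The idea is only that the state is written to a temporary file and renamed over the old checkpoint in one step, and that a restart reads that state back instead of beginning again at step 0.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # os.replace swaps the file in one step, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no usable checkpoint exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return {"step": 0, "total": 0.0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Toy work loop (hypothetical). stop_after simulates a shutdown in mid-run."""
    state = load_checkpoint(path)  # resume point: NOT unconditionally step 0
    executed = 0
    while state["step"] < n_steps:
        if stop_after is not None and executed >= stop_after:
            return state, executed  # interrupted: work since the last checkpoint is lost
        state["total"] += state["step"] * 0.5  # stand-in for one simulation step
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state, executed
```

With a checkpoint every 100 steps, interrupting after 250 steps and starting again should cost at most the 50 steps done since the last checkpoint, rather than the whole run. That is the behaviour the 8000-series tasks do not seem to show.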