Very long wus
Author | Message |
---|---|
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
Maybe all the SteadyState ones are affected? Make sure to test SteadyState tasks. Here's a SteadyState1000 that also didn't resume from checkpoint correctly: http://denis.usj.es/denisathome/result.php?resultid=38373900 Here's a SteadyState2000 that also didn't resume from checkpoint correctly: http://denis.usj.es/denisathome/result.php?resultid=38322366 Here's a SteadyState3000 that also didn't resume from checkpoint correctly: http://denis.usj.es/denisathome/result.php?resultid=38325794 |
Send message Joined: 5 Oct 15 Posts: 17 Credit: 1,335,501 RAC: 0 |
Well, I've picked up GD_jcarro_20160714202359000000_ThirdSimulations_SteadyState8000Schmidt98_conf_916.xml_7 (Yes, _7.) Units 0-6 are a mixed crew of 1 pending valid after 147,000 seconds, 2 Errors while computing, and 4 aborted by user. _6 was actually aborted by mm67 after 43,863.05 seconds, so I'm guessing it's another self-restarted/failed checkpoint restart WU. Since I don't think this machine is going to reboot in the next 36 hours, I'll see what happens. (For what it's worth, it's running Win 7.) |
Send message Joined: 28 Apr 15 Posts: 18 Credit: 2,472,888 RAC: 0 |
And my tasks were on BOTH Windows 10 and Windows Vista (don't ask ><). |
Send message Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
The app is not doing anything special other than storing the values of the current simulation. We had to change the size of one parameter and store it as a double because, with the longest simulations, we overflowed its size. I think the problem could be related to reading the file, because when the app gets rubbish from the checkpoint files (if they are corrupted, for example), it shows an it==XXX different from zero. But if the app can't read the file, or the file doesn't exist, it starts from zero. We will run some tests using the parameters of SteadyState8000, which seems to be the most faulty, to see if we can reproduce the problem. Thanks for the info you are giving us; it is incredibly important in a project like DENIS@Home. Best regards, Joel. |
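The behaviour Joel describes (start from zero when the checkpoint is missing, but pick up a garbage iteration when it is corrupted) can be avoided by validating the file before trusting it. The following is only a minimal sketch in Python, not the DENIS app's code, which is not shown in this thread. The file layout (an iteration count followed by a CRC32) and the names `save_checkpoint` and `load_checkpoint` are assumptions made for illustration.

```python
import os
import tempfile
import zlib


def save_checkpoint(path, iteration):
    """Write the iteration count plus a CRC32, then atomically swap it into place."""
    payload = str(int(iteration)).encode("ascii")
    record = payload + b" " + str(zlib.crc32(payload)).encode("ascii")
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so a crash never leaves a half-written checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(record)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved iteration, or 0 if the file is missing or fails validation."""
    try:
        with open(path, "rb") as f:
            fields = f.read().split()
    except OSError:
        return 0  # missing or unreadable file: start from the beginning
    if len(fields) != 2:
        return 0  # wrong layout: treat as corrupted
    payload, crc = fields
    try:
        iteration = int(payload)
        stored_crc = int(crc)
    except ValueError:
        return 0  # non-numeric rubbish: treat as corrupted
    if iteration < 0 or zlib.crc32(payload) != stored_crc:
        return 0  # negative count or checksum mismatch: treat as corrupted
    return iteration
```

With this layout, a corrupted file is handled the same way as a missing one, so the task restarts from zero instead of continuing from a nonsensical iteration number.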
Send message Joined: 5 Oct 15 Posts: 17 Credit: 1,335,501 RAC: 0 |
I appreciate you looking into this and your responsiveness. Please don't take my multiple postings about errors as complaining; I am hoping to be helpful. I understand that failed experiments are part of science. I will keep my machines attached and hope that we can iron out all bugs. Then hopefully we can get more people on board and do more science! |
Send message Joined: 18 Mar 15 Posts: 284 Credit: 2,748,608 RAC: 0 |
|
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
Here's another v1.08 task that didn't function correctly. http://denis.usj.es/denisathome/result.php?resultid=38374146 It looks like it: - Checkpointed up to iteration: 1484766329 - I (presume that I) exited BOINC - Loaded from checkpoint iteration: 1484766329 - Checkpointed up to iteration: 8456372551 - Called boinc_finish(0) - Did not exit gracefully - Continued by restarting the task all over, from iteration: 0 - Checkpointed up to iteration: 1204232792 - I aborted it. I have again set "No New Tasks" for this project, since otherwise it's wasting my CPU resources. I hope you can do extra testing, to get a better handle on solving your issues. For your reference, I do use the option "Leave non-GPU tasks in memory while suspended", and generally I use Activity->Suspend, before exiting BOINC. The PC that this task was aborted on, is on Windows 10 Insider Build 14393, with BOINC installed as a service. I'm attached to 50+ projects, get work from about 10-15 of them, and yours is the only one not working. Good luck fixing it. I'll be monitoring this thread. Thanks, Jacob Klein |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
Here's another: http://denis.usj.es/denisathome/result.php?resultid=38376733 After a couple "unnecessary full restarts", I let it run for a very long time continuously. It eventually called "boinc_finish(0)", and I watched it go from 100% to 0%, and keep wasting my CPU. I then aborted it. Very frustrating. I hope you can figure it out. |
Send message Joined: 5 Oct 15 Posts: 17 Credit: 1,335,501 RAC: 0 |
Well, I've picked up GD_jcarro_20160714202359000000_ThirdSimulations_SteadyState8000Schmidt98_conf_916.xml_7 (Yes, _7.) Units 0-6 are a mixed crew of 1 pending valid after 147,000 seconds, 2 Errors while computing, and 4 aborted by user. _6 was actually aborted by mm67 after 43,863.05 seconds, so I'm guessing it's another self-restarted/failed checkpoint restart WU. Since I don't think this machine is going to reboot in the next 36 hours, I'll see what happens. Well, after 34+ hours, it reset back to 0%. This WU will be aborted. |
Send message Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
Hello, We know how frustrating it is. We have stopped new simulations on CRLP2011, and we are going to do some work on the BETA to test whether the problem is solved. We are also going to see if we can modify the credit system to give you credit for computation time and not just for results. Your work as volunteers is great, and we will find a way to reward your effort. Best regards, Joel. |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
Thanks, Joel. You are making the right call, in regards to both suspending work and granting credits. Please note that I don't care at all about the credits. I care about my CPU cycles not being wasted. Once you have something for us to test, be sure to let us know in this thread, and I'd be more than happy to unset "No New Tasks" to help test! - Jacob |
Send message Joined: 5 Oct 15 Posts: 17 Credit: 1,335,501 RAC: 0 |
I also just aborted WU GD_jcarro_20160714201506000000_ThirdSimulations_SteadyState3000Schmidt98_conf_1473.xml_5 which restarted from 0% after 20+ hours. This is the same machine where I aborted the 8000 WU above. Here are the system specs and other info from Boinc startup: 7/22/2016 7:33:04 AM | | Starting BOINC client version 7.2.47 for windows_x86_64 7/22/2016 7:33:04 AM | | log flags: file_xfer, sched_ops, task 7/22/2016 7:33:04 AM | | Libraries: libcurl/7.25.0 OpenSSL/1.0.1 zlib/1.2.6 7/22/2016 7:33:04 AM | | Data directory: C:\ProgramData\BOINC 7/22/2016 7:33:04 AM | | OpenCL: Intel GPU 0: Intel(R) HD Graphics 5500 (driver version 10.18.14.4029, device version OpenCL 2.0, 1298MB, 1298MB available, 58 GFLOPS peak) 7/22/2016 7:33:04 AM | | OpenCL CPU: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz (OpenCL driver vendor: Intel(R) Corporation, driver version 4.2.0.130, device version OpenCL 2.0 (Build 130)) 7/22/2016 7:33:04 AM | | Processor: 4 GenuineIntel Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz [Family 6 Model 61 Stepping 4] 7/22/2016 7:33:04 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes nx lm vmx smx tm2 pbe 7/22/2016 7:33:04 AM | | OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00) 7/22/2016 7:33:04 AM | | Memory: 7.69 GB physical, 15.37 GB virtual 7/22/2016 7:33:04 AM | | Disk: 238.47 GB total, 126.87 GB free 7/22/2016 7:33:04 AM | DENIS@Home | URL http://denis.usj.es/denisathome/; Computer ID 62844; resource share 0 7/22/2016 7:33:04 AM | WUProp@Home | URL http://wuprop.boinc-af.org/; Computer ID 89286; resource share 100 7/22/2016 7:33:04 AM | malariacontrol.net | URL http://www.malariacontrol.net/; Computer ID 1667973; resource share 0 7/22/2016 7:33:04 AM | World Community Grid | URL http://www.worldcommunitygrid.org/; Computer ID 3454626; resource share 0 7/22/2016 7:33:04 AM | 
World Community Grid | General prefs: from World Community Grid (last modified 04-Dec-2015 22:34:24) 7/22/2016 7:33:04 AM | World Community Grid | Computer location: school 7/22/2016 7:33:04 AM | | General prefs: using separate prefs for school 7/22/2016 7:33:04 AM | | Preferences: 7/22/2016 7:33:04 AM | | max memory usage when active: 5904.19MB 7/22/2016 7:33:04 AM | | max memory usage when idle: 7085.03MB 7/22/2016 7:33:04 AM | | max disk usage: 100.00GB 7/22/2016 7:33:04 AM | | (to change preferences, visit a project web site or select Preferences in the Manager) 7/22/2016 7:33:04 AM | | Not using a proxy In good news, this machine has successfully completed and validated two WUs in the last 24 hours: GD_jcarro_20160714201352000000_ThirdSimulations_SteadyState2000Schmidt98_conf_1329.xml_4 GD_jcarro_20160714202241000000_ThirdSimulations_SteadyState750Schmidt98_conf_697.xml_3 Not all is futile! And thank you for looking into credit for runtime. Like Jacob, points are not my main motivation for this, but it's nice to know that the project admins feel our pain. I will keep the gates open and look forward to more Beta units. |
Send message Joined: 9 Apr 15 Posts: 172 Credit: 1,552,856 RAC: 0 |
We have stopped new simulations on CRLP2011, and we are going to do some work on the BETA. Is the beta public? Can we help you? |
Send message Joined: 6 Nov 15 Posts: 16 Credit: 325,209 RAC: 0 |
Two tasks in progress for me that currently look like my run time will be around 30 hours each, even though they were much faster for wingmates. http://denis.usj.es/denisathome/result.php?resultid=38327994 http://denis.usj.es/denisathome/result.php?resultid=38327983 |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
Robert: Take a peek at the stderr.txt files in the slots folders for each of the tasks. Does it look like it started over on its own yet -- Called boinc_finish(0), then restarted all over, from iteration: 0 ? If so, then the task already looped on you. |
Send message Joined: 6 Nov 15 Posts: 16 Credit: 325,209 RAC: 0 |
Robert: The one that's run the longest (1d 03:36:13 so far) has started from iteration 0 three times, with boinc_finish(0) only for the second time. Appears to have written many checkpoints, but never even tried to restart from any of them. The other one (08:00:33 so far) had a LoadFromCP: line and many Doing CP It: lines (neither appeared for the other task); it might have started over once with this line: Doing CP It:1141721694.000000MName:CRLP2011_EPI The LoadFromCP: line and many Doing CP It: lines appeared only after this line. For example: CONFIG END LoadFromCP: it = -611089264503368225633725271185204599548698241534779211088617759421366599347021817198573372843403303358703349635669972525644057026401665643169838574573788446454390695865829987900748386399933235311086787009278688720733585694232939264300601957650218125628388594503546597887647283194435439537447669733557444010091396900420897943742956243867777497707142893985353454147556647126784928804630419873475416681648996563268945738682029765842867509909085341291140563677139145003125067411512361261955733713024933638293762040940126546395768902239151565038993116150334511023749775309493973164935565996471425726780416909810987072872304076486133699228027658089865229578043742550063257072943344237756663141195043409622254741138645881112319181211171433298383962387820598853957506290992247085541216008497881924188290271826171653038826390060724115263423275946156951734304898341207033942662147031071416793109759651689027144641709864401603466082257400428759416269122189734894272104750166415016222877086825570610036803404079643542882863903880742377279840293606542326348341144764889794154162781001238604450747264661110020840599484496307610253139574787229921209567662493176499719551929631971535322085159518695368326364382477549648796380774611865359306051107457656152802510370623230167924682405652620846035064057048719944618577960330835752166040903768372306394445324829502185902514564544879346475747900778781980402991267155132478031457961085487752
0942362471275800099082258138711914204001771932453599568925919340157672561099980018471234030290184323011182124536568585195789339121533055037884541379395543103017665986423621168653843928286983804063138055958293537749962652372790837532067922652962358211690051325939381235025568122757932355165528688971988072450778342090576020207125710955595761829836850797441693285356857911814932360859682379609480974874935850921286246110592571222216116133627790444933668553883249905913831378094019158539908961899466489238569224183509003239792185068178413230751682438124167267860558336731748539133663243651862058957234962281361100085052373855738773432972529285124458453311576088997489353933995968566904939637411159037884497920.000000 Doing CP It:-6110892645033682256337252711852045995486982415347792110886177594213665993470218171985733728434033033587033496356699725256440570264016656431698385745737884464543906958658299879007483863999332353110867870092786887207335856942329392643006019576502181256283885945035465978876472831944354395374476697335574440100913969004208979437429562438677774977071428939853534541475566471267849288046304198734754166816489965632689457386820297658428675099090853412911405636771391450031250674115123612619557337130249336382937620409401265463957689022391515650389931161503345110237497753094939731649355659964714257267804169098109870728723040764861336992280276580898652295780437425500632570729433442377566631411950434096222547411386458811123191812111714332983839623878205988539575062909922470855412160084978819241882902718261716530388263900607241152634232759461569517343048983412070339426621470310714167931097596516890271446417098644016034660822574004287594162691221897348942721047501664150162228770868255706100368034040796435428828639038807423772798402936065423263483411447648897941541627810012386044507472646611100208405994844963076102531395747872299212095676624931764997195519296319715353220851595186953683263643824775496487963807746118653593060511074576561528025103706232301679246824056526208
460350640570487199446185779603308357521660409037683723063944453248295021859025145645448793464757479007787819804029912671551324780314579610854877520942362471275800099082258138711914204001771932453599568925919340157672561099980018471234030290184323011182124536568585195789339121533055037884541379395543103017665986423621168653843928286983804063138055958293537749962652372790837532067922652962358211690051325939381235025568122757932355165528688971988072450778342090576020207125710955595761829836850797441693285356857911814932360859682379609480974874935850921286246110592571222216116133627790444933668553883249905913831378094019158539908961899466489238569224183509003239792185068178413230751682438124167267860558336731748539133663243651862058957234962281361100085052373855738773432972529285124458453311576088997489353933995968566904939637411159037884497920.000000 Doing CP It:-61108926450336822563372527118520459954869824153477921108861775942136659934702181719857337284340330335870334963566997252564405702640166564316983857457378844645439069586582998790074838639993323531108678700927868872073358569423293926430060195765021812562838859450354659788764728319443543953744766973355744401009139690042089794374295624386777749770714289398535345414755664712678492880463041987347541668164899656326894573868202976584286750990908534129114056367713914500312506741151236126195573371302493363829376204094012654639576890223915156503899311615033451102374977530949397316493556599647142572678041690981098707287230407648613369922802765808986522957804374255006325707294334423775666314119504340962225474113864588111231918121117143329838396238782059885395750629099224708554121600849788192418829027182617165303882639006072411526342327594615695173430489834120703394266214703107141679310975965168902714464170986440160346608225740042875941626912218973489427210475016641501622287708682557061003680340407964354288286390388074237727984029360654232634834114476488979415416278100123860445074726466111002084059948449630761025313957478722
99212095676624931764997195519296319715353220851595186953683263643824775496487963807746118653593060511074576561528025103706232301679246824056526208460350640570487199446185779603308357521660409037683723063944453248295021859025145645448793464757479007787819804029912671551324780314579610854877520942362471275800099082258138711914204001771932453599568925919340157672561099980018471234030290184323011182124536568585195789339121533055037884541379395543103017665986423621168653843928286983804063138055958293537749962652372790837532067922652962358211690051325939381235025568122757932355165528688971988072450778342090576020207125710955595761829836850797441693285356857911814932360859682379609480974874935850921286246110592571222216116133627790444933668553883249905913831378094019158539908961899466489238569224183509003239792185068178413230751682438124167267860558336731748539133663243651862058957234962281361100085052373855738773432972529285124458453311576088997489353933995968566904939637411159037884497920.000000 I'm unable to tell whether those lines indicate valid iteration numbers or not. Neither task has anything I recognize as a checkpoint file. |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
:) That one with the large negative iteration numbers looks like a v1.07 task. That version had a problem where the iteration number was stored in a variable that was too small (hence the large negative numbers). You need to cancel your v1.07 tasks and get some v1.08 tasks. But be warned: when you get the v1.08 tasks, I'm betting you'll encounter the same issues some of us are seeing: 1) not resuming from checkpoints, and 2) looping instead of completing. I've set "No New Tasks". The devs are trying to solve those problems, which are listed throughout this thread. |
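To illustrate how a counter stored in a variable that is too small can turn negative, here is a minimal Python sketch of signed wraparound. The 32-bit width is an assumption chosen only for illustration; the thread does not say which type v1.07 actually used, and the values printed in the log above are far larger than any 32-bit number. The helper name `wrap_int32` is hypothetical.

```python
def wrap_int32(n):
    """Reduce n to the value a signed 32-bit integer would hold (two's complement)."""
    return ((n + 2**31) % 2**32) - 2**31


# An iteration count reported earlier in this thread that still fits in 32 bits.
small = 1484766329
# A later count from the same report that no longer fits (the 32-bit maximum is 2147483647).
large = 8456372551

print(wrap_int32(small))  # unchanged
print(wrap_int32(large))  # wraps around to a negative number
```

Storing the count in a wider type, as the project did by switching to a double, avoids the wraparound for these magnitudes.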
Send message Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
We have stopped new simulations on CRLP2011, and we are going to do some work on the BETA. To help us with the beta, you have to check "Run test applications?" in "DENIS@Home Preferences" in your profile. We use this separate app to launch a few simulations with strange parameters or new features that we want to test before adding them to the "stable/official" DENIS simulator. Best regards, Joel. |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
I have changed my web setting, to allow Beta, so I can help test. It downloaded some "Beta v1.03" tasks. Is this a new executable (with potential fixes), or an old one (from a prior beta test)? Also, you should consider putting release notes in the News section, for each public and each beta release. Edit: Checkpointing still is not working, on the Beta, per below. Dang. Doing CP It:8630444.000000 Doing CP It:17270947.000000 Doing CP It:25836125.000000 Doing CP It:34510980.000000 ... restarted BOINC ... Checkpoint file not found Doing CP It:8609762.000000 |
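For anyone who wants to check their own stderr.txt for the same symptom, here is a rough Python sketch that scans the text for `Doing CP It:` lines and reports every place where the checkpointed iteration goes backwards. It assumes the log format matches the excerpt quoted in the post above; the function name `find_restarts` is made up for this example.

```python
import re

# Matches lines such as "Doing CP It:8630444.000000" and captures the number.
CP_LINE = re.compile(r"Doing CP It:\s*(-?\d+(?:\.\d+)?)")


def find_restarts(stderr_text):
    """Return (previous, current) pairs where the checkpointed iteration decreased."""
    iterations = [float(value) for value in CP_LINE.findall(stderr_text)]
    return [
        (prev, cur)
        for prev, cur in zip(iterations, iterations[1:])
        if cur < prev
    ]


sample = (
    "Doing CP It:8630444.000000\n"
    "Doing CP It:17270947.000000\n"
    "Doing CP It:25836125.000000\n"
    "Doing CP It:34510980.000000\n"
    "Checkpoint file not found\n"
    "Doing CP It:8609762.000000\n"
)

for prev, cur in find_restarts(sample):
    print(f"Iteration dropped from {prev:.0f} to {cur:.0f}: task restarted")
```

Any pair it reports means the task began counting again from a lower iteration, which is what a failed checkpoint resume or a full restart looks like in the log.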
Send message Joined: 23 Dec 15 Posts: 6 Credit: 105,476 RAC: 0 |
Beta v1.03 tasks: 20h runtime, not bad... |