Very long wus
Author | Message |
---|---|
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
I just watched a task work its way up to 99.999% progress (with 0:00:01 remaining), then it went to 0% progress (with 1380+ days remaining). So, in addition to "resuming from checkpoint" possibly not working, I think "exiting normally and gracefully at 100%" is also broken. This is enough for me to set "No New Tasks" for all my PCs, and abort any in-progress DENIS work. In general, the aborted tasks had each accumulated about 42 hours of run-time before I aborted them. :( DEVS: PLEASE figure these problems out, as you are wasting resources. Thanks, Jacob |
Send message Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
Hello everyone, We have found a bug in the long tasks: they overflow the capacity of certain variables in the code. (We have increased the length and the number of variables to modify in order to get better results.) We are modifying the code to avoid it. Version v1.07 solves the first overflow, which occurred at initialization, but this new one occurs when recovering from a checkpoint. We will upgrade the application as fast as we can. Best regards, Joel. |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
Thanks, Joel. I hope you get it sorted. I'm subscribed to this thread, so if you drop a note in this thread, when things are working again, I'll be happy to unset "No New Tasks". - Jacob |
Send message Joined: 15 Apr 15 Posts: 1 Credit: 4,095,591 RAC: 0 |
so is it better to abort all the WUs in the queue ??? |
Send message Joined: 9 Apr 15 Posts: 172 Credit: 1,552,856 RAC: 0 |
Quote: "so is it better to abort all the WUs in the queue ???" The "8000 series" for sure. But I hope they validate the "750 series" I'm crunching... |
Send message Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
Hi! We have uploaded a new version of CRLP2011 that has that bug fixed. Best regards, Joel |
Send message Joined: 3 Nov 15 Posts: 23 Credit: 2,254,547 RAC: 0 |
Hi! Joel, It would be helpful for you to give information with these "new application" messages. Which version did you "upload"? I know what version I am running. Which bug? What should crunchers do with the JOBS in PROGRESS? |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
Joel: Are you sure resuming-from-checkpoint is working? It seems like it's not! Check out this stderr.txt, from an in-progress task using the 1.08 app, where ... after it was about 80% done, I suspended all tasks, then closed BOINC, then reopened BOINC, then resumed all tasks, and now it looks like it's starting over from 0%. This stderr.txt makes it sound like it doesn't resume from checkpoint at all :( Can you explain please? MName:CRLP2011_EPI MID:6 OpT:3000000.000000 DT:0.002000 OutFreq:50 InT:2999000.000000 NumConstToChange:20 NumStatesToPrint:2 NumAlgToPrint:0 CC ID:16 NAME: G_Na in component Fast_Na_Current VALUE:14.33914484 CC ID:17 NAME: G_Na_B in component Background_Na_Current VALUE:0.000522329628 CC ID:23 NAME: G_Kr in component Rapidly_Activating_K_Current VALUE:0.02741081 CC ID:24 NAME: G_Ks in component Slowly_Activating_K_Current VALUE:0.002511026 CC ID:25 NAME: G_Kp in component Plateau_K_Current VALUE:0.002508008 CC ID:26 NAME: G_to in component Transient_Outward_K_Current VALUE:0.14560234 CC ID:28 NAME: G_K1 in component Inward_Rectifier_K_Current VALUE:0.5212067835 CC ID:29 NAME: G_ClCa in component Ca_Activated_Cl_Current VALUE:0.063621668352 CC ID:31 NAME: G_Cl_B in component Background_Cl_Current VALUE:0.009779112 CC ID:34 NAME: G_Ca in component L_Type_Calcium_Current VALUE:0.00014731931634 CC ID:47 NAME: G_Ca_B in component Background_Ca_Current VALUE:0.0006563656514 CC ID:19 NAME: Ibar_NaK in component Na_K_Pump_Current VALUE:1.07098002 CC ID:44 NAME: Ibar_NCX in component Na_Ca_Exchanger_Current VALUE:3.423105 CC ID:46 NAME: Ibar_PMCA in component Sarcolemmal_Ca_Pump_Current VALUE:0.065910928 CC ID:12 NAME: J_Ca_juncsl in component membrane VALUE:9.7435736118e-13 CC ID:13 NAME: J_Ca_slmyo in component membrane VALUE:3.7469586412e-12 CC ID:60 NAME: k_SR_leak in component SR_Fluxes VALUE:6.47701628e-06 CC ID:55 NAME: ks in component SR_Fluxes VALUE:20.63455 CC ID:58 NAME: V_max_SR_CaP in component SR_Fluxes VALUE:0.00474095564 CC ID:33 NAME: Ca_o in component Calcium_Concentrations VALUE:2.5 STP ID:0 - V in component membrane STP ID:23 - Ca_i in component Calcium_Concentrations CONFIG END Doing CP It:8421908.000000 Doing CP It:16905322.000000 Doing CP It:25084729.000000 Doing CP It:33445063.000000 Doing CP It:42027710.000000 ... ... ... Doing CP It:1248476060.000000 Doing CP It:1256196617.000000 MName:CRLP2011_EPI MID:6 OpT:3000000.000000 DT:0.002000 OutFreq:50 InT:2999000.000000 NumConstToChange:20 NumStatesToPrint:2 NumAlgToPrint:0 CC ID:16 NAME: G_Na in component Fast_Na_Current VALUE:14.33914484 CC ID:17 NAME: G_Na_B in component Background_Na_Current VALUE:0.000522329628 CC ID:23 NAME: G_Kr in component Rapidly_Activating_K_Current VALUE:0.02741081 CC ID:24 NAME: G_Ks in component Slowly_Activating_K_Current VALUE:0.002511026 CC ID:25 NAME: G_Kp in component Plateau_K_Current VALUE:0.002508008 CC ID:26 NAME: G_to in component Transient_Outward_K_Current VALUE:0.14560234 CC ID:28 NAME: G_K1 in component Inward_Rectifier_K_Current VALUE:0.5212067835 CC ID:29 NAME: G_ClCa in component Ca_Activated_Cl_Current VALUE:0.063621668352 CC ID:31 NAME: G_Cl_B in component Background_Cl_Current VALUE:0.009779112 CC ID:34 NAME: G_Ca in component L_Type_Calcium_Current VALUE:0.00014731931634 CC ID:47 NAME: G_Ca_B in component Background_Ca_Current VALUE:0.0006563656514 CC ID:19 NAME: Ibar_NaK in component Na_K_Pump_Current VALUE:1.07098002 CC ID:44 NAME: Ibar_NCX in component Na_Ca_Exchanger_Current VALUE:3.423105 CC ID:46 NAME: Ibar_PMCA in component Sarcolemmal_Ca_Pump_Current VALUE:0.065910928 CC ID:12 NAME: J_Ca_juncsl in component membrane VALUE:9.7435736118e-13 CC ID:13 NAME: J_Ca_slmyo in component membrane VALUE:3.7469586412e-12 CC ID:60 NAME: k_SR_leak in component SR_Fluxes VALUE:6.47701628e-06 CC ID:55 NAME: ks in component SR_Fluxes VALUE:20.63455 CC ID:58 NAME: V_max_SR_CaP in component SR_Fluxes VALUE:0.00474095564 CC ID:33 NAME: Ca_o in component Calcium_Concentrations VALUE:2.5 STP ID:0 - V in component membrane STP ID:23 - Ca_i in component Calcium_Concentrations CONFIG END LoadFromCP: it = 0.000000 Doing CP It:7751370.000000 Doing CP It:15470000.000000 ... ... |
Send message Joined: 18 Mar 15 Posts: 284 Credit: 2,748,608 RAC: 0 |
|
Send message Joined: 5 Oct 15 Posts: 17 Credit: 1,335,501 RAC: 0 |
Not sure all bugs are squashed just yet. Last night I was running WU GD_jcarro_20160714201545000000_ThirdSimulations_SteadyState3000Schmidt98_conf_689.xml_2. It was at 8 hours and about 65% complete. This morning, it was at 15 hours and only 5% complete. The machine did not shut off or reboot overnight. This sounds like the "hit 100% and start over" bug others have reported. I verified it was version 1.08. I have aborted the task. |
Send message Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
Hi! It's strange behavior, and it doesn't appear on other machines, so it may be related to a problem reading or writing the files. We are looking into the report and will try to solve it as fast as possible. Best regards, Joel |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
Quote: "Could you send us the link to the task? Thank you!" Chus / Joel: Here is a link to the task that appeared to restart from the beginning several times, each time I closed BOINC. I think resuming-from-checkpoint is broken. Please fix. http://denis.usj.es/denisathome/result.php?resultid=38373900 Thanks, Jacob |
Send message Joined: 9 Apr 15 Posts: 172 Credit: 1,552,856 RAC: 0 |
No problems with my reboots. WUs restart correctly from checkpoint. |
Send message Joined: 12 Jul 15 Posts: 7 Credit: 43,028,399 RAC: 0 |
A couple of 1.08 tasks just restarted from the beginning, and all had this message in the log: Task GD_jcarro_20160714202348000000_ThirdSimulations_SteadyState8000Schmidt98_conf_637.xml_2 exited with zero status but no 'finished' file |
Send message Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
We are looking for some kind of pattern in the bug, so if you could send me more information about your computer and any unusual configuration you have, it would be really appreciated. (You could send it via email or private message in the forum.) Best regards, Joel. |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
Which bug? I think you have multiple. Regarding the resume-from-checkpoint problem, when you restart BOINC to resume a task, does your stderr.txt (in your slot folder) ever say it resumes correctly? Mine always says: LoadFromCP: it = 0.000000... and the progress restarts from 0%, but Elapsed does not reset. Surely you can reproduce that behavior? I would think it should resume (LoadFromCP) from a non-zero-iteration (it = xxx). |
Send message Joined: 12 Jul 15 Posts: 7 Credit: 43,028,399 RAC: 0 |
There's absolutely nothing unusual about the rig that restarted those tasks. It's this one : http://denis.usj.es/denisathome/show_host_detail.php?hostid=4788 and this is one of the tasks that restarted : http://denis.usj.es/denisathome/workunit.php?wuid=18823645 |
Send message Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
We know that more than one computer is affected. But we have run tests on our Windows computer with a fresh Windows 10 installed, and it seems to work correctly. So we need more information to reproduce the bug in order to find its solution. PS: http://denis.usj.es/denisathome/result.php?resultid=38370457 Best regards, Joel. |
Send message Joined: 12 Jul 15 Posts: 7 Credit: 43,028,399 RAC: 0 |
I had the same restarting problems on several Windows and Linux systems; all those tasks had this in their name: SteadyState8000. SteadyState tasks with numbers other than 8000 had no problems. |
Send message Joined: 8 Apr 15 Posts: 20 Credit: 64,498 RAC: 0 |
Edit: Yeah, SteadyState8000 tasks seem to be the problematic ones - make sure you are testing those. Well, I'm using an Insider Build of Windows 10, the latest fast-ring Build 14393. And I've seen the behavior on multiple PCs, all running Insider Builds, not just one PC. You can use Settings to sign up for the Insider Builds, and it takes 24-48 hours to get offered the build. While it's possible the Insider build might be a potential cause, I think it's unlikely, unless the app itself isn't executing correctly using the new build (i.e., an app problem). Is there anything special the app does when it tries to resume from checkpoint? |
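
Two things reported in this thread can be put side by side with a little arithmetic: the developers say a variable overflows when recovering from a checkpoint, and several posters see restarts only on SteadyState8000 tasks. The sketch below is not the DENIS application's code. It takes OpT and DT from the stderr.txt quoted above, and it assumes (neither is confirmed in the thread) that a SteadyState8000 task runs with OpT = 8000000 at the same DT, and that the iteration counter passes through a signed 32-bit variable somewhere. The function names are hypothetical.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def total_iterations(op_t: float, dt: float) -> int:
    """Number of solver steps for a run of length op_t with step dt (same time units)."""
    return round(op_t / dt)


def as_int32(value: int) -> int:
    """Value a signed 32-bit variable would hold after storing `value` (wraps on overflow)."""
    return ctypes.c_int32(value).value


# OpT and DT exactly as printed in the stderr.txt quoted in the thread.
steps_3000 = total_iterations(3_000_000.0, 0.002)

# ASSUMPTION: a SteadyState8000 task uses OpT = 8,000,000 with the same DT.
steps_8000 = total_iterations(8_000_000.0, 0.002)

for label, steps in (("3000", steps_3000), ("8000", steps_8000)):
    fits = steps <= INT32_MAX
    print(f"SteadyState{label}: {steps} steps, fits in int32: {fits}, "
          f"stored as int32: {as_int32(steps)}")
```

Under those assumptions the 3000-series run needs 1,500,000,000 steps, which fits in a signed 32-bit integer, while the 8000-series run needs 4,000,000,000, which does not and wraps to a negative number. That would be consistent with only the 8000 series restarting, but it does not account for every report here: the stderr.txt quoted above is from an OpT:3000000 task and still shows `LoadFromCP: it = 0.000000` on resume.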