Very long wus

Message boards : Number crunching : Very long wus

Jacob Klein (Joined: 8 Apr 15 · Posts: 20 · Credit: 64,498 · RAC: 0)
Message 935 · Posted: 17 Jul 2016, 16:52:57 UTC · Last modified: 17 Jul 2016, 17:05:13 UTC

I just watched a task work its way up to 99.999% progress (with 0:00:01 remaining), then it went to 0% progress (with 1380+ days remaining). So, in addition to "resuming from checkpoint" possibly not working, I think "exiting normally and gracefully at 100%" is also broken.

This is enough for me to set "No New Tasks" for all my PCs, and abort any in-progress DENIS work. The aborted tasks had each accumulated about 42 hours of run-time before I aborted them. :(

DEVS: PLEASE figure these problems out, as you are wasting resources.

Thanks,
Jacob

jcastro (Joined: 16 Mar 15 · Posts: 219 · Credit: 14,859 · RAC: 0)
Message 936 · Posted: 17 Jul 2016, 18:07:33 UTC

Hello everyone,

We have found a bug in the long tasks. They overflow the capacity of certain variables in the code (we have increased the length and the number of variables to modify in order to get better results). We are modifying the code to avoid it.

Version v1.07 solves the first overflow, which was at initialization. But this new one occurs when recovering from a checkpoint. We will upgrade the application as fast as we can.

Best regards,
Joel.

Jacob Klein (Joined: 8 Apr 15 · Posts: 20 · Credit: 64,498 · RAC: 0)
Message 938 · Posted: 17 Jul 2016, 23:42:27 UTC

Thanks, Joel. I hope you get it sorted. I'm subscribed to this thread, so if you drop a note here when things are working again, I'll be happy to unset "No New Tasks".

- Jacob

zioriga (Joined: 15 Apr 15 · Posts: 1 · Credit: 3,939,246 · RAC: 16)
Message 939 · Posted: 18 Jul 2016, 5:31:23 UTC

So is it better to abort all the WUs in the queue?

[VENETO] boboviz (Joined: 9 Apr 15 · Posts: 171 · Credit: 1,409,551 · RAC: 1,699)
Message 940 · Posted: 18 Jul 2016, 9:50:51 UTC · In response to Message 939

Quote: so is it better to abort all the WUs in the queue ???

The "8000 series" for sure. But I hope they validate the "750 series" I'm crunching...

jcastro (Joined: 16 Mar 15 · Posts: 219 · Credit: 14,859 · RAC: 0)
Message 941 · Posted: 18 Jul 2016, 10:41:56 UTC

Hi!

We have uploaded a new version of CRLP2011 that has that bug fixed.

Best regards,
Joel

rjs5 (Joined: 3 Nov 15 · Posts: 22 · Credit: 1,885,463 · RAC: 2,863)
Message 943 · Posted: 18 Jul 2016, 11:21:35 UTC · In response to Message 941

Quote: Hi! We have upload new version of CRLP2011 that have that bug fixed. Best regards, Joel

Joel,

It would be helpful if you gave more information in these "new application" messages.
Which version did you "upload"? I know what version I am running.
Which bug?
What should crunchers do with the JOBS in PROGRESS?

Jacob Klein (Joined: 8 Apr 15 · Posts: 20 · Credit: 64,498 · RAC: 0)
Message 945 · Posted: 18 Jul 2016, 19:34:11 UTC · Last modified: 18 Jul 2016, 19:35:49 UTC

Joel:

Are you sure resuming-from-checkpoint is working? It seems like it's not!

Check out this stderr.txt, from an in-progress task using the 1.08 app. After it was about 80% done, I suspended all tasks, closed BOINC, then reopened BOINC and resumed all tasks, and now it looks like it's starting over from 0%. This stderr.txt makes it sound like it doesn't resume from checkpoint at all :(

Can you explain please?

    MName:CRLP2011_EPI MID:6 OpT:3000000.000000 DT:0.002000 OutFreq:50 InT:2999000.000000 NumConstToChange:20 NumStatesToPrint:2 NumAlgToPrint:0
    CC ID:16 NAME: G_Na in component Fast_Na_Current VALUE:14.33914484
    CC ID:17 NAME: G_Na_B in component Background_Na_Current VALUE:0.000522329628
    CC ID:23 NAME: G_Kr in component Rapidly_Activating_K_Current VALUE:0.02741081
    CC ID:24 NAME: G_Ks in component Slowly_Activating_K_Current VALUE:0.002511026
    CC ID:25 NAME: G_Kp in component Plateau_K_Current VALUE:0.002508008
    CC ID:26 NAME: G_to in component Transient_Outward_K_Current VALUE:0.14560234
    CC ID:28 NAME: G_K1 in component Inward_Rectifier_K_Current VALUE:0.5212067835
    CC ID:29 NAME: G_ClCa in component Ca_Activated_Cl_Current VALUE:0.063621668352
    CC ID:31 NAME: G_Cl_B in component Background_Cl_Current VALUE:0.009779112
    CC ID:34 NAME: G_Ca in component L_Type_Calcium_Current VALUE:0.00014731931634
    CC ID:47 NAME: G_Ca_B in component Background_Ca_Current VALUE:0.0006563656514
    CC ID:19 NAME: Ibar_NaK in component Na_K_Pump_Current VALUE:1.07098002
    CC ID:44 NAME: Ibar_NCX in component Na_Ca_Exchanger_Current VALUE:3.423105
    CC ID:46 NAME: Ibar_PMCA in component Sarcolemmal_Ca_Pump_Current VALUE:0.065910928
    CC ID:12 NAME: J_Ca_juncsl in component membrane VALUE:9.7435736118e-13
    CC ID:13 NAME: J_Ca_slmyo in component membrane VALUE:3.7469586412e-12
    CC ID:60 NAME: k_SR_leak in component SR_Fluxes VALUE:6.47701628e-06
    CC ID:55 NAME: ks in component SR_Fluxes VALUE:20.63455
    CC ID:58 NAME: V_max_SR_CaP in component SR_Fluxes VALUE:0.00474095564
    CC ID:33 NAME: Ca_o in component Calcium_Concentrations VALUE:2.5
    STP ID:0 - V in component membrane
    STP ID:23 - Ca_i in component Calcium_Concentrations
    CONFIG END
    Doing CP It:8421908.000000
    Doing CP It:16905322.000000
    Doing CP It:25084729.000000
    Doing CP It:33445063.000000
    Doing CP It:42027710.000000
    ... ... ...
    Doing CP It:1248476060.000000
    Doing CP It:1256196617.000000
    MName:CRLP2011_EPI MID:6 OpT:3000000.000000 DT:0.002000 OutFreq:50 InT:2999000.000000 NumConstToChange:20 NumStatesToPrint:2 NumAlgToPrint:0
    CC ID:16 NAME: G_Na in component Fast_Na_Current VALUE:14.33914484
    CC ID:17 NAME: G_Na_B in component Background_Na_Current VALUE:0.000522329628
    CC ID:23 NAME: G_Kr in component Rapidly_Activating_K_Current VALUE:0.02741081
    CC ID:24 NAME: G_Ks in component Slowly_Activating_K_Current VALUE:0.002511026
    CC ID:25 NAME: G_Kp in component Plateau_K_Current VALUE:0.002508008
    CC ID:26 NAME: G_to in component Transient_Outward_K_Current VALUE:0.14560234
    CC ID:28 NAME: G_K1 in component Inward_Rectifier_K_Current VALUE:0.5212067835
    CC ID:29 NAME: G_ClCa in component Ca_Activated_Cl_Current VALUE:0.063621668352
    CC ID:31 NAME: G_Cl_B in component Background_Cl_Current VALUE:0.009779112
    CC ID:34 NAME: G_Ca in component L_Type_Calcium_Current VALUE:0.00014731931634
    CC ID:47 NAME: G_Ca_B in component Background_Ca_Current VALUE:0.0006563656514
    CC ID:19 NAME: Ibar_NaK in component Na_K_Pump_Current VALUE:1.07098002
    CC ID:44 NAME: Ibar_NCX in component Na_Ca_Exchanger_Current VALUE:3.423105
    CC ID:46 NAME: Ibar_PMCA in component Sarcolemmal_Ca_Pump_Current VALUE:0.065910928
    CC ID:12 NAME: J_Ca_juncsl in component membrane VALUE:9.7435736118e-13
    CC ID:13 NAME: J_Ca_slmyo in component membrane VALUE:3.7469586412e-12
    CC ID:60 NAME: k_SR_leak in component SR_Fluxes VALUE:6.47701628e-06
    CC ID:55 NAME: ks in component SR_Fluxes VALUE:20.63455
    CC ID:58 NAME: V_max_SR_CaP in component SR_Fluxes VALUE:0.00474095564
    CC ID:33 NAME: Ca_o in component Calcium_Concentrations VALUE:2.5
    STP ID:0 - V in component membrane
    STP ID:23 - Ca_i in component Calcium_Concentrations
    CONFIG END
    LoadFromCP: it = 0.000000
    Doing CP It:7751370.000000
    Doing CP It:15470000.000000
    ... ...

Jesús Carro (Project administrator · Project developer · Project scientist · Joined: 18 Mar 15 · Posts: 271 · Credit: 495,633 · RAC: 103)
Message 946 · Posted: 19 Jul 2016, 8:41:19 UTC · In response to Message 945

Could you send us the link to the task?

Thank you!

Jesús Carro
Universidad San Jorge
@InSilicoHeart

Col323 (Joined: 5 Oct 15 · Posts: 17 · Credit: 1,335,501 · RAC: 0)
Message 947 · Posted: 19 Jul 2016, 11:53:13 UTC

Not sure all bugs are squashed just yet. Last night I was running WU GD_jcarro_20160714201545000000_ThirdSimulations_SteadyState3000Schmidt98_conf_689.xml_2. It was at 8 hours and about 65% complete. This morning, it was at 15 hours and only 5% complete. The machine did not shut off or reboot overnight. This sounds like the "hit 100% and start over" bug others have reported. I verified it was version 1.08. I have aborted the task.

jcastro (Joined: 16 Mar 15 · Posts: 219 · Credit: 14,859 · RAC: 0)
Message 948 · Posted: 19 Jul 2016, 13:10:56 UTC · Last modified: 19 Jul 2016, 13:11:31 UTC

Hi!

It's strange behavior, and it doesn't appear on other machines, so it could be related to some problem reading or writing the files. We are looking into the report and will try to solve it as fast as possible.

Best regards,
Joel

Jacob Klein (Joined: 8 Apr 15 · Posts: 20 · Credit: 64,498 · RAC: 0)
Message 949 · Posted: 19 Jul 2016, 13:38:44 UTC · In response to Message 946

Quote: Could you send us the link to the task? Thank you!

Chus / Joel:

Here is a link to the task that appeared to restart from the beginning several times, each time I closed BOINC. I think resuming-from-checkpoint is broken. Please fix.
http://denis.usj.es/denisathome/result.php?resultid=38373900

Thanks,
Jacob

[VENETO] boboviz (Joined: 9 Apr 15 · Posts: 171 · Credit: 1,409,551 · RAC: 1,699)
Message 950 · Posted: 19 Jul 2016, 15:21:51 UTC

No problems with my reboots. WUs restart correctly from checkpoint.

mm67 (Joined: 12 Jul 15 · Posts: 7 · Credit: 43,028,399 · RAC: 0)
Message 951 · Posted: 19 Jul 2016, 15:56:13 UTC

A couple of 1.08 tasks just restarted from the beginning, and all had this message in the log:

Task GD_jcarro_20160714202348000000_ThirdSimulations_SteadyState8000Schmidt98_conf_637.xml_2 exited with zero status but no 'finished' file

jcastro (Joined: 16 Mar 15 · Posts: 219 · Credit: 14,859 · RAC: 0)
Message 952 · Posted: 19 Jul 2016, 19:01:14 UTC

We are looking for some kind of pattern in the bug, so if you could send me more information about your computer and any unusual configuration you have, it would be really appreciated (you can send it via email or private message in the forum).

Best regards,
Joel.

Jacob Klein (Joined: 8 Apr 15 · Posts: 20 · Credit: 64,498 · RAC: 0)
Message 953 · Posted: 19 Jul 2016, 19:06:30 UTC · Last modified: 19 Jul 2016, 19:09:23 UTC

Which bug? I think you have multiple.

Regarding the resume-from-checkpoint problem: when you restart BOINC to resume a task, does your stderr.txt (in your slot folder) ever say it resumes correctly? Mine always says:

LoadFromCP: it = 0.000000

... and the progress restarts from 0%, but Elapsed does not reset. Surely you can reproduce that behavior? I would think it should resume (LoadFromCP) from a non-zero iteration (it = xxx).
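The check described above can be automated. What follows is a minimal Python sketch, assuming stderr.txt contains the "Doing CP It:" and "LoadFromCP: it =" lines in the form shown in Message 945. The function name find_resumes is made up for this example and is not part of BOINC or the DENIS application.

```python
import re

# Matches either a checkpoint write or a checkpoint load, as they
# appear in the stderr.txt posted in Message 945.
TOKEN = re.compile(
    r"Doing CP It:(\d+(?:\.\d+)?)|LoadFromCP: it = (\d+(?:\.\d+)?)"
)

def find_resumes(stderr_text):
    """Return one (last_written, loaded) pair per LoadFromCP line.

    last_written is the iteration of the most recent "Doing CP It:"
    line seen before the load, or None if there was none.
    """
    resumes = []
    last_written = None
    for match in TOKEN.finditer(stderr_text):
        if match.group(1) is not None:
            last_written = float(match.group(1))
        else:
            resumes.append((last_written, float(match.group(2))))
    return resumes

def lost_progress(stderr_text):
    """Keep only the resumes that loaded an earlier iteration than was written."""
    return [
        (written, loaded)
        for written, loaded in find_resumes(stderr_text)
        if written is not None and loaded < written
    ]
```

Run against the log in Message 945, this should flag the single resume there: the last checkpoint written was at iteration 1256196617, and the iteration loaded afterwards was 0.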

mm67 (Joined: 12 Jul 15 · Posts: 7 · Credit: 43,028,399 · RAC: 0)
Message 954 · Posted: 19 Jul 2016, 22:23:14 UTC · In response to Message 952

There's absolutely nothing unusual about the rig that restarted those tasks. It's this one:
http://denis.usj.es/denisathome/show_host_detail.php?hostid=4788

and this is one of the tasks that restarted:
http://denis.usj.es/denisathome/workunit.php?wuid=18823645

jcastro (Joined: 16 Mar 15 · Posts: 219 · Credit: 14,859 · RAC: 0)
Message 956 · Posted: 20 Jul 2016, 14:22:24 UTC

We know that more than one computer is affected. But we have run tests on our Windows computer with a fresh Windows 10 install, and it seems to work correctly. So we need more information to reproduce the bug in order to find a solution.

P.S.: http://denis.usj.es/denisathome/result.php?resultid=38370457

Best regards,
Joel.

mm67 (Joined: 12 Jul 15 · Posts: 7 · Credit: 43,028,399 · RAC: 0)
Message 957 · Posted: 20 Jul 2016, 14:33:01 UTC · In response to Message 956

I had the same restarting problems on several Windows and Linux systems. All those tasks had this in their name: SteadyState8000. SteadyState tasks with numbers other than 8000 had no problems.
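For reference, the stderr.txt in Message 945 gives enough numbers to estimate how many iterations a task runs. The Python sketch below divides OpT by DT and compares the result with the largest signed 32-bit integer. It is only an illustration under stated assumptions: the thread does not say which variables overflowed or how wide they are, and the OpT value used here for a SteadyState8000 task (8000000) is a guess based on the task name, not a value taken from any log.

```python
# Largest value of a signed 32-bit integer. That the counters are
# 32-bit signed is an assumption; the thread does not confirm it.
INT32_MAX = 2**31 - 1

def total_iterations(op_t, dt):
    """Number of solver steps, assuming one step of size dt up to op_t."""
    return round(op_t / dt)

def fits_int32(n):
    """True if n can be stored in a signed 32-bit integer."""
    return n <= INT32_MAX

# OpT and DT as printed in the stderr.txt from Message 945.
observed = total_iterations(3000000.0, 0.002)

# Hypothetical OpT for a SteadyState8000 task (guessed from the name).
guessed = total_iterations(8000000.0, 0.002)

for label, n in (("OpT=3000000", observed), ("OpT=8000000 (guess)", guessed)):
    verdict = "fits in int32" if fits_int32(n) else "exceeds int32"
    print(label, n, verdict)
```

Under these assumptions, the 3000000 case needs 1.5 billion steps, which fits, while the guessed 8000000 case would need 4 billion, which does not. This lines up with the reports that single out SteadyState8000, but it does not prove that an iteration counter is the cause, and it does not explain the restart seen in Message 945 itself.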

Jacob Klein (Joined: 8 Apr 15 · Posts: 20 · Credit: 64,498 · RAC: 0)
Message 958 · Posted: 20 Jul 2016, 14:37:11 UTC · Last modified: 20 Jul 2016, 14:39:10 UTC

Edit: Yeah, the SteadyState8000 tasks seem to be the problematic ones. Make sure you are testing those.

Well, I'm using an Insider Build of Windows 10, the latest fast-ring Build 14393. And I've seen the behavior on multiple PCs, all running Insider Builds, not just one PC. You can use Settings to sign up for the Insider Builds, and it takes 24-48 hours to get offered the build.

While it's possible that the Insider build is the cause, I think it's unlikely, unless the app itself isn't executing correctly on the new build (i.e., an app problem).

Is there anything special the app does when it tries to resume from checkpoint?
