Very long wus

Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 25
Message 935 - Posted: 17 Jul 2016, 16:52:57 UTC
Last modified: 17 Jul 2016, 17:05:13 UTC

I just watched a task work its way up to 99.999% progress (with 0:00:01 remaining), then drop back to 0% progress (with 1380+ days remaining). So, in addition to "resuming from checkpoint" possibly not working, I think "exiting normally and gracefully at 100%" is broken too.

This is enough for me to set "No New Tasks" for all my PCs, and abort any in-progress DENIS work. In general, the aborted tasks had accumulated about 42 hours of run-time each, before I aborted them. :(

DEVS:
PLEASE figure these problems out, as you are wasting resources.

Thanks,
Jacob
ID: 935 · Rating: 0
jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 936 - Posted: 17 Jul 2016, 18:07:33 UTC

Hello everyone,

We have found a bug in the long tasks. They overflow the capacity of certain variables in the code. (We have increased the task length and the number of variables to modify in order to get better results.)

We are modifying the code to avoid it. Version v1.07 solves the first overflow, which happened at initialization, but this new one happens while recovering from a checkpoint. We will upgrade the application as fast as we can.

Best regards, Joel.
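
A rough illustration of the kind of overflow described above. It assumes, and this is not stated anywhere in the thread, that the step counter is a 32-bit signed integer and that the number of steps is OpT / DT, using the OpT and DT fields that appear in the stderr.txt output posted later in this thread. The 8,000,000 value is a guess by analogy with the SteadyState8000 task names, and the function names are made up for the sketch.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def total_iterations(op_t, dt):
    """Number of time steps, assuming it is simply OpT / DT."""
    return round(op_t / dt)


def fits_int32(n):
    """True if n can be stored in a 32-bit signed integer."""
    return -2**31 <= n <= INT32_MAX


def wrap_int32(n):
    """Value n would take after wrapping around in a 32-bit signed integer."""
    return (n + 2**31) % 2**32 - 2**31


# 3,000,000 is the OpT shown in the posted log; 8,000,000 is hypothetical.
for op_t in (3_000_000.0, 8_000_000.0):
    n = total_iterations(op_t, 0.002)
    print(op_t, n, fits_int32(n), wrap_int32(n))
```

Under these assumptions the 3,000,000 case needs 1,500,000,000 steps and still fits, while the hypothetical 8,000,000 case needs 4,000,000,000 steps and would wrap to a negative number.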
ID: 936 · Rating: 0
Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 25
Message 938 - Posted: 17 Jul 2016, 23:42:27 UTC

Thanks, Joel. I hope you get it sorted.

I'm subscribed to this thread, so if you drop a note here when things are working again, I'll be happy to unset "No New Tasks".

- Jacob
ID: 938 · Rating: 0
zioriga
Joined: 15 Apr 15
Posts: 1
Credit: 3,325,251
RAC: 494
Message 939 - Posted: 18 Jul 2016, 5:31:23 UTC

So is it better to abort all the WUs in the queue?
ID: 939 · Rating: 0
[VENETO] boboviz
Joined: 9 Apr 15
Posts: 154
Credit: 612,522
RAC: 3,083
Message 940 - Posted: 18 Jul 2016, 9:50:51 UTC - in response to Message 939.  

So is it better to abort all the WUs in the queue?


The "8000 series" for sure.
But I hope they validate the "750 series" I'm crunching...
ID: 940 · Rating: 0
jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 941 - Posted: 18 Jul 2016, 10:41:56 UTC

Hi!

We have uploaded a new version of CRLP2011 that has that bug fixed.

Best regards, Joel
ID: 941 · Rating: 0
rjs5
Joined: 3 Nov 15
Posts: 22
Credit: 1,089,247
RAC: 7,979
Message 943 - Posted: 18 Jul 2016, 11:21:35 UTC - in response to Message 941.  

Hi!

We have uploaded a new version of CRLP2011 that has that bug fixed.

Best regards, Joel


Joel,
It would be helpful if you gave more information in these "new application" messages.

Which version did you "upload"? I know what version I am running.
Which bug?
What should crunchers do with the JOBS in PROGRESS?
ID: 943 · Rating: 0
Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 25
Message 945 - Posted: 18 Jul 2016, 19:34:11 UTC
Last modified: 18 Jul 2016, 19:35:49 UTC

Joel:

Are you sure resuming-from-checkpoint is working? It seems like it's not!

Check out this stderr.txt, from an in-progress task using the 1.08 app. After it was about 80% done, I suspended all tasks, closed BOINC, reopened BOINC, then resumed all tasks, and now it looks like it's starting over from 0%. This stderr.txt makes it sound like it doesn't resume from checkpoint at all :( Can you explain please?

MName:CRLP2011_EPI
MID:6
OpT:3000000.000000
DT:0.002000
OutFreq:50
InT:2999000.000000
NumConstToChange:20
NumStatesToPrint:2
NumAlgToPrint:0
CC ID:16 NAME: G_Na in component Fast_Na_Current VALUE:14.33914484
CC ID:17 NAME: G_Na_B in component Background_Na_Current VALUE:0.000522329628
CC ID:23 NAME: G_Kr in component Rapidly_Activating_K_Current VALUE:0.02741081
CC ID:24 NAME: G_Ks in component Slowly_Activating_K_Current VALUE:0.002511026
CC ID:25 NAME: G_Kp in component Plateau_K_Current VALUE:0.002508008
CC ID:26 NAME: G_to in component Transient_Outward_K_Current VALUE:0.14560234
CC ID:28 NAME: G_K1 in component Inward_Rectifier_K_Current VALUE:0.5212067835
CC ID:29 NAME: G_ClCa in component Ca_Activated_Cl_Current VALUE:0.063621668352
CC ID:31 NAME: G_Cl_B in component Background_Cl_Current VALUE:0.009779112
CC ID:34 NAME: G_Ca in component L_Type_Calcium_Current VALUE:0.00014731931634
CC ID:47 NAME: G_Ca_B in component Background_Ca_Current VALUE:0.0006563656514
CC ID:19 NAME: Ibar_NaK in component Na_K_Pump_Current VALUE:1.07098002
CC ID:44 NAME: Ibar_NCX in component Na_Ca_Exchanger_Current VALUE:3.423105
CC ID:46 NAME: Ibar_PMCA in component Sarcolemmal_Ca_Pump_Current VALUE:0.065910928
CC ID:12 NAME: J_Ca_juncsl in component membrane VALUE:9.7435736118e-13
CC ID:13 NAME: J_Ca_slmyo in component membrane VALUE:3.7469586412e-12
CC ID:60 NAME: k_SR_leak in component SR_Fluxes VALUE:6.47701628e-06
CC ID:55 NAME: ks in component SR_Fluxes VALUE:20.63455
CC ID:58 NAME: V_max_SR_CaP in component SR_Fluxes VALUE:0.00474095564
CC ID:33 NAME: Ca_o in component Calcium_Concentrations VALUE:2.5
STP ID:0 - V in component membrane
STP ID:23 - Ca_i in component Calcium_Concentrations
CONFIG END

Doing CP It:8421908.000000
Doing CP It:16905322.000000
Doing CP It:25084729.000000
Doing CP It:33445063.000000
Doing CP It:42027710.000000
...
...
...
Doing CP It:1248476060.000000
Doing CP It:1256196617.000000MName:CRLP2011_EPI
MID:6
OpT:3000000.000000
DT:0.002000
OutFreq:50
InT:2999000.000000
NumConstToChange:20
NumStatesToPrint:2
NumAlgToPrint:0
CC ID:16 NAME: G_Na in component Fast_Na_Current VALUE:14.33914484
CC ID:17 NAME: G_Na_B in component Background_Na_Current VALUE:0.000522329628
CC ID:23 NAME: G_Kr in component Rapidly_Activating_K_Current VALUE:0.02741081
CC ID:24 NAME: G_Ks in component Slowly_Activating_K_Current VALUE:0.002511026
CC ID:25 NAME: G_Kp in component Plateau_K_Current VALUE:0.002508008
CC ID:26 NAME: G_to in component Transient_Outward_K_Current VALUE:0.14560234
CC ID:28 NAME: G_K1 in component Inward_Rectifier_K_Current VALUE:0.5212067835
CC ID:29 NAME: G_ClCa in component Ca_Activated_Cl_Current VALUE:0.063621668352
CC ID:31 NAME: G_Cl_B in component Background_Cl_Current VALUE:0.009779112
CC ID:34 NAME: G_Ca in component L_Type_Calcium_Current VALUE:0.00014731931634
CC ID:47 NAME: G_Ca_B in component Background_Ca_Current VALUE:0.0006563656514
CC ID:19 NAME: Ibar_NaK in component Na_K_Pump_Current VALUE:1.07098002
CC ID:44 NAME: Ibar_NCX in component Na_Ca_Exchanger_Current VALUE:3.423105
CC ID:46 NAME: Ibar_PMCA in component Sarcolemmal_Ca_Pump_Current VALUE:0.065910928
CC ID:12 NAME: J_Ca_juncsl in component membrane VALUE:9.7435736118e-13
CC ID:13 NAME: J_Ca_slmyo in component membrane VALUE:3.7469586412e-12
CC ID:60 NAME: k_SR_leak in component SR_Fluxes VALUE:6.47701628e-06
CC ID:55 NAME: ks in component SR_Fluxes VALUE:20.63455
CC ID:58 NAME: V_max_SR_CaP in component SR_Fluxes VALUE:0.00474095564
CC ID:33 NAME: Ca_o in component Calcium_Concentrations VALUE:2.5
STP ID:0 - V in component membrane
STP ID:23 - Ca_i in component Calcium_Concentrations
CONFIG END
LoadFromCP: it = 0.000000
Doing CP It:7751370.000000
Doing CP It:15470000.000000
...
...
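
As a rough cross-check of the "about 80% done" figure against this log, here is a small sketch. It assumes the total number of iterations is OpT / DT and that progress is linear in the iteration count; neither is confirmed by the project, and `progress_fraction` is a made-up helper name.

```python
def progress_fraction(checkpoint_iteration, op_t, dt):
    """Fraction of the run completed at a given checkpoint iteration,
    assuming the total iteration count is OpT / DT."""
    total = op_t / dt
    return checkpoint_iteration / total


# Values copied from the log above: last checkpoint before the restart,
# then the first checkpoint written after "LoadFromCP: it = 0.000000".
before = progress_fraction(1256196617.0, 3000000.0, 0.002)
after = progress_fraction(7751370.0, 3000000.0, 0.002)
print(f"before restart: {before:.1%}")
print(f"after restart:  {after:.1%}")
```

With those assumptions the last checkpoint before the restart sits at roughly 84% and the first one after it at well under 1%, which matches the behavior described.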
ID: 945 · Rating: 0
Jesús Carro
Project administrator
Project developer
Project scientist
Help desk expert
Joined: 18 Mar 15
Posts: 178
Credit: 450,514
RAC: 51
Message 946 - Posted: 19 Jul 2016, 8:41:19 UTC - in response to Message 945.  

Could you send us the link to the task? Thank you!
Jesús Carro
Universidad San Jorge
@InSilicoHeart
ID: 946 · Rating: 0
Col323
Joined: 5 Oct 15
Posts: 17
Credit: 1,335,501
RAC: 0
Message 947 - Posted: 19 Jul 2016, 11:53:13 UTC

Not sure all bugs are squashed just yet. Last night I was running WU GD_jcarro_20160714201545000000_ThirdSimulations_SteadyState3000Schmidt98_conf_689.xml_2. It was at 8 hours and about 65% complete. This morning, it was at 15 hours and only 5% complete. The machine did not shut off or reboot overnight. This sounds like the "hit 100% and start over" bug others have reported.

I verified it was version 1.08. I have aborted the task.
ID: 947 · Rating: 0
jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 948 - Posted: 19 Jul 2016, 13:10:56 UTC
Last modified: 19 Jul 2016, 13:11:31 UTC

Hi!
It's strange behavior, and it doesn't appear on other machines, so it may be related to a problem reading or writing the files.

We are looking into the report and will try to solve it as fast as possible.

Best regards, Joel
ID: 948 · Rating: 0
Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 25
Message 949 - Posted: 19 Jul 2016, 13:38:44 UTC - in response to Message 946.  

Could you send us the link to the task? Thank you!


Chus / Joel:

Here is a link to the task that appeared to restart-from-beginning several times, each time I closed BOINC. I think resuming-from-checkpoint is broken. Please fix.

http://denis.usj.es/denisathome/result.php?resultid=38373900

Thanks,
Jacob
ID: 949 · Rating: 0
[VENETO] boboviz
Joined: 9 Apr 15
Posts: 154
Credit: 612,522
RAC: 3,083
Message 950 - Posted: 19 Jul 2016, 15:21:51 UTC

No problems with my reboots. WUs restart correctly from checkpoint.
ID: 950 · Rating: 0
mm67
Joined: 12 Jul 15
Posts: 7
Credit: 43,028,399
RAC: 0
Message 951 - Posted: 19 Jul 2016, 15:56:13 UTC

A couple of 1.08 tasks just restarted from the beginning, and all had this message in the log:

Task GD_jcarro_20160714202348000000_ThirdSimulations_SteadyState8000Schmidt98_conf_637.xml_2 exited with zero status but no 'finished' file
ID: 951 · Rating: 0
jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 952 - Posted: 19 Jul 2016, 19:01:14 UTC

We are looking for some kind of pattern in the bug, so if you could send me more information about your computer and any unusual configuration you have, it would be really appreciated. (You can send it via email or private message in the forum.)

Best regards, Joel.
ID: 952 · Rating: 0
Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 25
Message 953 - Posted: 19 Jul 2016, 19:06:30 UTC
Last modified: 19 Jul 2016, 19:09:23 UTC

Which bug? I think you have multiple.

Regarding the resume-from-checkpoint problem, when you restart BOINC to resume a task, does your stderr.txt (in your slot folder) ever say it resumes correctly?

Mine always says:
LoadFromCP: it = 0.000000
... and the progress restarts from 0%, but Elapsed does not reset.
Surely you can reproduce that behavior?

I would think it should resume (LoadFromCP) from a non-zero-iteration (it = xxx).
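
For anyone who wants to check their own slot folder for this, here is a minimal sketch that scans the text of a stderr.txt for resumes that loaded an earlier iteration than the last checkpoint written. The two patterns are taken from the log lines quoted in this thread; the function name and the idea that a healthy resume should load the last written iteration are assumptions, not anything the project has documented.

```python
import re

# Patterns copied from the stderr.txt lines quoted in this thread.
CP_WRITE = re.compile(r"Doing CP It:(\d+(?:\.\d+)?)")
CP_LOAD = re.compile(r"LoadFromCP: it = (\d+(?:\.\d+)?)")


def find_lost_resumes(stderr_text):
    """Return (last_written, loaded) pairs where a resume loaded an
    iteration lower than the last checkpoint written before it."""
    events = []
    for m in CP_WRITE.finditer(stderr_text):
        events.append((m.start(), "write", float(m.group(1))))
    for m in CP_LOAD.finditer(stderr_text):
        events.append((m.start(), "load", float(m.group(1))))
    events.sort()  # order events by their position in the log

    lost = []
    last_written = None
    for _, kind, value in events:
        if kind == "write":
            last_written = value
        else:
            if last_written is not None and value < last_written:
                lost.append((last_written, value))
            last_written = value  # the run continues from what was loaded
    return lost


# Excerpt modelled on the log posted earlier, including the line where a
# checkpoint message and the next run's first line are fused together.
sample = (
    "Doing CP It:1248476060.000000\n"
    "Doing CP It:1256196617.000000MName:CRLP2011_EPI\n"
    "CONFIG END\n"
    "LoadFromCP: it = 0.000000\n"
    "Doing CP It:7751370.000000\n"
)
print(find_lost_resumes(sample))
```

To run it on a real file, read the file into a string first, for example with `open("stderr.txt").read()`, and pass that string to `find_lost_resumes`.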
ID: 953 · Rating: 0
mm67
Joined: 12 Jul 15
Posts: 7
Credit: 43,028,399
RAC: 0
Message 954 - Posted: 19 Jul 2016, 22:23:14 UTC - in response to Message 952.  

There's absolutely nothing unusual about the rig that restarted those tasks. It's this one: http://denis.usj.es/denisathome/show_host_detail.php?hostid=4788 and this is one of the tasks that restarted: http://denis.usj.es/denisathome/workunit.php?wuid=18823645
ID: 954 · Rating: 0
jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 956 - Posted: 20 Jul 2016, 14:22:24 UTC

We know that more than one computer is affected, but we have run tests on our Windows computer with a fresh Windows 10 installation, and it seems to work correctly. So we need more information to reproduce the bug in order to check its solution.

PS: http://denis.usj.es/denisathome/result.php?resultid=38370457

Best regards, Joel.
ID: 956 · Rating: 0
mm67
Joined: 12 Jul 15
Posts: 7
Credit: 43,028,399
RAC: 0
Message 957 - Posted: 20 Jul 2016, 14:33:01 UTC - in response to Message 956.  

I had the same restarting problems on several Windows and Linux systems. All those tasks had this in their name: SteadyState8000. SteadyState tasks with numbers other than 8000 had no problems.
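
One way to look for this kind of pattern across a list of restarted tasks is to pull the number after "SteadyState" out of each task name and count by it. This is only a sketch: the two names below are the ones quoted earlier in this thread, and the helper names are made up.

```python
import re
from collections import Counter

# The series number is the run of digits right after "SteadyState".
SERIES = re.compile(r"SteadyState(\d+)")


def series_of(task_name):
    """Return the SteadyState series number in a task name, or None."""
    m = SERIES.search(task_name)
    return int(m.group(1)) if m else None


def count_by_series(task_names):
    """Count task names per SteadyState series."""
    return Counter(series_of(name) for name in task_names)


# Task names reported as restarting in this thread.
restarted = [
    "GD_jcarro_20160714201545000000_ThirdSimulations_SteadyState3000Schmidt98_conf_689.xml_2",
    "GD_jcarro_20160714202348000000_ThirdSimulations_SteadyState8000Schmidt98_conf_637.xml_2",
]
print(count_by_series(restarted))
```

With a longer list of names, a series that dominates the counts would support the observation above; two names are far too few to conclude anything.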
ID: 957 · Rating: 0
Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 25
Message 958 - Posted: 20 Jul 2016, 14:37:11 UTC
Last modified: 20 Jul 2016, 14:39:10 UTC

Edit: Yeah, the SteadyState8000 tasks seem to be the problematic ones, so make sure you are testing those.

Well, I'm using an Insider Build of Windows 10, the latest fast-ring Build 14393. And I've seen the behavior on multiple PCs, all running Insider Builds, not just one PC. You can use Settings to sign up for the Insider Builds, and it takes 24-48 hours to get offered the build.

While it's possible the Insider build is the cause, I think it's unlikely, unless the app itself isn't executing correctly on the new build (i.e., an app problem).

Is there anything special the app does, when it tries to resume from checkpoint?
ID: 958 · Rating: 0