Very long wus

Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 50,254
RAC: 0
Message 935 - Posted: 17 Jul 2016, 16:52:57 UTC
Last modified: 17 Jul 2016, 17:05:13 UTC

I just watched a task work its way up to 99.999% progress (with 0:00:01 remaining), then drop to 0% progress (with 1380+ days remaining). So, in addition to "resuming from checkpoint" possibly not working, I think "exiting normally and gracefully at 100%" is also broken.

This is enough for me to set "No New Tasks" for all my PCs, and abort any in-progress DENIS work. In general, the aborted tasks had accumulated about 42 hours of run-time each, before I aborted them. :(

DEVS:
PLEASE figure these problems out, as you are wasting resources.

Thanks,
Jacob

jcastro
Volunteer developer
Volunteer tester
Project scientist
Joined: 16 Mar 15
Posts: 218
Credit: 14,859
RAC: 0
Message 936 - Posted: 17 Jul 2016, 18:07:33 UTC

Hello everyone,

We have found a bug in the long tasks. They overflow the capacity of certain variables in the code. (We have increased the length and the number of variables to modify in order to get better results.)

We are modifying the code to avoid it. Version v1.07 solves the first overflow, which occurred at initialization. This new one occurs when recovering from a checkpoint. We will upgrade the application as fast as we can.

Best regards, Joel.

Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 50,254
RAC: 0
Message 938 - Posted: 17 Jul 2016, 23:42:27 UTC

Thanks, Joel. I hope you get it sorted.

I'm subscribed to this thread, so if you drop a note here when things are working again, I'll be happy to unset "No New Tasks".

- Jacob

zioriga
Joined: 15 Apr 15
Posts: 1
Credit: 3,093,375
RAC: 0
Message 939 - Posted: 18 Jul 2016, 5:31:23 UTC

So is it better to abort all the WUs in the queue?

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 110
Credit: 403,294
RAC: 0
Message 940 - Posted: 18 Jul 2016, 9:50:51 UTC - in response to Message 939.

So is it better to abort all the WUs in the queue?


The "8000 series", for sure.
But I hope they validate the "750 series" I'm crunching...

jcastro
Volunteer developer
Volunteer tester
Project scientist
Joined: 16 Mar 15
Posts: 218
Credit: 14,859
RAC: 0
Message 941 - Posted: 18 Jul 2016, 10:41:56 UTC

Hi!

We have uploaded a new version of CRLP2011 that has that bug fixed.

Best regards, Joel

rjs5
Joined: 3 Nov 15
Posts: 18
Credit: 473,952
RAC: 0
Message 943 - Posted: 18 Jul 2016, 11:21:35 UTC - in response to Message 941.

Hi!

We have uploaded a new version of CRLP2011 that has that bug fixed.

Best regards, Joel


Joel,
It would be helpful if you gave more information in these "new application" messages.

Which version did you "upload"? I know what version I am running.
Which bug?
What should crunchers do with the jobs in progress?

Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 50,254
RAC: 0
Message 945 - Posted: 18 Jul 2016, 19:34:11 UTC
Last modified: 18 Jul 2016, 19:35:49 UTC

Joel:

Are you sure resuming-from-checkpoint is working? It seems like it's not!

Check out this stderr.txt, from an in-progress task using the 1.08 app. After it was about 80% done, I suspended all tasks, closed BOINC, reopened BOINC, then resumed all tasks, and now it looks like it's starting over from 0%. This stderr.txt makes it sound like it doesn't resume from checkpoint at all :( Can you explain, please?

MName:CRLP2011_EPI MID:6 OpT:3000000.000000 DT:0.002000 OutFreq:50 InT:2999000.000000 NumConstToChange:20 NumStatesToPrint:2 NumAlgToPrint:0
CC ID:16 NAME: G_Na in component Fast_Na_Current VALUE:14.33914484
CC ID:17 NAME: G_Na_B in component Background_Na_Current VALUE:0.000522329628
CC ID:23 NAME: G_Kr in component Rapidly_Activating_K_Current VALUE:0.02741081
CC ID:24 NAME: G_Ks in component Slowly_Activating_K_Current VALUE:0.002511026
CC ID:25 NAME: G_Kp in component Plateau_K_Current VALUE:0.002508008
CC ID:26 NAME: G_to in component Transient_Outward_K_Current VALUE:0.14560234
CC ID:28 NAME: G_K1 in component Inward_Rectifier_K_Current VALUE:0.5212067835
CC ID:29 NAME: G_ClCa in component Ca_Activated_Cl_Current VALUE:0.063621668352
CC ID:31 NAME: G_Cl_B in component Background_Cl_Current VALUE:0.009779112
CC ID:34 NAME: G_Ca in component L_Type_Calcium_Current VALUE:0.00014731931634
CC ID:47 NAME: G_Ca_B in component Background_Ca_Current VALUE:0.0006563656514
CC ID:19 NAME: Ibar_NaK in component Na_K_Pump_Current VALUE:1.07098002
CC ID:44 NAME: Ibar_NCX in component Na_Ca_Exchanger_Current VALUE:3.423105
CC ID:46 NAME: Ibar_PMCA in component Sarcolemmal_Ca_Pump_Current VALUE:0.065910928
CC ID:12 NAME: J_Ca_juncsl in component membrane VALUE:9.7435736118e-13
CC ID:13 NAME: J_Ca_slmyo in component membrane VALUE:3.7469586412e-12
CC ID:60 NAME: k_SR_leak in component SR_Fluxes VALUE:6.47701628e-06
CC ID:55 NAME: ks in component SR_Fluxes VALUE:20.63455
CC ID:58 NAME: V_max_SR_CaP in component SR_Fluxes VALUE:0.00474095564
CC ID:33 NAME: Ca_o in component Calcium_Concentrations VALUE:2.5
STP ID:0 - V in component membrane
STP ID:23 - Ca_i in component Calcium_Concentrations
CONFIG END
Doing CP It:8421908.000000
Doing CP It:16905322.000000
Doing CP It:25084729.000000
Doing CP It:33445063.000000
Doing CP It:42027710.000000
... ... ...
Doing CP It:1248476060.000000
Doing CP It:1256196617.000000MName:CRLP2011_EPI MID:6 OpT:3000000.000000 DT:0.002000 OutFreq:50 InT:2999000.000000 NumConstToChange:20 NumStatesToPrint:2 NumAlgToPrint:0
CC ID:16 NAME: G_Na in component Fast_Na_Current VALUE:14.33914484
CC ID:17 NAME: G_Na_B in component Background_Na_Current VALUE:0.000522329628
CC ID:23 NAME: G_Kr in component Rapidly_Activating_K_Current VALUE:0.02741081
CC ID:24 NAME: G_Ks in component Slowly_Activating_K_Current VALUE:0.002511026
CC ID:25 NAME: G_Kp in component Plateau_K_Current VALUE:0.002508008
CC ID:26 NAME: G_to in component Transient_Outward_K_Current VALUE:0.14560234
CC ID:28 NAME: G_K1 in component Inward_Rectifier_K_Current VALUE:0.5212067835
CC ID:29 NAME: G_ClCa in component Ca_Activated_Cl_Current VALUE:0.063621668352
CC ID:31 NAME: G_Cl_B in component Background_Cl_Current VALUE:0.009779112
CC ID:34 NAME: G_Ca in component L_Type_Calcium_Current VALUE:0.00014731931634
CC ID:47 NAME: G_Ca_B in component Background_Ca_Current VALUE:0.0006563656514
CC ID:19 NAME: Ibar_NaK in component Na_K_Pump_Current VALUE:1.07098002
CC ID:44 NAME: Ibar_NCX in component Na_Ca_Exchanger_Current VALUE:3.423105
CC ID:46 NAME: Ibar_PMCA in component Sarcolemmal_Ca_Pump_Current VALUE:0.065910928
CC ID:12 NAME: J_Ca_juncsl in component membrane VALUE:9.7435736118e-13
CC ID:13 NAME: J_Ca_slmyo in component membrane VALUE:3.7469586412e-12
CC ID:60 NAME: k_SR_leak in component SR_Fluxes VALUE:6.47701628e-06
CC ID:55 NAME: ks in component SR_Fluxes VALUE:20.63455
CC ID:58 NAME: V_max_SR_CaP in component SR_Fluxes VALUE:0.00474095564
CC ID:33 NAME: Ca_o in component Calcium_Concentrations VALUE:2.5
STP ID:0 - V in component membrane
STP ID:23 - Ca_i in component Calcium_Concentrations
CONFIG END
LoadFromCP: it = 0.000000
Doing CP It:7751370.000000
Doing CP It:15470000.000000
... ...

Chus Carro
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Mar 15
Posts: 82
Credit: 424,346
RAC: 0
Message 946 - Posted: 19 Jul 2016, 8:41:19 UTC - in response to Message 945.

Could you send us the link to the task? Thank you!
____________
Jesús Carro
San Jorge University
@ChusCarro

Col323
Joined: 5 Oct 15
Posts: 17
Credit: 1,335,501
RAC: 0
Message 947 - Posted: 19 Jul 2016, 11:53:13 UTC

Not sure all bugs are squashed just yet. Last night I was running WU GD_jcarro_20160714201545000000_ThirdSimulations_SteadyState3000Schmidt98_conf_689.xml_2. It was at 8 hours and about 65% complete. This morning, it was at 15 hours and only 5% complete. The machine did not shut off or reboot overnight. This sounds like the "hit 100% and start over" bug others have reported.

I verified it was version 1.08. I have aborted the task.

jcastro
Volunteer developer
Volunteer tester
Project scientist
Joined: 16 Mar 15
Posts: 218
Credit: 14,859
RAC: 0
Message 948 - Posted: 19 Jul 2016, 13:10:56 UTC
Last modified: 19 Jul 2016, 13:11:31 UTC

Hi!
It's strange behavior, and it doesn't appear on other machines, so it may be related to a problem reading or writing the files.

We are looking into the report and will try to solve it as fast as possible.

Best regards, Joel

Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 50,254
RAC: 0
Message 949 - Posted: 19 Jul 2016, 13:38:44 UTC - in response to Message 946.

Could you send us the link to the task? Thank you!


Chus / Joel:

Here is a link to the task that appeared to restart from the beginning several times, once each time I closed BOINC. I think resuming-from-checkpoint is broken. Please fix.

http://denis.usj.es/denisathome/result.php?resultid=38373900

Thanks,
Jacob

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 110
Credit: 403,294
RAC: 0
Message 950 - Posted: 19 Jul 2016, 15:21:51 UTC

No problems with my reboots. WUs restart correctly from checkpoint.

mm67
Joined: 12 Jul 15
Posts: 7
Credit: 43,028,399
RAC: 0
Message 951 - Posted: 19 Jul 2016, 15:56:13 UTC

A couple of 1.08 tasks just restarted from the beginning, and all had this message in the log:

Task GD_jcarro_20160714202348000000_ThirdSimulations_SteadyState8000Schmidt98_conf_637.xml_2 exited with zero status but no 'finished' file

jcastro
Volunteer developer
Volunteer tester
Project scientist
Joined: 16 Mar 15
Posts: 218
Credit: 14,859
RAC: 0
Message 952 - Posted: 19 Jul 2016, 19:01:14 UTC

We are looking for some kind of pattern in the bug, so if you could send me more information about your computer and any unusual configuration you have, it would be really appreciated. (You can send it via email or private message in the forum.)

Best regards, Joel.

Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 50,254
RAC: 0
Message 953 - Posted: 19 Jul 2016, 19:06:30 UTC
Last modified: 19 Jul 2016, 19:09:23 UTC

Which bug? I think you have multiple.

Regarding the resume-from-checkpoint problem, when you restart BOINC to resume a task, does your stderr.txt (in your slot folder) ever say it resumes correctly?

Mine always says:

LoadFromCP: it = 0.000000
... and the progress restarts from 0%, but Elapsed does not reset.
Surely you can reproduce that behavior?

I would think it should resume (LoadFromCP) from a non-zero-iteration (it = xxx).
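
The check described above can be automated. The following is a minimal sketch, assuming stderr.txt contains only the two line shapes visible in the log posted earlier in this thread ("Doing CP It:<n>" and "LoadFromCP: it = <n>"). The function name `find_lost_resumes` is hypothetical and not part of the DENIS application or BOINC.

```python
import re

# Matches either a checkpoint write or a checkpoint load, in the formats
# seen in the stderr.txt posted in this thread (assumed to be stable).
EVENT_RE = re.compile(
    r"Doing CP It:(\d+(?:\.\d+)?)|LoadFromCP: it = (\d+(?:\.\d+)?)"
)


def find_lost_resumes(stderr_text):
    """Return (last_checkpoint, resumed_at) pairs where a resume went backwards."""
    lost = []
    last_checkpoint = 0.0
    for match in EVENT_RE.finditer(stderr_text):
        checkpoint, resumed = match.groups()
        if checkpoint is not None:
            # A checkpoint was written at this iteration.
            last_checkpoint = float(checkpoint)
        else:
            # The app loaded a checkpoint; compare with the last one written.
            resumed_at = float(resumed)
            if resumed_at < last_checkpoint:
                lost.append((last_checkpoint, resumed_at))
            last_checkpoint = resumed_at
    return lost


# Shortened excerpt of the log from this thread (config lines omitted).
sample = (
    "Doing CP It:1248476060.000000 Doing CP It:1256196617.000000"
    "MName:CRLP2011_EPI CONFIG END LoadFromCP: it = 0.000000 "
    "Doing CP It:7751370.000000"
)
print(find_lost_resumes(sample))  # prints [(1256196617.0, 0.0)]
```

An empty list would mean every resume picked up at or after the last checkpoint written; any entry is a restart that lost progress.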

mm67
Joined: 12 Jul 15
Posts: 7
Credit: 43,028,399
RAC: 0
Message 954 - Posted: 19 Jul 2016, 22:23:14 UTC - in response to Message 952.

There's absolutely nothing unusual about the rig that restarted those tasks. It's this one: http://denis.usj.es/denisathome/show_host_detail.php?hostid=4788 and this is one of the tasks that restarted: http://denis.usj.es/denisathome/workunit.php?wuid=18823645

jcastro
Volunteer developer
Volunteer tester
Project scientist
Joined: 16 Mar 15
Posts: 218
Credit: 14,859
RAC: 0
Message 956 - Posted: 20 Jul 2016, 14:22:24 UTC

We know that more than one computer is affected. But we have run tests on our Windows computer with a fresh Windows 10 install, and it seems to work correctly. So we need more information to reproduce the bug before we can work on its solution.

P.S.: http://denis.usj.es/denisathome/result.php?resultid=38370457

Best regards, Joel.

mm67
Joined: 12 Jul 15
Posts: 7
Credit: 43,028,399
RAC: 0
Message 957 - Posted: 20 Jul 2016, 14:33:01 UTC - in response to Message 956.

I had the same restarting problems on several Windows and Linux systems. All those tasks had this in their name: SteadyState8000. SteadyState tasks with numbers other than 8000 had no problems.
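
One way to see why only the 8000 tasks might be affected is to count iterations. This is a rough sketch, not the project's code. It takes OpT and DT from the stderr.txt posted earlier in this thread and assumes that the iteration count is OpT / DT, that the overflowing variable is a signed 32-bit integer, and that a SteadyState8000 task uses OpT = 8000000 with the same DT. None of these assumptions has been confirmed by the developers.

```python
from decimal import Decimal

INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def total_iterations(op_t, dt):
    """Number of time steps for a run, computed with exact decimal arithmetic."""
    return int(Decimal(op_t) / Decimal(dt))


def fits_in_int32(count):
    """True if the count can be stored in a signed 32-bit integer."""
    return count <= INT32_MAX


# OpT and DT copied from the stderr.txt posted earlier in the thread.
posted = total_iterations("3000000.000000", "0.002000")

# ASSUMPTION: OpT for a SteadyState8000 task, inferred from the task name only.
assumed_8000 = total_iterations("8000000.000000", "0.002000")

print(posted, fits_in_int32(posted))              # prints 1500000000 True
print(assumed_8000, fits_in_int32(assumed_8000))  # prints 4000000000 False
```

Under these assumptions the 8000 runs would exceed a 32-bit counter while shorter runs would not. It would not by itself explain the posted log, where a run whose count fits still resumed from iteration 0.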

Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 50,254
RAC: 0
Message 958 - Posted: 20 Jul 2016, 14:37:11 UTC
Last modified: 20 Jul 2016, 14:39:10 UTC

Edit: Yeah, the SteadyState8000 tasks seem to be the problematic ones. Make sure you are testing those.

Well, I'm using an Insider Build of Windows 10, the latest fast-ring Build 14393. And I've seen the behavior on multiple PCs, all running Insider Builds, not just one PC. You can use Settings to sign up for the Insider Builds, and it takes 24-48 hours to get offered the build.

While it's possible the Insider build is the cause, I think it's unlikely, unless the app itself isn't executing correctly on the new build (i.e., an app problem).

Is there anything special the app does, when it tries to resume from checkpoint?



Copyright © 2020 Universidad San Jorge