Posts by Jacob Klein
1) Message boards : Number crunching : Very long wus (Message 983)
Posted 23 Jul 2016 by Jacob Klein
I have changed my web setting, to allow Beta, so I can help test.
It downloaded some "Beta v1.03" tasks. Is this a new executable (with potential fixes), or an old one (from a prior beta test)?

Also, you should consider putting release notes in the News section, for each public and each beta release.

Edit:
Checkpointing is still not working on the Beta, per the stderr output below. Dang.
Doing CP It:8630444.000000
Doing CP It:17270947.000000
Doing CP It:25836125.000000
Doing CP It:34510980.000000

... restarted BOINC ...
Checkpoint file not found
Doing CP It:8609762.000000
2) Message boards : Number crunching : Very long wus (Message 981)
Posted 23 Jul 2016 by Jacob Klein
:) That one with the large negative iteration numbers ... looks like a v1.07 task. That version had a problem where the iteration number was stored in a variable that was too small to hold it (hence the large negative numbers).

You need to cancel your v1.07 tasks, and get some v1.08 tasks.

But, warning, when you get the v1.08 tasks, I'm betting you'll encounter the same issues some of us are seeing -- 1) not resuming from checkpoints, and 2) looping instead of completing.

I've set "No New Tasks". Devs are trying to solve those problems, listed throughout this thread.
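
To illustrate the overflow described above, here is a minimal sketch of what happens when a counter outgrows a signed 32-bit integer. It is an assumption that the v1.07 app used a 32-bit variable; the thread only says the variable was "too small". The helper name as_int32 and the sample values are hypothetical.

```python
def as_int32(value: int) -> int:
    """Return the value a signed 32-bit variable would hold after
    storing `value` (two's-complement wraparound)."""
    return (value + 2**31) % 2**32 - 2**31


# The largest value that still fits is unchanged.
print(as_int32(2**31 - 1))      # 2147483647
# One step further wraps to the most negative value.
print(as_int32(2**31))          # -2147483648
# A hypothetical iteration count of 3 billion reads back as negative.
print(as_int32(3_000_000_000))  # -1294967296
```

A 64-bit integer or a double avoids this for iteration counts of this size.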
3) Message boards : Number crunching : Very long wus (Message 979)
Posted 22 Jul 2016 by Jacob Klein
Robert:

Take a peek at the stderr.txt files in the slots folders for each of the tasks. Does it look like it started over on its own yet -- Called boinc_finish(0), then restarted all over, from iteration: 0 ?

If so, then the task already looped on you.
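
If you'd rather not eyeball the file, here is a rough sketch of a check you could run over the contents of stderr.txt. It assumes the checkpoint lines look like the "Doing CP It:..." lines quoted elsewhere in this thread, and it treats any checkpoint whose iteration number is lower than the previous one as a restart. The function names are hypothetical and this is not part of BOINC or the DENIS app.

```python
import re
from typing import List

# Matches lines such as "Doing CP It:8421908.000000" (format assumed
# from the stderr excerpts posted in this thread).
CP_LINE = re.compile(r"Doing CP It:(\d+)(?:\.\d+)?")


def checkpoint_iterations(stderr_text: str) -> List[int]:
    """Return the iteration number of every checkpoint, in log order."""
    return [int(m.group(1)) for m in CP_LINE.finditer(stderr_text)]


def count_restarts(stderr_text: str) -> int:
    """Count the places where the checkpoint iteration goes backwards,
    which suggests the task started over instead of continuing."""
    its = checkpoint_iterations(stderr_text)
    return sum(1 for prev, cur in zip(its, its[1:]) if cur < prev)


sample = (
    "Doing CP It:8421908.000000 Doing CP It:16905322.000000 "
    "LoadFromCP: it = 0.000000 Doing CP It:7751370.000000"
)
print(count_restarts(sample))  # 1
```

A count above zero means the iteration counter dropped at least once, whether that came from a failed resume or from a loop after boinc_finish(0).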
4) Message boards : Number crunching : Very long wus (Message 975)
Posted 22 Jul 2016 by Jacob Klein
Thanks, Joel. You are making the right call, with regard to both suspending work and granting credits. Please note that I don't care at all about the credits. I care about my CPU cycles not being wasted. Once you have something for us to test, be sure to let us know in this thread, and I'd be more than happy to unset "No New Tasks" to help test!

- Jacob
5) Message boards : Number crunching : WUs reset? (Message 971)
Posted 22 Jul 2016 by Jacob Klein
We're tracking a couple problems in this thread:
http://denis.usj.es/denisathome/forum_thread.php?id=105

Namely:
- Some tasks aren't checkpointing properly
- Some tasks aren't finishing gracefully, and instead keep restarting from the beginning indefinitely.

:/

You might want to subscribe to that other thread, and hope the developers can fix it up. I've set "No New Tasks" on all my PCs, until it can work correctly.
6) Message boards : Number crunching : Very long wus (Message 969)
Posted 21 Jul 2016 by Jacob Klein
Here's another:
http://denis.usj.es/denisathome/result.php?resultid=38376733

After a couple "unnecessary full restarts", I let it run for a very long time continuously. It eventually called "boinc_finish(0)", and I watched it go from 100% to 0%, and keep wasting my CPU. I then aborted it.

Very frustrating. I hope you can figure it out.
7) Message boards : Number crunching : Very long wus (Message 967)
Posted 21 Jul 2016 by Jacob Klein
Here's another v1.08 task that didn't function correctly.
http://denis.usj.es/denisathome/result.php?resultid=38374146

It looks like it:
- Checkpointed up to iteration: 1484766329
- I (presume that I) exited BOINC
- Loaded from checkpoint iteration: 1484766329
- Checkpointed up to iteration: 8456372551
- Called boinc_finish(0)
- Did not exit gracefully
- Continued by restarting the task all over, from iteration: 0
- Checkpointed up to iteration: 1204232792
- I aborted it.

I have again set "No New Tasks" for this project, since otherwise it's wasting my CPU resources. I hope you can do extra testing, to get a better handle on solving your issues.

For your reference, I do use the option "Leave non-GPU tasks in memory while suspended", and generally I use Activity->Suspend, before exiting BOINC. The PC that this task was aborted on, is on Windows 10 Insider Build 14393, with BOINC installed as a service. I'm attached to 50+ projects, get work from about 10-15 of them, and yours is the only one not working.

Good luck fixing it. I'll be monitoring this thread.

Thanks,
Jacob Klein
8) Message boards : Number crunching : Very long wus (Message 959)
Posted 20 Jul 2016 by Jacob Klein
Maybe all the SteadyState ones are affected??
Make sure to test SteadyState tasks.

Here's a SteadyState1000, that also didn't resume from checkpoint correctly:
http://denis.usj.es/denisathome/result.php?resultid=38373900

Here's a SteadyState2000, that also didn't resume from checkpoint correctly:
http://denis.usj.es/denisathome/result.php?resultid=38322366

Here's a SteadyState3000, that also didn't resume from checkpoint correctly:
http://denis.usj.es/denisathome/result.php?resultid=38325794
9) Message boards : Number crunching : Very long wus (Message 958)
Posted 20 Jul 2016 by Jacob Klein
Edit: Yeah, the SteadyState8000 tasks seem to be the problematic ones - make sure you are testing those.

Well, I'm using an Insider Build of Windows 10, the latest fast-ring Build 14393. And I've seen the behavior on multiple PCs, all running Insider Builds, not just one PC. You can use Settings to sign up for the Insider Builds, and it takes 24-48 hours to get offered the build.

While it's possible the Insider build is the cause, I think it's unlikely, unless the app itself isn't executing correctly on the new build (i.e. an app problem).

Is there anything special the app does, when it tries to resume from checkpoint?
10) Message boards : Number crunching : Very long wus (Message 953)
Posted 19 Jul 2016 by Jacob Klein
Which bug? I think you have multiple.

Regarding the resume-from-checkpoint problem, when you restart BOINC to resume a task, does your stderr.txt (in your slot folder) ever say it resumes correctly?

Mine always says:
LoadFromCP: it = 0.000000
... and the progress restarts from 0%, but Elapsed does not reset.
Surely you can reproduce that behavior?

I would think it should resume (LoadFromCP) from a non-zero iteration (it = xxx).
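
To show the behavior I'd expect, here is a minimal sketch of checkpoint and resume logic. It is not the DENIS app's code and does not use the BOINC API; the file format (JSON), the function names, and the toy workload are all assumptions made for illustration. The point is only that a restart should pick up at the last saved iteration rather than at 0.

```python
import json
import os
import tempfile


def save_checkpoint(path, iteration, state):
    """Write the checkpoint to a temp file, then atomically swap it in,
    so an interruption never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"iteration": iteration, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return (iteration, state), or (0, None) when there is no usable
    checkpoint and the task really has to start from scratch."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return 0, None
    return data["iteration"], data["state"]


def run(path, total_iterations, checkpoint_every):
    """Toy workload: sum the iteration indices, checkpointing as it goes.
    Returns (iteration it started from, final sum)."""
    iteration, state = load_checkpoint(path)
    total = state if state is not None else 0
    start = iteration
    while iteration < total_iterations:
        total += iteration
        iteration += 1
        if iteration % checkpoint_every == 0:
            save_checkpoint(path, iteration, total)
    return start, total
```

Running it once for 6 iterations and then again for 10 against the same checkpoint file makes the second run start at iteration 6, not 0, and still produce the same sum as an uninterrupted run.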
11) Message boards : Number crunching : Very long wus (Message 949)
Posted 19 Jul 2016 by Jacob Klein
Could you send us the link to the task? Thank you!


Chus / Joel:

Here is a link to the task that appeared to restart-from-beginning several times, each time I closed BOINC. I think resuming-from-checkpoint is broken. Please fix.

http://denis.usj.es/denisathome/result.php?resultid=38373900

Thanks,
Jacob
12) Message boards : Number crunching : Very long wus (Message 945)
Posted 18 Jul 2016 by Jacob Klein
Joel:

Are you sure resuming-from-checkpoint is working? It seems like it's not!

Check out this stderr.txt, from an in-progress task using the 1.08 app. After it was about 80% done, I suspended all tasks, closed BOINC, reopened BOINC, and resumed all tasks. Now it looks like it's starting over from 0%. This stderr.txt makes it sound like it doesn't resume from checkpoint at all :( Can you explain please?

MName:CRLP2011_EPI MID:6 OpT:3000000.000000 DT:0.002000 OutFreq:50 InT:2999000.000000 NumConstToChange:20 NumStatesToPrint:2 NumAlgToPrint:0
CC ID:16 NAME: G_Na in component Fast_Na_Current VALUE:14.33914484
CC ID:17 NAME: G_Na_B in component Background_Na_Current VALUE:0.000522329628
CC ID:23 NAME: G_Kr in component Rapidly_Activating_K_Current VALUE:0.02741081
CC ID:24 NAME: G_Ks in component Slowly_Activating_K_Current VALUE:0.002511026
CC ID:25 NAME: G_Kp in component Plateau_K_Current VALUE:0.002508008
CC ID:26 NAME: G_to in component Transient_Outward_K_Current VALUE:0.14560234
CC ID:28 NAME: G_K1 in component Inward_Rectifier_K_Current VALUE:0.5212067835
CC ID:29 NAME: G_ClCa in component Ca_Activated_Cl_Current VALUE:0.063621668352
CC ID:31 NAME: G_Cl_B in component Background_Cl_Current VALUE:0.009779112
CC ID:34 NAME: G_Ca in component L_Type_Calcium_Current VALUE:0.00014731931634
CC ID:47 NAME: G_Ca_B in component Background_Ca_Current VALUE:0.0006563656514
CC ID:19 NAME: Ibar_NaK in component Na_K_Pump_Current VALUE:1.07098002
CC ID:44 NAME: Ibar_NCX in component Na_Ca_Exchanger_Current VALUE:3.423105
CC ID:46 NAME: Ibar_PMCA in component Sarcolemmal_Ca_Pump_Current VALUE:0.065910928
CC ID:12 NAME: J_Ca_juncsl in component membrane VALUE:9.7435736118e-13
CC ID:13 NAME: J_Ca_slmyo in component membrane VALUE:3.7469586412e-12
CC ID:60 NAME: k_SR_leak in component SR_Fluxes VALUE:6.47701628e-06
CC ID:55 NAME: ks in component SR_Fluxes VALUE:20.63455
CC ID:58 NAME: V_max_SR_CaP in component SR_Fluxes VALUE:0.00474095564
CC ID:33 NAME: Ca_o in component Calcium_Concentrations VALUE:2.5
STP ID:0 - V in component membrane
STP ID:23 - Ca_i in component Calcium_Concentrations
CONFIG END
Doing CP It:8421908.000000
Doing CP It:16905322.000000
Doing CP It:25084729.000000
Doing CP It:33445063.000000
Doing CP It:42027710.000000
... ... ...
Doing CP It:1248476060.000000
Doing CP It:1256196617.000000
MName:CRLP2011_EPI MID:6 OpT:3000000.000000 DT:0.002000 OutFreq:50 InT:2999000.000000 NumConstToChange:20 NumStatesToPrint:2 NumAlgToPrint:0
CC ID:16 NAME: G_Na in component Fast_Na_Current VALUE:14.33914484
CC ID:17 NAME: G_Na_B in component Background_Na_Current VALUE:0.000522329628
CC ID:23 NAME: G_Kr in component Rapidly_Activating_K_Current VALUE:0.02741081
CC ID:24 NAME: G_Ks in component Slowly_Activating_K_Current VALUE:0.002511026
CC ID:25 NAME: G_Kp in component Plateau_K_Current VALUE:0.002508008
CC ID:26 NAME: G_to in component Transient_Outward_K_Current VALUE:0.14560234
CC ID:28 NAME: G_K1 in component Inward_Rectifier_K_Current VALUE:0.5212067835
CC ID:29 NAME: G_ClCa in component Ca_Activated_Cl_Current VALUE:0.063621668352
CC ID:31 NAME: G_Cl_B in component Background_Cl_Current VALUE:0.009779112
CC ID:34 NAME: G_Ca in component L_Type_Calcium_Current VALUE:0.00014731931634
CC ID:47 NAME: G_Ca_B in component Background_Ca_Current VALUE:0.0006563656514
CC ID:19 NAME: Ibar_NaK in component Na_K_Pump_Current VALUE:1.07098002
CC ID:44 NAME: Ibar_NCX in component Na_Ca_Exchanger_Current VALUE:3.423105
CC ID:46 NAME: Ibar_PMCA in component Sarcolemmal_Ca_Pump_Current VALUE:0.065910928
CC ID:12 NAME: J_Ca_juncsl in component membrane VALUE:9.7435736118e-13
CC ID:13 NAME: J_Ca_slmyo in component membrane VALUE:3.7469586412e-12
CC ID:60 NAME: k_SR_leak in component SR_Fluxes VALUE:6.47701628e-06
CC ID:55 NAME: ks in component SR_Fluxes VALUE:20.63455
CC ID:58 NAME: V_max_SR_CaP in component SR_Fluxes VALUE:0.00474095564
CC ID:33 NAME: Ca_o in component Calcium_Concentrations VALUE:2.5
STP ID:0 - V in component membrane
STP ID:23 - Ca_i in component Calcium_Concentrations
CONFIG END
LoadFromCP: it = 0.000000
Doing CP It:7751370.000000
Doing CP It:15470000.000000
... ...
13) Message boards : Number crunching : Very long wus (Message 938)
Posted 17 Jul 2016 by Jacob Klein
Thanks, Joel. I hope you get it sorted.

I'm subscribed to this thread, so if you drop a note in this thread, when things are working again, I'll be happy to unset "No New Tasks".

- Jacob
14) Message boards : Number crunching : Very long wus (Message 935)
Posted 17 Jul 2016 by Jacob Klein
I just watched a task work its way up to 99.999% progress (with 0:00:01 remaining), then it went to 0% progress (with 1380+ days remaining). So, in addition to "resuming from checkpoint" possibly not working, I think "exiting normally and gracefully at 100%" is also broken.

This is enough for me to set "No New Tasks" for all my PCs, and abort any in-progress DENIS work. In general, the aborted tasks had accumulated about 42 hours of run-time each, before I aborted them. :(

DEVS:
PLEASE figure these problems out, as you are wasting resources.

Thanks,
Jacob
15) Message boards : Number crunching : Very long wus (Message 932)
Posted 17 Jul 2016 by Jacob Klein
From reading the stderr.txt files in the slots folders of some of my in-progress tasks... It seems that "resuming from checkpoint" is actually "restarting from scratch", on my 8000-series tasks.

Devs: Is that a bug?

I'm setting No New Tasks for now, as it is nearly impossible for me to guarantee nearly 24 hours of continuous runtime for the tasks, in my heavily-hyperthreaded PC.

I need "resuming from checkpoint" to be reliable.
16) Message boards : Number crunching : Bug: Email Notifications (Message 109)
Posted 13 Apr 2015 by Jacob Klein
I received the email notification for this thread, which I was subscribed to.
17) Message boards : Number crunching : Request: Increase deadlines (Message 30)
Posted 9 Apr 2015 by Jacob Klein
Let's keep this thread on-topic.

Please increase the deadlines :)
Otherwise, those of us with GPUs and coprocessors might have to detach :(
18) Message boards : Number crunching : Bug: Email Notifications (Message 12)
Posted 8 Apr 2015 by Jacob Klein
I have my profile set to "Email immediately", and I have subscribed to a thread. Yet, when a reply is posted, I do not get an email. Do you have that feature set up correctly?
19) Message boards : Number crunching : Request: Increase deadlines (Message 7)
Posted 8 Apr 2015 by Jacob Klein
Absolutely, I'm glad to be here, and glad to be able to give feedback. :)
So far, my feedback is: Deadlines are too short, causing my GPUs and bitcoin miner coprocessors to go idle, because BOINC prioritizes your tasks and runs them in "deadline-risk mode / earliest-deadline-first mode".

If you don't really need them in 12 hours time, then I recommend something like 5 days or 2 weeks.

Side note: For me, they complete in 2 minutes. That's potentially a lot of overhead in transferring data back and forth, and in looking at the task results lists. If possible, it might be better to have the tasks take around 4-48 hours each, with checkpoints.
20) Message boards : Number crunching : Request: Increase deadlines (Message 5)
Posted 8 Apr 2015 by Jacob Klein
DENIS admins:

Can you please increase your deadlines? They currently are something like 12 hours, and then get prioritized by some clients because they are so short, and then my GPUs go idle because your tasks are run in deadline-risk mode.

How quickly do you need the results? Can you consider changing it to 5 days or greater? Maybe even 2 weeks? I'd like my GPUs to be able to do work again, while running your project.

Thanks,
Jacob






Copyright © 2019 Universidad San Jorge