version 1.04 checkpoint experiments (windows) and now 1.05 experiments...
Author | Message |
---|---|
Joined: 5 May 15 Posts: 6 Credit: 27,136 RAC: 0 |
In case this is useful. Experiment #1: Suspended task http://denis.usj.es/denisathome/result.php?resultid=4610080 without "leave applications in memory while suspended". Ended in computation error. Experiment #2: Suspended task http://denis.usj.es/denisathome/result.php?resultid=4610161 with "leave applications in memory while suspended". Event log reported the task checkpointing approximately every 1 minute 6 seconds. It reached 100% and BOINC Manager immediately crashed. Restarted BOINC and all tasks (including other projects' tasks) had disappeared. Had to reboot the computer. After doing so, all remaining Denis tasks (8 of them) errored out immediately. Examples: one that had been in progress at the time of the crash, http://denis.usj.es/denisathome/result.php?resultid=4610069, and one that was not, http://denis.usj.es/denisathome/result.php?resultid=4610172. Now waiting to see if the task I suspended validates against my wingman: http://denis.usj.es/denisathome/workunit.php?wuid=2254606 Stderr output looks normal. Might just be a problem at my end. edit: suspending Denis tasks is not something I have ever actually NEEDED to do, so it is not likely to be a problem for me. I was just being curious. |
Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
Hi! First of all, sorry about the inconvenience. It's great that you give us this info, because some problems are really difficult to isolate. From your info and some other reports, we have seen that the checkpoint does not seem to work when the computer restarts. With your information we can try to reproduce the bug and fix it. We will fix it as soon as possible. Thanks, Joel. |
Joined: 9 Apr 15 Posts: 1 Credit: 450,419 RAC: 0 |
Hi! Recently I have noticed an error in a lot of tasks. Is it known whether this is a problem with boinc-client ver. 6.10.45 for domain or with old CPUs? Thank you. |
Joined: 20 May 15 Posts: 50 Credit: 390,872 RAC: 0 |
Sorry for the delay posting, but I've also been noticing a lot of the same errors too across two different computers so it's not just the computer/client. Host IDs 2885 and 2886 |
Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
Sorry for the delay posting, but I've also been noticing a lot of the same errors too across two different computers so it's not just the computer/client. Hi! In the new version, we try to prevent the application from re-using an old checkpoint file. But this change, made to fix the v1.02 and v1.03 bug, introduced a new bug in the way the file is opened. We are now working on a more fault-tolerant way to open these files. Thanks for your patience, Joel. |
Joined: 5 May 15 Posts: 6 Credit: 27,136 RAC: 0 |
As requested here :) I will try the same experiment I did with version 1.04, but with version 1.05 - although it might not be till tomorrow now. Hope that's ok. edit: to alter thread title :) |
Joined: 5 May 15 Posts: 6 Credit: 27,136 RAC: 0 |
Sorry. Slight delay since my last message :( power outage. Only had a chance to do this so far. Suspending without "leave applications in memory" checked: task resumed normally :) and didn't error out :) and validated :) The "Properties" dialogue box appeared to show checkpointing correctly, as did the event log (with checkpoint debug enabled). HOWEVER :/ runtime seems to extend - by about the same time that the task had reached when it was suspended. For example: with my current task runtimes averaging around 33 minutes, I suspended this one after thirty minutes: AF201506161800_300_2166 http://denis.usj.es/denisathome/result.php?resultid=4901628 It took an extra 32 minutes to complete (runtime total of 1:02:48) but did validate. Might just be a longer wu of course, and so a coincidence, so I will repeat with a couple more and will also suspend one more than once :) just to see what happens... :) Will experiment with "leave applications in memory" checked sometime tomorrow, hopefully. |
Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
Sorry. Slight delay since my last message :( power outage. Only had chance to do this so far. Thank you! It's great to be in a community with people like you, so helpful and kind to us. Recovery from a checkpoint usually costs some time, and we are now working to decrease it. We hope that the next experiment will be a success. And again, thanks for your help! It is great to have people like you in our project. Best regards, Joel. |
Joined: 5 May 15 Posts: 6 Credit: 27,136 RAC: 0 |
Recovery from a checkpoint usually costs some time and we are working now to decrease it. Great! :) Thanks Joel. Recovery time doesn't seem to be significant when suspended tasks are kept in memory, however (experiment #2 below) - so good news! :) First though, I have a little more to add to my last post: EXPERIMENT #1 (cont): Tasks suspended without "leave applications in memory" checked... Task AF201506161800_300_2172: http://denis.usj.es/denisathome/result.php?resultid=4901641 suspended with elapsed time of 0:30:18 (checkpointed at 0:29:00 according to properties dialog box). Task finished with runtime of 1:01:29. Task AF201506161800_300_2129: http://denis.usj.es/denisathome/result.php?resultid=4901554 suspended with elapsed time of 0:30:19 (checkpointed at 0:29:13 according to properties dialog box). Suspended again at 1:00:13 (checkpointed at 0:59:29 according to properties dialog box). Task finished with runtime of 1:32:13. Task AF201506161800_300_2160: http://denis.usj.es/denisathome/result.php?resultid=4901617 suspended with elapsed time of 0:20:00 (checkpointed at 0:19:08 according to properties dialog box). Task finished with runtime of 1:00:23, so could have been a longer wu. Task AF201506161800_300_2155: http://denis.usj.es/denisathome/result.php?resultid=4901606 suspended with elapsed time of 0:10:47 (checkpointed at 0:09:37 according to properties dialog box). Task finished with runtime of 0:49:56. THEY ALL VALIDATED instead of erroring out!!! YAY!!! :) EXPERIMENT #2: Suspended WITH "leave applications in memory" checked... Task AF201506161800_300_2112: http://denis.usj.es/denisathome/result.php?resultid=4901520 suspended with elapsed time of 0:28:34 (checkpointed at 0:28:06 according to properties dialog box). Task finished with runtime of 0:31:18. Task AF201506161800_300_2125: http://denis.usj.es/denisathome/result.php?resultid=4901547 suspended with elapsed time of 0:30:22 (checkpointed at 0:29:11 according to properties dialog box). Task finished with runtime of 0:32:17. Runtimes effectively match my average task times :) Not only that... when I did this with version 1.04, BOINC crashed but the task validated... This time BOINC DIDN'T crash and the tasks validated!! :) So YAY again :) |
Joined: 16 Apr 15 Posts: 20 Credit: 5,195,178 RAC: 0 |
Good job! :) |
Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
Good job! :) DENIS Staff is also really proud to have crunchers like you in our project! We have to create some kind of badge for people like you! (We will think about it :) ) Again, thanks a lot, DENIS Staff |
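
The version 1.05 runtimes reported in this thread can be compared with a few lines of Python. This is only a rough sketch: it assumes the poster's "around 33 minutes" average as the baseline for every task (individual work units may well be longer or shorter, as the poster notes), and the names `BASELINE`, `RUNS`, `parse_hms`, `extra_minutes` and `mean_extra_by_mode` are made up for illustration.

```python
from datetime import timedelta
from statistics import mean

# Assumption: the poster's "around 33 minutes" average is used as the
# baseline for every task; real work units may differ in length.
BASELINE = timedelta(minutes=33)

# (task name suffix, final runtime as reported in the thread, left in memory?)
RUNS = [
    ("2166", "1:02:48", False),
    ("2172", "1:01:29", False),
    ("2129", "1:32:13", False),  # this task was suspended twice
    ("2160", "1:00:23", False),
    ("2155", "0:49:56", False),
    ("2112", "0:31:18", True),
    ("2125", "0:32:17", True),
]


def parse_hms(text):
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def extra_minutes(runtime):
    """Minutes of runtime beyond the assumed baseline (may be negative)."""
    return (parse_hms(runtime) - BASELINE).total_seconds() / 60


def mean_extra_by_mode(runs):
    """Average extra minutes, keyed by the 'left in memory' flag."""
    result = {}
    for in_memory in (False, True):
        extras = [extra_minutes(rt) for _, rt, mem in runs if mem == in_memory]
        result[in_memory] = mean(extras)
    return result


for in_memory, extra in mean_extra_by_mode(RUNS).items():
    label = "left in memory" if in_memory else "removed from memory"
    print(f"{label}: {extra:+.1f} min versus the assumed baseline")
```

The split by the "leave applications in memory" setting mirrors the two experiments above; with only seven tasks and an assumed baseline, the averages are indicative at best.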