version 1.04 checkpoint experiments (windows) and now 1.05 experiments...

Pettra
Joined: 5 May 15
Posts: 6
Credit: 27,136
RAC: 0
Message 269 - Posted: 4 Jun 2015, 14:52:33 UTC
Last modified: 4 Jun 2015, 14:58:42 UTC

If this is useful.

Experiment #1:
Suspended task http://denis.usj.es/denisathome/result.php?resultid=4610080 without "leave applications in memory while suspended".
Ended in computation error.

Experiment #2:
Suspended task http://denis.usj.es/denisathome/result.php?resultid=4610161 with "leave applications in memory while suspended". Event log reported task checkpointing approximately every 1 minute 6 seconds.

Reached 100% and BOINC Manager immediately crashed.

Restarted BOINC and all tasks (including tasks from other projects) had disappeared. Had to reboot the computer. After doing so, all remaining Denis tasks (8 of them) errored out immediately. Examples: one that had been in progress at the time of the crash http://denis.usj.es/denisathome/result.php?resultid=4610069 and one that had not http://denis.usj.es/denisathome/result.php?resultid=4610172

Now waiting to see if the task I suspended validates against wingman. http://denis.usj.es/denisathome/workunit.php?wuid=2254606
Stderr output looks normal.

Might just be a problem at my end.

edit: suspending denis tasks is not something I actually have ever NEEDED to do. So it is not likely to be a problem for me. I was just being curious.
ID: 269 · Rating: 0
jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 271 - Posted: 4 Jun 2015, 15:00:38 UTC - in response to Message 269.  

Hi!

First of all, sorry about the inconvenience. It's great that you gave us this information, because some problems are really difficult to isolate. From your report and some others, we have seen that checkpointing does not seem to work when the computer restarts.

With your information we can try to reproduce the bug and solve it.

We will solve it as soon as possible.

Thanks, Joel.
ID: 271 · Rating: 0
Volodymyr Goroshko
Joined: 9 Apr 15
Posts: 1
Credit: 450,419
RAC: 0
Message 274 - Posted: 4 Jun 2015, 19:32:25 UTC - in response to Message 271.  

Hi!
Recently I have noticed a lot of tasks ending with an error. Has it been determined whether this is a problem with boinc-client ver. 6.10.45 for domain or with old CPUs? Thank you.
ID: 274 · Rating: 0
Yavanius
Joined: 20 May 15
Posts: 50
Credit: 340,808
RAC: 859
Message 279 - Posted: 8 Jun 2015, 20:40:50 UTC

Sorry for the delay in posting, but I've also been noticing a lot of the same errors across two different computers, so it's not just one computer/client.

Host IDs 2885 and 2886
ID: 279 · Rating: 0
jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 280 - Posted: 8 Jun 2015, 22:11:58 UTC - in response to Message 279.  

Hi!

In the new version we tried to prevent the application from re-using an old checkpoint file. But this change, made to fix the v1.02 and v1.03 bug, introduced a new bug in the way the file is opened.

We are now working on a more fault-tolerant way to open these files.

Thanks for your patience, Joel.
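
Editor's illustration: the two goals described above (not re-using an old checkpoint, and opening the file in a fault-tolerant way) could be sketched roughly as follows. This is not the DENIS application's code. The JSON layout, the `task_id` field and the function names `save_checkpoint` and `load_checkpoint` are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, task_id, state):
    """Write the checkpoint atomically so a crash cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            # Hypothetical layout: the task name is stored next to the state.
            json.dump({"task_id": task_id, "state": state}, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic rename over any previous checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, task_id):
    """Return the saved state, or None if the file is missing, corrupt or stale."""
    try:
        with open(path, "r") as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing, unreadable or truncated file: start from scratch instead of failing.
        return None
    if not isinstance(data, dict) or data.get("task_id") != task_id:
        # Checkpoint left over from a different task: ignore it.
        return None
    return data.get("state")
```

A caller would treat `None` as "no usable checkpoint" and begin the computation from the start, so a bad file costs run time rather than producing a computation error.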
ID: 280 · Rating: 0
Pettra
Joined: 5 May 15
Posts: 6
Credit: 27,136
RAC: 0
Message 292 - Posted: 15 Jun 2015, 20:09:06 UTC
Last modified: 15 Jun 2015, 20:16:17 UTC

As requested here :) I will try the same experiment I did with version 1.04, but with version 1.05 - although it might not be till tomorrow now. Hope that's ok.

edit: to alter thread title :)
ID: 292 · Rating: 0
Pettra
Joined: 5 May 15
Posts: 6
Credit: 27,136
RAC: 0
Message 300 - Posted: 17 Jun 2015, 1:11:36 UTC
Last modified: 17 Jun 2015, 1:13:12 UTC

Sorry. Slight delay since my last message :( power outage. Only had chance to do this so far.

Suspending without "leave applications in memory" checked

Task resumed normally :) and didn't error out :) and validated :)

"Properties dialogue box" appeared to show checkpointing correctly, as did the event log (with checkpoint debug enabled)

HOWEVER :/ runtime seems to extend - by about the same time that the task reached when it was suspended.

for example: with my current task runtimes averaging around 33 minutes, I suspended this one after thirty minutes:

AF201506161800_300_2166 http://denis.usj.es/denisathome/result.php?resultid=4901628 It took an extra 32 minutes to complete (runtime total of 1:02:48) but did validate.

Might just be a longer wu of course, and so a coincidence, so I will repeat with a couple more and will also suspend one more than once :) just to see what happens... :)

Will experiment with "leave applications in memory" checked, sometime tomorrow hopefully.
ID: 300 · Rating: 0
jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 305 - Posted: 17 Jun 2015, 11:13:16 UTC - in response to Message 300.  


Thank you! It's great to be in a community with people like you, so helpful and kind to us. Recovery from a checkpoint usually costs some time, and we are working now to decrease it.

We hope the next experiment will be a success.

And again, thanks for your help! It is great to have people like you in our project.

Best regards, Joel.
ID: 305 · Rating: 0
Pettra
Joined: 5 May 15
Posts: 6
Credit: 27,136
RAC: 0
Message 308 - Posted: 17 Jun 2015, 13:09:55 UTC
Last modified: 17 Jun 2015, 13:10:29 UTC

Recovery from a checkpoint usually costs some time, and we are working now to decrease it.

Great! :) Thanks Joel. Recovery time doesn't seem to be significant when suspended tasks are stored in memory however (experiment #2 below) - so good news! :)

First though, I have a little more to add to my last post:

EXPERIMENT #1 (cont)
Tasks suspended without "leave applications in memory" checked...
AF201506161800_300_2172: http://denis.usj.es/denisathome/result.php?resultid=4901641
suspended with elapsed time of 0:30:18 (checkpointed at 0:29:00 according to properties dialog box)
Task finished with runtime of: 1:01:29

Task AF201506161800_300_2129: http://denis.usj.es/denisathome/result.php?resultid=4901554
suspended with elapsed time of 0:30:19 (checkpointed at 0:29:13 according to properties dialog box)
Suspended again at 1:00:13 (checkpoint per properties dialog box 0:59:29)
Task finished with runtime of: 1:32:13

AF201506161800_300_2160: http://denis.usj.es/denisathome/result.php?resultid=4901617
suspended with elapsed time of 0:20:00 (checkpointed at 0:19:08 according to properties dialog box)
Task finished with runtime of 1:00:23 so could have been a longer wu

AF201506161800_300_2155 http://denis.usj.es/denisathome/result.php?resultid=4901606
suspended with elapsed time of 0:10:47 (checkpointed at 0:09:37 according to properties dialog box)
Task finished with runtime of 0:49:56

THEY ALL VALIDATED instead of erroring out!!! YAY!!! :)


EXPERIMENT #2: Suspended WITH "leave applications in memory" checked...

Task AF201506161800_300_2112 http://denis.usj.es/denisathome/result.php?resultid=4901520
suspended with elapsed time of 0:28:34 (checkpointed at 0:28:06 according to properties dialog box)
Task finished with runtime of 0:31:18

Task AF201506161800_300_2125 http://denis.usj.es/denisathome/result.php?resultid=4901547
suspended with elapsed time of 0:30:22 (checkpointed at 0:29:11 according to properties dialog box)
Task finished with runtime of 0:32:17

Runtimes effectively match my average task times :)

Not only that... but when I did this with version 1.04 boinc crashed but task validated...

This time boinc DIDN'T crash and tasks validated!! :)
So YAY again :)
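
Editor's illustration: the run times reported above can be compared with a little arithmetic. The sketch below uses only the figures from this post. The helper names `to_seconds` and `remaining_after_resume` are made up for the example, and for the task that was suspended twice only the second suspension is used.

```python
def to_seconds(hms):
    """Convert an h:mm:ss string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


# (task, elapsed time at the last suspension, final run time), as reported above
not_in_memory = [
    ("AF201506161800_300_2172", "0:30:18", "1:01:29"),
    ("AF201506161800_300_2129", "1:00:13", "1:32:13"),
    ("AF201506161800_300_2160", "0:20:00", "1:00:23"),
    ("AF201506161800_300_2155", "0:10:47", "0:49:56"),
]
in_memory = [
    ("AF201506161800_300_2112", "0:28:34", "0:31:18"),
    ("AF201506161800_300_2125", "0:30:22", "0:32:17"),
]


def remaining_after_resume(rows):
    """Seconds each task ran between its last suspension and completion."""
    return [
        (task, to_seconds(final) - to_seconds(suspended))
        for task, suspended, final in rows
    ]


for label, rows in (("not in memory", not_in_memory), ("in memory", in_memory)):
    for task, seconds in remaining_after_resume(rows):
        print(f"{label:14s} {task}  {seconds // 60}m {seconds % 60:02d}s after resume")
```

Without "leave applications in memory", each task ran for another 31 to 40 minutes after its last resume, which is about as long as a complete task (roughly 33 minutes on average). With the option checked, only 2 to 3 minutes remained. The numbers alone do not show why; as noted above, some of these could simply have been longer work units.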
ID: 308 · Rating: 0
kain
Joined: 16 Apr 15
Posts: 20
Credit: 4,935,344
RAC: 6,877
Message 315 - Posted: 18 Jun 2015, 10:27:34 UTC

Good job! :)
ID: 315 · Rating: 0
jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 316 - Posted: 18 Jun 2015, 10:46:13 UTC - in response to Message 315.  

The DENIS staff is also really proud to have crunchers like you in our project! We have to create some kind of badge for people like you! (We will think about it :) )

Again, thanks a lot. DENIS Staff
ID: 316 · Rating: 0