Very long wus
Author | Message |
---|---|
Send message Joined: 6 Nov 15 Posts: 16 Credit: 325,209 RAC: 0 |
:) That one with the large negative iteration numbers ... looks like a v1.07 task, where they had problems where the iteration number was being stored in a variable that was too small (hence the large negative numbers). Both were 1.07; both now aborted. |
Send message Joined: 24 Apr 15 Posts: 1 Credit: 281,700 RAC: 0 |
I have a bunch of V1.03 with 1d17h and 81%. |
Send message Joined: 5 Oct 15 Posts: 17 Credit: 1,335,501 RAC: 0 |
BETA_23071039_8000_0173_0 looped on me (on the aforementioned Win 7 laptop), so I aborted it. This machine is also running two other Beta units. One has also looped, but I'm going to see if the second loop finishes correctly. The other unit has not looped, and I'm hoping that somehow it finishes gracefully. |
Send message Joined: 30 Aug 15 Posts: 47 Credit: 1,248,591 RAC: 0 |
I have 2 machines, one with Fedora and one with Windows 7. The Fedora machine has no problems with rebooting; the Windows machine restarts from 0 after a reboot (with Beta work). LZ Loon The Most Handsome Man on the Interweb (TM) |
Send message Joined: 24 Oct 15 Posts: 16 Credit: 595,047 RAC: 0 |
I used SysInternals Process Monitor to check what the beta application does in the slot directory during a checkpoint. The sequence is as follows:
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
Send message Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
I used SysInternals Process Monitor to check what the beta application does in the slot directory during a checkpoint. The sequence is as follows: Thanks for the info. We are trying different checkpoint strategies to avoid the problems we have. Maybe this solution is not the correct one. We are modifying it a bit more. Best regards, Joel. |
Send message Joined: 6 Nov 15 Posts: 16 Credit: 325,209 RAC: 0 |
Last night, I decided to check whether beta 1.03 could recover from checkpoints, so I shut down BOINC and then did a Windows restart on one of my computers. The task started up again, could not find the checkpoint file, and restarted from the beginning. I do not expect it to finish by the deadline. |
Send message Joined: 5 Jul 15 Posts: 3 Credit: 1,074,712 RAC: 0 |
For what it's worth, I run Linux, and if I select a task in BOINC Manager and look at the properties, one of the things it lists is the CPU time at the last checkpoint. If that's blank, I know that it has not checkpointed. I've also set no new work for DENIS on all my systems until I hear that these problems have been fixed. |
Send message Joined: 30 Aug 15 Posts: 47 Credit: 1,248,591 RAC: 0 |
|
Send message Joined: 6 Nov 15 Posts: 16 Credit: 325,209 RAC: 0 |
Some things I've seen on other BOINC projects: 1. Two separate checkpoint files per task, so that if the task is interrupted in the middle of writing a checkpoint, it can resume from the other one instead. This needs a method for determining which of the files has the more recent checkpoint, in case both contain valid checkpoints. 2. Something identifying which checkpoint is written placed at the beginning, which must match something at the end to indicate that the checkpoint is complete. Whatever is used should not match any other checkpoints from the same task. 3. Requests from the project for users to deliberately interrupt BOINC during beta tests, to check whether the checkpoints are working properly. Windows 10 has at least one update most Tuesday nights, with delaying them difficult, so you may soon see many more checkpoint resume failures from Windows 10 users. |
Send message Joined: 1 Jul 15 Posts: 2 Credit: 243,560 RAC: 0 |
I've also got a never-ending result on a machine that is shut down every day. GD_jcarro_20160714202331000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1983.xml Result: 38377099 Carro-Rodriguez-Laguna-Pueyo Epicardial Model (Carro et al. 2011) for human ventricular cells v1.08 When the computer was started in the morning, the result started calculating again from around 20%. System: Windows 7. It looks like a checkpoint was lost (extract from <stderr_txt>): Line 112 : LoadFromCP: it = 5956108503.000000 Line 263 : LoadFromCP: it = 6800888101.000000 Line 805 : LoadFromCP: it = 0.000000 Line 1217: LoadFromCP: it = 2471613195.000000 Matthias |
Send message Joined: 6 Nov 15 Posts: 16 Credit: 325,209 RAC: 0 |
Last night, I decided to check whether beta 1.03 could recover from checkpoints, so I shut down BOINC and then did a Windows restart on one of my computers. The task started up again, could not find the checkpoint file, and restarted from the beginning. I do not expect it to finish by the deadline. It looks like beta 1.03 does NOT fix the endless restarts problem. A few lines from stderr.txt showing this: Doing CP It:3396376837.000000 Doing CP It:3402929064.00000000:35:28 (2660): called boinc_finish(0) MName:CRLP2011_EPI MID:6 At least this time it created a file named temp, AFTER it restarted. Task 38385075 |
Send message Joined: 6 Nov 15 Posts: 16 Credit: 325,209 RAC: 0 |
On my other computer with a beta 1.03 task in progress, there is a temp file, but it's empty - 0 bytes. 47.338% progress, so it should have created many checkpoints by now. Task 38385601 Microsoft does not appear to have forced any Tuesday night Windows 10 updates requiring a Windows restart this week, so it may still complete without trying to resume from a checkpoint. A suggested feature for the next beta: For the first few checkpoints, report whether the checkpoint file even exists before trying to write the checkpoint, and if so, how many bytes it contains. You could also have the first few checkpoints report whether they were successful in writing to the checkpoint file. |
Send message Joined: 16 Mar 15 Posts: 219 Credit: 14,859 RAC: 0 |
Hi! We are thinking about different ways to make checkpoints reliable. As you have said, a second checkpoint file as a backup and a check of the file size are among the options on the table. Writing into disk is one of the most "expensive" things in the execution of the simulation, this is why at the beginning we have tried to simplify this part. But with those ultra-long WUs it could be necessary. Best regards, Joel. |
Send message Joined: 6 Nov 15 Posts: 16 Credit: 325,209 RAC: 0 |
The beta 1.03 task under Windows 10 also started the infinite reruns. Doing CP It:3980371292.000000 Doing CP It:3988241479.00000010:58:46 (8512): called boinc_finish(0) MName:CRLP2011_EPI MID:6 Even though the temp file was previously present and not empty, it could not find it after this restart. Your next beta version appears to need to be able to tell whether the checkpoint file is not present, present but empty, present and not empty but with the contents not usable, and some other error in reading it. The beta 1.03 task under Windows Vista started another rerun. When should I shut this down by aborting it? The temp file is still present but only 0 bytes. Do you have a way to make your next batch of Windows tasks use a much smaller number of iterations, so they can be aimed at testing whether the Windows version of the application shuts down properly? Also, they should send any remaining temp file back to you, so you can check if it's even in the right format. I believe I've read that some compilers allow creating files in such a way that they will automatically be deleted when the application that created them ends. You may need to inspect the Windows application to see if it does this. Also, does the Windows application close the checkpoint file after it finishes writing a checkpoint? If not, I'd expect the checkpoint file to be more likely to be lost, since it might be smaller than the hard drive block size, and therefore still in cache but not yet written to the hard drive. |
Send message Joined: 9 Apr 15 Posts: 172 Credit: 1,552,856 RAC: 0 |
Writing into disk is one of the most "expensive" things in the execution of the simulation, this is why at the beginning we have tried to simplify this part. But with those ultra-long WUs it could be necessary. I will buy an SSD! |
Send message Joined: 6 Nov 15 Posts: 16 Credit: 325,209 RAC: 0 |
Writing into disk is one of the most "expensive" things in the execution of the simulation, this is why at the beginning we have tried to simplify this part. But with those ultra-long WUs it could be necessary. I already have an SSD in each of my desktops. However, the one for my Windows Vista computer does not have a Windows Vista driver available. You may want to check for a similar problem before you buy any. |
Send message Joined: 9 Apr 15 Posts: 172 Credit: 1,552,856 RAC: 0 |
I will buy an SSD! My PCs run Win10, and I will use Acronis to clone the disk to the SSD. |
Send message Joined: 23 Dec 15 Posts: 6 Credit: 105,476 RAC: 0 |
In comparison with the L1/L2/L3-Caches of modern CPUs, writing to an HDD or SSD will still be "expensive". |
Send message Joined: 6 Nov 15 Posts: 16 Credit: 325,209 RAC: 0 |
Another restart for beta 1.03 under Windows 10. Doing CP It:3986402711.00000016:47:06 (6160): called boinc_finish(0) MName:CRLP2011_EPI MID:6 I'm about to abort this task. |
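
The checkpoint suggestions made earlier in the thread (two checkpoint files per task, a marker at the beginning that must match one at the end, and telling apart a missing, empty, or unusable file) can be sketched roughly as follows. This is only an illustration in Python, not the project's actual code: the file names `checkpoint_0.bin` and `checkpoint_1.bin`, the `CKPT` magic value, the record layout, and every function name are assumptions made up for the example. The iteration count is stored as a 64-bit value, since the counts reported in this thread exceed the 32-bit range.

```python
import os
import struct
import tempfile
import zlib
from enum import Enum

MAGIC = b"CKPT"                  # hypothetical marker, not the project's format
HEADER = struct.Struct("<4sQQ")  # magic, sequence number, 64-bit iteration count
FOOTER = struct.Struct("<QI")    # sequence number repeated, CRC32 of the header


class State(Enum):
    MISSING = "missing"
    EMPTY = "present but empty"
    CORRUPT = "present but unusable"
    OK = "ok"


def write_checkpoint(path, seq, iteration):
    """Write to a temporary file, flush it to disk, then rename over the target."""
    header = HEADER.pack(MAGIC, seq, iteration)
    footer = FOOTER.pack(seq, zlib.crc32(header))
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(header + footer)
        f.flush()
        os.fsync(f.fileno())     # ask the OS to push the data out of its cache
    os.replace(tmp, path)        # the old checkpoint stays intact until this point


def read_checkpoint(path):
    """Return (state, seq, iteration); seq and iteration are None unless OK."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return State.MISSING, None, None
    if not data:
        return State.EMPTY, None, None
    if len(data) != HEADER.size + FOOTER.size:
        return State.CORRUPT, None, None
    magic, seq, iteration = HEADER.unpack_from(data)
    end_seq, crc = FOOTER.unpack_from(data, HEADER.size)
    if magic != MAGIC or seq != end_seq or crc != zlib.crc32(data[:HEADER.size]):
        return State.CORRUPT, None, None
    return State.OK, seq, iteration


def save(slot_dir, seq, iteration):
    """Alternate between two files so one complete checkpoint always survives."""
    write_checkpoint(os.path.join(slot_dir, f"checkpoint_{seq % 2}.bin"), seq, iteration)


def load_latest(slot_dir):
    """Return (seq, iteration) of the newest valid checkpoint, or None."""
    best = None
    for i in (0, 1):
        path = os.path.join(slot_dir, f"checkpoint_{i}.bin")
        state, seq, iteration = read_checkpoint(path)
        if state is State.OK and (best is None or seq > best[0]):
            best = (seq, iteration)
    return best


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as slot:
        save(slot, 1, 5956108503)
        save(slot, 2, 6800888101)
        print(load_latest(slot))
```

On restart, an application built this way could log the `State` of each file before choosing one, which would give the kind of diagnostic output asked for above. Writing to a temporary file and renaming it is a swapped-in technique for the example; whether it would suit the real application on Windows is something the developers would have to check.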