Very long wus

Robert Miles
Joined: 6 Nov 15
Posts: 16
Credit: 325,209
RAC: 0
Message 985 - Posted: 23 Jul 2016, 19:01:26 UTC - in response to Message 981.  

:) That one with the large negative iteration numbers ... looks like a v1.07 task, where they had problems where the iteration number was being stored in a variable that was too small (hence the large negative numbers).

You need to cancel your v1.07 tasks, and get some v1.08 tasks.


Both were 1.07; both now aborted.
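
For readers unfamiliar with the "variable that was too small" explanation quoted above, here is a minimal sketch of the effect. It assumes the counter was a signed 32-bit integer, which the thread does not confirm, and the iteration count is a made-up value.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def as_int32(value: int) -> int:
    """Return value as it would read back from a signed 32-bit variable."""
    return ctypes.c_int32(value).value


iterations = 3_000_000_000  # hypothetical count, larger than INT32_MAX
stored = as_int32(iterations)
print(iterations > INT32_MAX, stored)
```

Any count above INT32_MAX wraps around to a negative number, which matches the symptom described. A 64-bit integer or a double avoids the wraparound at these magnitudes.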

stevenluet
Joined: 24 Apr 15
Posts: 1
Credit: 281,700
RAC: 0
Message 993 - Posted: 25 Jul 2016, 8:26:46 UTC

I have a bunch of V1.03 with 1d17h and 81%.

Col323
Joined: 5 Oct 15
Posts: 17
Credit: 1,335,501
RAC: 0
Message 994 - Posted: 25 Jul 2016, 11:34:54 UTC

BETA_23071039_8000_0173_0 looped on me (on the aforementioned Win 7 laptop), so I aborted it. This machine is also running two other Beta units. One has also looped, but I'm going to see if the second loop finishes correctly. The other unit has not looped, and I'm hoping that somehow it finishes gracefully.

sir spuddly buddly
Joined: 30 Aug 15
Posts: 47
Credit: 1,248,591
RAC: 0
Message 995 - Posted: 25 Jul 2016, 11:59:27 UTC

I have 2 machines, one with Fedora and one with Windows 7.
The Fedora machine has no problems with rebooting; the Windows machine restarts from 0 after a reboot (with Beta work).
LZ Loon
The Most Handsome Man on the Interweb (TM)

Thyme Lawn
Joined: 24 Oct 15
Posts: 16
Credit: 595,047
RAC: 0
Message 996 - Posted: 25 Jul 2016, 15:38:50 UTC
Last modified: 25 Jul 2016, 15:40:06 UTC

I used SysInternals Process Monitor to check what the beta application does in the slot directory during a checkpoint. The sequence is as follows:

  1. File "denis_state" is not found.
  2. Creates the file "temp" with "Desired Access: Generic Write, Read Attributes" and "ShareMode: Read, Write". Successful with "OpenResult: Overwritten". Does this twice.
  3. Writes the "CP" line to "stderr.txt".
  4. Writes 1280 bytes to "temp" then successfully closes it once.
  5. Attempts to create "temp" with "Desired Access: Read Attributes, Delete, Synchronize", "Disposition: Open", "ShareMode: Read, Write, Delete". This causes a sharing violation (repeated on 5 further attempts) because the file was opened twice with "ShareMode: Read, Write", only closed once and the new open attempt is trying to add "Delete" to "ShareMode".


The obvious solution to this is to remove the spurious open of "temp".
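
The fix suggested above can be sketched as follows. This is only an illustration in Python, not the project's code. The file names "temp" and "denis_state" come from the trace, but the assumption that "temp" is meant to be renamed to "denis_state" is mine.

```python
import os


def write_checkpoint(state: bytes, final_name: str = "denis_state",
                     temp_name: str = "temp") -> None:
    """Write state to a temporary file, then move it into place."""
    # Open the temporary file exactly once; the with-block closes it.
    with open(temp_name, "wb") as f:
        f.write(state)
    # This process holds no open handle to temp_name any more, so the
    # rename is not blocked by a leftover handle of our own.
    os.replace(temp_name, final_name)
```

With a single open and a matching close, the later open that requests Delete sharing has nothing left to conflict with inside the application itself.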


"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 999 - Posted: 26 Jul 2016, 12:37:16 UTC - in response to Message 996.  

I used SysInternals Process Monitor to check what the beta application does in the slot directory during a checkpoint. The sequence is as follows:

  1. File "denis_state" is not found.
  2. Creates the file "temp" with "Desired Access: Generic Write, Read Attributes" and "ShareMode: Read, Write". Successful with "OpenResult: Overwritten". Does this twice.
  3. Writes the "CP" line to "stderr.txt".
  4. Writes 1280 bytes to "temp" then successfully closes it once.
  5. Attempts to create "temp" with "Desired Access: Read Attributes, Delete, Synchronize", "Disposition: Open", "ShareMode: Read, Write, Delete". This causes a sharing violation (repeated on 5 further attempts) because the file was opened twice with "ShareMode: Read, Write", only closed once and the new open attempt is trying to add "Delete" to "ShareMode".


The obvious solution to this is to remove the spurious open of "temp".


Thanks for the info. We are trying different checkpointing strategies to avoid the problems we have. Maybe this solution is not the correct one. We are modifying it a bit more.

Best regards, Joel.

Robert Miles
Joined: 6 Nov 15
Posts: 16
Credit: 325,209
RAC: 0
Message 1000 - Posted: 26 Jul 2016, 12:42:48 UTC

Last night, I decided to check whether beta 1.03 could recover from checkpoints, so I shut down BOINC and then did a Windows restart on one of my computers. The task started up again, could not find the checkpoint file, and restarted from the beginning. I do not expect it to finish by the deadline.

Charles Dennett
Joined: 5 Jul 15
Posts: 3
Credit: 1,074,712
RAC: 0
Message 1001 - Posted: 26 Jul 2016, 14:17:28 UTC

For what it's worth, I run Linux and if I select a task in boinc manager and look at the properties, one of the things it lists is the cpu time at the last checkpoint. If that's blank, I know that it has not checkpointed.

I've also set no new work for DENIS on all my systems until I hear that these problems have been fixed.

sir spuddly buddly
Joined: 30 Aug 15
Posts: 47
Credit: 1,248,591
RAC: 0
Message 1003 - Posted: 26 Jul 2016, 19:19:59 UTC - in response to Message 1002.  
Last modified: 26 Jul 2016, 19:24:25 UTC

Seems you have to PM a mod to get a post deleted.
LZ Loon
The Most Handsome Man on the Interweb (TM)

Robert Miles
Joined: 6 Nov 15
Posts: 16
Credit: 325,209
RAC: 0
Message 1005 - Posted: 26 Jul 2016, 23:17:29 UTC
Last modified: 26 Jul 2016, 23:23:26 UTC

Some things I've seen on other BOINC projects:

1. Two separate checkpoint files per task, so that if the task is interrupted in the middle of writing a checkpoint, it can resume from the other one instead. This needs a method for determining which of the files has the more recent checkpoint, in case both contain valid checkpoints.

2. Something identifying which checkpoint is written placed at the beginning, which must match something at the end to indicate that the checkpoint is complete. Whatever is used should not match any other checkpoints from the same task.

3. Requests from the project for users to deliberately interrupt BOINC during beta tests, to check whether the checkpoints are working properly.
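
Items 1 and 2 can be combined in one scheme. The sketch below is an assumption-laden illustration, not what any particular BOINC project does: the file names, the 8-byte sequence number, and the layout (marker, payload, marker) are all hypothetical.

```python
import os
import struct

SLOTS = ("checkpoint_a", "checkpoint_b")  # hypothetical file names


def save(directory, seq, payload):
    """Write checkpoint number seq to the slot not used by seq - 1."""
    path = os.path.join(directory, SLOTS[seq % 2])
    marker = struct.pack("<Q", seq)  # 8-byte sequence number
    with open(path, "wb") as f:
        f.write(marker + payload + marker)  # same marker at start and end


def load(directory):
    """Return (seq, payload) of the newest complete checkpoint, or None."""
    best = None
    for name in SLOTS:
        try:
            with open(os.path.join(directory, name), "rb") as f:
                data = f.read()
        except OSError:
            continue  # slot missing or unreadable
        if len(data) < 16 or data[:8] != data[-8:]:
            continue  # truncated or interrupted write
        seq = struct.unpack("<Q", data[:8])[0]
        if best is None or seq > best[0]:
            best = (seq, data[8:-8])
    return best
```

Because consecutive checkpoints alternate between the two slots, an interrupted write damages at most one file, and the sequence number decides which of two valid files is newer.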

Windows 10 has at least one update most Tuesday nights, and delaying them is difficult, so you may soon see many more checkpoint resume failures from Windows 10 users.

Matthias Lehmkuhl
Joined: 1 Jul 15
Posts: 2
Credit: 243,560
RAC: 0
Message 1011 - Posted: 27 Jul 2016, 13:53:22 UTC

I've also got a never-ending result on a machine that is shut down every day.
GD_jcarro_20160714202331000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1983.xml
Result: 38377099
Carro-Rodriguez-Laguna-Pueyo Epicardial Model (Carro et al. 2011) for human ventricular cells v1.08
On starting the computer in the morning, the result started calculating again from around 20%.
System Windows 7

Looks like there was a checkpoint lost (extract from <stderr_txt>)
Line 112 : LoadFromCP: it = 5956108503.000000
Line 263 : LoadFromCP: it = 6800888101.000000
Line 805 : LoadFromCP: it = 0.000000
Line 1217: LoadFromCP: it = 2471613195.000000
Matthias
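
A lost checkpoint like the one in the extract above can be spotted mechanically. The sketch below is a hypothetical helper, not part of the application. It uses the four LoadFromCP values from this post and flags every resume point that is lower than the highest one seen so far.

```python
import re

STDERR_EXTRACT = """\
LoadFromCP: it = 5956108503.000000
LoadFromCP: it = 6800888101.000000
LoadFromCP: it = 0.000000
LoadFromCP: it = 2471613195.000000
"""

PATTERN = re.compile(r"LoadFromCP: it = (\d+(?:\.\d+)?)")


def find_regressions(text):
    """Return (highest_so_far, current) for each resume below the maximum."""
    regressions = []
    highest = None
    for match in PATTERN.finditer(text):
        current = float(match.group(1))
        if highest is not None and current < highest:
            regressions.append((highest, current))
        else:
            highest = current
    return regressions


for highest, current in find_regressions(STDERR_EXTRACT):
    print(f"resumed at {current:.0f} after reaching {highest:.0f}")
```

For this extract the helper reports the resume at 0 and the later resume at 2471613195, both below the earlier 6800888101.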

Robert Miles
Joined: 6 Nov 15
Posts: 16
Credit: 325,209
RAC: 0
Message 1019 - Posted: 28 Jul 2016, 13:07:39 UTC - in response to Message 1000.  

Last night, I decided to check whether beta 1.03 could recover from checkpoints, so I shut down BOINC and then did a Windows restart on one of my computers. The task started up again, could not find the checkpoint file, and restarted from the beginning. I do not expect it to finish by the deadline.


It looks like beta 1.03 does NOT fix the endless restarts problem.

A few lines from stderr.txt showing this:

Doing CP It:3396376837.000000
Doing CP It:3402929064.00000000:35:28 (2660): called boinc_finish(0)
MName:CRLP2011_EPI
MID:6

At least this time it created a file named temp, AFTER it restarted.

Task 38385075

Robert Miles
Joined: 6 Nov 15
Posts: 16
Credit: 325,209
RAC: 0
Message 1026 - Posted: 29 Jul 2016, 0:13:25 UTC
Last modified: 29 Jul 2016, 0:24:45 UTC

On my other computer with a beta 1.03 task in progress, there is a temp file, but it's empty - 0 bytes.

47.338% progress, so it should have created many checkpoints by now.

Task 38385601

Microsoft does not appear to have forced any Tuesday night Windows 10 updates requiring a Windows restart this week, so it may still complete without trying to resume from a checkpoint.

A suggested feature for the next beta: For the first few checkpoints, report whether the checkpoint file even exists before trying to write the checkpoint, and if so, how many bytes it contains. You could also have the first few checkpoints report whether they were successful in writing to the checkpoint file.
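
A minimal sketch of the suggested reporting, written in Python for illustration only. The function names and the log wording are hypothetical, and the real application would write to stderr.txt rather than call print.

```python
import os


def file_report(path):
    """Describe whether path exists and how many bytes it holds."""
    if not os.path.exists(path):
        return f"{path}: missing"
    return f"{path}: {os.path.getsize(path)} bytes"


def checkpoint_with_report(path, data, log=print):
    """Write data to path, logging the file state before and after."""
    log("before write -> " + file_report(path))
    with open(path, "wb") as f:
        written = f.write(data)
    # Compare what was requested with what ended up on disk.
    ok = written == len(data) and os.path.getsize(path) == len(data)
    log("after write  -> " + file_report(path) + (" (ok)" if ok else " (FAILED)"))
    return ok
```

Limiting this to the first few checkpoints, as proposed, keeps stderr.txt small while still showing whether the file is ever created and filled.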

jcastro
Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 1028 - Posted: 29 Jul 2016, 11:15:47 UTC - in response to Message 1026.  
Last modified: 29 Jul 2016, 11:15:59 UTC

Hi!

We are thinking about different ways to ensure checkpoints work. As you have said, a double checkpoint file as a backup and checking the size of the file are among the options on the table.

Writing to disk is one of the most "expensive" things in the execution of the simulation; this is why we tried to simplify this part at the beginning. But with those ultra-long WUs it could be necessary.
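
One common way to keep the cost of disk writes down is to limit how often a checkpoint is written. The sketch below is a generic illustration with hypothetical names. Native BOINC applications can instead ask the client through boinc_time_to_checkpoint(), which this sketch does not use.

```python
import time


class CheckpointThrottle:
    """Allow a checkpoint only after min_interval_s seconds have passed."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock          # injectable so it can be tested
        self.last = clock()         # time of the last allowed checkpoint

    def due(self):
        """Return True, and restart the timer, when a checkpoint is due."""
        now = self.clock()
        if now - self.last >= self.min_interval_s:
            self.last = now
            return True
        return False
```

The simulation loop would call due() every iteration and write to disk only when it returns True, so a more careful (and slower) checkpoint routine costs little overall.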

Best regards, Joel.

Robert Miles
Joined: 6 Nov 15
Posts: 16
Credit: 325,209
RAC: 0
Message 1031 - Posted: 29 Jul 2016, 17:03:39 UTC

The beta 1.03 task under Windows 10 also started the infinite reruns.

Doing CP It:3980371292.000000
Doing CP It:3988241479.00000010:58:46 (8512): called boinc_finish(0)
MName:CRLP2011_EPI
MID:6

Even though the temp file was previously present and not empty, it could not find it after this restart.

Your next beta version appears to need to be able to tell whether the checkpoint file is not present, present but empty, present and not empty but with unusable contents, or unreadable because of some other error.
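
The distinction asked for here could look like the following sketch. It is illustrative only: the status names are hypothetical, and the validity check is passed in by the caller because the real checkpoint format is not known from this thread.

```python
import enum


class CheckpointStatus(enum.Enum):
    MISSING = "not present"
    EMPTY = "present but empty"
    UNUSABLE = "present, not empty, contents not usable"
    READ_ERROR = "other error while reading"
    OK = "usable"


def classify(path, is_valid):
    """Classify the checkpoint file at path using the caller's is_valid test."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return CheckpointStatus.MISSING
    except OSError:
        return CheckpointStatus.READ_ERROR  # e.g. permissions, sharing
    if not data:
        return CheckpointStatus.EMPTY
    return CheckpointStatus.OK if is_valid(data) else CheckpointStatus.UNUSABLE
```

Writing the returned status to stderr.txt on every resume would show directly which of the cases a restart from 0 falls into.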

The beta 1.03 task under Windows Vista started another rerun. When should I shut this down by aborting it? The temp file is still present but only 0 bytes.

Do you have a way to make your next batch of Windows tasks use a much smaller number of iterations, so they can be aimed at testing whether the Windows version of the application shuts down properly? Also, they should send any remaining temp file back to you, so you can check if it's even in the right format.

I believe I've read that some compilers allow creating files in such a way that they will automatically be deleted when the application that created them ends. You may need to inspect the Windows application to see if it does this.
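
The delete-on-close behaviour mentioned above does exist in some runtimes. As a small demonstration, Python's tempfile module behaves this way by default. This says nothing about how the DENIS application itself opens its files.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as f:  # delete=True is the default
    f.write(b"checkpoint data")
    f.flush()
    name = f.name
    existed_while_open = os.path.exists(name)

# Leaving the with-block closed the file, which also removed it.
gone_after_close = not os.path.exists(name)
print(existed_while_open, gone_after_close)
```

A checkpoint file created this way would vanish whenever the application exits, so it is worth ruling out.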

Also, does the Windows application close the checkpoint file after it finishes writing a checkpoint? If not, I'd expect the checkpoint file to be more likely to be lost, since it might be smaller than the hard drive block size, and therefore still in cache but not yet written to the hard drive.
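
A sketch of what writing, flushing, and closing a checkpoint looks like, in Python for illustration. The function name is hypothetical. A C or C++ application would use the equivalent calls of its own runtime.

```python
import os


def write_durably(path, data):
    """Write data to path and ask the OS to commit it before closing."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push the runtime's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to commit its cache to the disk
    # Leaving the with-block closes the file.
```

Without the flush and close, a small file can sit entirely in a buffer and appear as 0 bytes on disk, which would be consistent with the empty temp files reported in this thread.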

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 173
Credit: 1,552,856
RAC: 0
Message 1032 - Posted: 29 Jul 2016, 21:46:38 UTC - in response to Message 1028.  

Writing to disk is one of the most "expensive" things in the execution of the simulation; this is why we tried to simplify this part at the beginning. But with those ultra-long WUs it could be necessary.


I will buy an SSD!

Robert Miles
Joined: 6 Nov 15
Posts: 16
Credit: 325,209
RAC: 0
Message 1033 - Posted: 30 Jul 2016, 2:44:56 UTC - in response to Message 1032.  
Last modified: 30 Jul 2016, 2:46:24 UTC

Writing to disk is one of the most "expensive" things in the execution of the simulation; this is why we tried to simplify this part at the beginning. But with those ultra-long WUs it could be necessary.


I will buy an SSD!


I already have an SSD in each of my desktops. However, the one for my Windows Vista computer does not have a Windows Vista driver available. You may want to check for a similar problem before you buy any.

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 173
Credit: 1,552,856
RAC: 0
Message 1036 - Posted: 30 Jul 2016, 16:28:10 UTC - in response to Message 1033.  

I will buy an SSD!


I already have an SSD in each of my desktops. However, the one for my Windows Vista computer does not have a Windows Vista driver available. You may want to check for a similar problem before you buy any.


My PCs are Win10,
and I will use Acronis to clone the disk to the SSD....

Dr. Merkwürdigliebe
Joined: 23 Dec 15
Posts: 6
Credit: 105,476
RAC: 0
Message 1037 - Posted: 30 Jul 2016, 17:07:07 UTC

In comparison with the L1/L2/L3-Caches of modern CPUs, writing to an HDD or SSD will still be "expensive".

Robert Miles
Joined: 6 Nov 15
Posts: 16
Credit: 325,209
RAC: 0
Message 1039 - Posted: 30 Jul 2016, 22:35:09 UTC

Another restart for beta 1.03 under Windows 10.

Doing CP It:3986402711.00000016:47:06 (6160): called boinc_finish(0)
MName:CRLP2011_EPI
MID:6

I'm about to abort this task.