Weird issue with "Ghost work units" on one host
Author | Message |
---|---|
Send message Joined: 11 Oct 23 Posts: 25 Credit: 3,978,292 RAC: 0 |
Just noticed this morning (Pacific Time) that the web site shows almost 40 WUs, sent out on 11/24 at 23:04 UTC, as "In progress" for this particular host (Windows 10). However, no such DENIS WUs are actually showing up in the Tasks list of the BOINC client (yes, it is set to show "all" tasks, and it shows tasks from another BOINC project). Any ideas what is going on? |
Send message Joined: 11 Oct 23 Posts: 25 Credit: 3,978,292 RAC: 0 |
Ok, EOD here in LA, 03:07 UTC, and now all those 38 WUs on this host have been removed and labeled as error "no reply", even though they never showed up on this machine.... :( |
Send message Joined: 8 Jul 22 Posts: 9 Credit: 964,279 RAC: 0 |
23:00 UTC is about the time they shut down each day for a while so likely an artifact of that. ISTR uploads complete during downtime but reports/requests get 3600 second backoff. Paul. |
Send message Joined: 11 Oct 23 Posts: 25 Credit: 3,978,292 RAC: 0 |
"23:00 UTC is about the time they shut down each day for a while so likely an artifact of that." Sorry Paul, but I don't see how this is related to the actual problem of those ghost WUs? Ralf |
Send message Joined: 8 Jul 22 Posts: 9 Credit: 964,279 RAC: 0 |
There could be a timing issue during shutdown or restart of the BOINC services, so that the server thinks units were sent when they were not. I often see similar issues on ithena measurements, as the disk often fills up. Paul. |
Send message Joined: 18 Nov 16 Posts: 16 Credit: 11,221,467 RAC: 0 |
I also saw this several days ago where the site showed tasks on a client that no longer had any tasks. Today I do not see any ghost tasks. |
Send message Joined: 11 Oct 23 Posts: 25 Credit: 3,978,292 RAC: 0 |
Well, the bottom line is that this seems to be something that needs to be fixed on the server side. The possible problem that I see is that the hosts (through no fault of their own) are at some point going to be considered "unreliable" for the project... :( Ralf |
Send message Joined: 4 Aug 20 Posts: 6 Credit: 1,974,541 RAC: 0 |
this is an issue with the BOINC RPC protocol if certain packets get lost in transit. the server has sent the tasks to the client and records that fact, but the client never got them, so they don't show up. the protocol issue is that there is no acknowledgement to cope with the data loss. i believe there is a server option to resend lost tasks, but it has been known to be awkward, so it isn't widely used. the "workaround" is just to wait until the server expires them and resends the tasks. |
Send message Joined: 18 Nov 16 Posts: 16 Credit: 11,221,467 RAC: 0 |
I wouldn't be surprised if that was really triggered by this site being inaccessible at times or the DB being unavailable. |
Send message Joined: 11 Oct 23 Posts: 25 Credit: 3,978,292 RAC: 0 |
Don't think that this was related to the time of the DB maintenance/stats update. It has happened only once so far, off by a couple of hours from that, and I haven't noticed anything similar on any other BOINC project in the roughly 15 years that I have been crunching on several other projects. It might have been a problem 10+ years ago on Rosetta, and possibly one of the reasons why I switched away from it 13 years ago, mainly to WCG (which for 2+ years now has been a completely different can of worms :( ). I am fairly new to DENIS, about two months now, as an alternative to WCG, and am still trying to get used to the project-specific peculiarities here. So I'm not sure whether something like those ghost WUs is a more common issue. We will see if it happens again; so far everything works rather smoothly when a new wave of WUs is available... Ralf |
Send message Joined: 9 Aug 22 Posts: 10 Credit: 2,219,354 RAC: 0 |
"this is an issue with BOINC RPC protocol" This sounds plausible. Remote Procedure Calls (RPC) often use User Datagram Protocol (UDP), which broadcasts packets and doesn't care if they get to where they should go. It's a well-known tradeoff in the IP portion of TCP/IP. |
Send message Joined: 11 Oct 23 Posts: 25 Credit: 3,978,292 RAC: 0 |
"this is an issue with BOINC RPC protocol" Just rather weird that this happens with only one host out of about a dozen, in the same location, with a very high-speed Internet connection. Haven't seen it since either, though I have been too busy to keep a close eye on those machines. I prefer "fire and forget" ... ;-) Ralf |
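
For anyone who wants to see the lost-reply explanation from earlier in the thread in concrete terms, here is a minimal toy model in Python. It is only a sketch under assumptions: the `Server` and `Client` classes, the `resend_lost` flag and the `ghosts` helper are all made up for illustration and are not BOINC's actual code or API. It shows how a server that records tasks as sent before the client confirms receipt ends up with "ghost" tasks when a reply is lost, and how comparing against the client's own task list on the next request could recover them.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy scheduler: marks a task 'in progress' as soon as it is assigned."""
    in_progress: dict = field(default_factory=dict)  # host name -> set of task names
    resend_lost: bool = False  # hypothetical stand-in for a "resend lost tasks" option

    def handle_request(self, host, client_tasks, new_tasks):
        assigned = self.in_progress.setdefault(host, set())
        reply = set()
        if self.resend_lost:
            # Tasks the server believes the host has, but the client did not list.
            reply |= assigned - set(client_tasks)
        # The server records the assignment without waiting for any acknowledgement.
        assigned |= set(new_tasks)
        reply |= set(new_tasks)
        return reply


@dataclass
class Client:
    host: str
    tasks: set = field(default_factory=set)

    def request_work(self, server, new_tasks, reply_lost=False):
        reply = server.handle_request(self.host, self.tasks, new_tasks)
        if not reply_lost:
            # Only a reply that actually arrives adds tasks on the client.
            self.tasks |= reply


def ghosts(server, client):
    """Tasks the server lists as in progress that the client never received."""
    return server.in_progress.get(client.host, set()) - client.tasks


if __name__ == "__main__":
    server = Server()
    client = Client("host-1")
    client.request_work(server, ["wu_1", "wu_2"], reply_lost=True)
    print("ghosts after lost reply:", sorted(ghosts(server, client)))
    server.resend_lost = True
    client.request_work(server, [])
    print("ghosts after resend:", sorted(ghosts(server, client)))
```

With the flag off, the only way out in this model is the one already mentioned above: wait for the server to expire the tasks and send them to someone else.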