Weird issue with "Ghost work units" on one host

TPCBF

Joined: 11 Oct 23
Posts: 24
Credit: 3,033,494
RAC: 10,676
Message 2228 - Posted: 27 Nov 2023, 18:15:18 UTC

Just noticed this morning (Pacific Time) that the web site shows almost 40 WUs, sent out on 11/24 23:04 UTC for this particular host (Windows 10), as "In progress". However, no such DENIS WUs are actually showing up in the Tasks list of the BOINC client (yes, it is set to show "all" tasks, and it shows tasks from another BOINC project).

Any ideas what is going on?
TPCBF

Joined: 11 Oct 23
Posts: 24
Credit: 3,033,494
RAC: 10,676
Message 2229 - Posted: 28 Nov 2023, 3:09:19 UTC - in response to Message 2228.  

Just noticed this morning (Pacific Time) that the web site shows almost 40 WUs, sent out on 11/24 23:04 UTC for this particular host (Windows 10), as "In progress". However, no such DENIS WUs are actually showing up in the Tasks list of the BOINC client (yes, it is set to show "all" tasks, and it shows tasks from another BOINC project).

Any ideas what is going on?

OK, EOD here in LA, 03:07 UTC, and now all 38 of those WUs on this host have been removed and labeled as error "no reply", even though they never showed up on this machine...

:(
Paul

Joined: 8 Jul 22
Posts: 7
Credit: 964,279
RAC: 0
Message 2230 - Posted: 28 Nov 2023, 15:25:38 UTC - in response to Message 2228.  

23:00 UTC is about the time they shut down each day for a while so likely an artifact of that.
ISTR uploads complete during downtime but reports/requests get 3600 second backoff.
Paul.
TPCBF

Joined: 11 Oct 23
Posts: 24
Credit: 3,033,494
RAC: 10,676
Message 2231 - Posted: 28 Nov 2023, 15:27:53 UTC - in response to Message 2230.  

23:00 UTC is about the time they shut down each day for a while so likely an artifact of that.
ISTR uploads complete during downtime but reports/requests get 3600 second backoff.

Sorry Paul, but I don't see how this is related to the actual problem of those ghost WUs?

Ralf
Paul

Joined: 8 Jul 22
Posts: 7
Credit: 964,279
RAC: 0
Message 2232 - Posted: 28 Nov 2023, 18:50:31 UTC - in response to Message 2231.  

There could be a timing issue during shutdown or restart of the BOINC services, which means that the server thinks units were sent when they were not.
I see similar issues often on ithena measurements as disk often fills up.
Paul.
mmonnin

Joined: 18 Nov 16
Posts: 14
Credit: 10,059,426
RAC: 162
Message 2233 - Posted: 28 Nov 2023, 21:26:14 UTC

I also saw this several days ago where the site showed tasks on a client that no longer had any tasks.
Today I do not see any ghost tasks.
TPCBF

Joined: 11 Oct 23
Posts: 24
Credit: 3,033,494
RAC: 10,676
Message 2234 - Posted: 28 Nov 2023, 23:31:27 UTC - in response to Message 2233.  

Well, the bottom line is that this seems to be something that needs to be fixed on the server side. The possible problem that I see is that the hosts (through no fault of their own) are at some point going to be considered "unreliable" by the project... :(

Ralf
Vato

Joined: 4 Aug 20
Posts: 6
Credit: 1,854,566
RAC: 1,635
Message 2235 - Posted: 6 Dec 2023, 15:35:18 UTC - in response to Message 2234.  

this is an issue with BOINC RPC protocol
if certain packets get lost in transit, the server has sent the tasks to the client and records that fact, the client never got them so they don't show up, and the protocol issue is that there is no acknowledgement to cope with the data loss
i believe there is a server option to resend lost tasks, but that it has been known to be awkward, so it isn't widely used
the "workaround" is just to wait until the server expires them and resends the tasks
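
To make the failure mode described above concrete, here is a minimal toy simulation in Python. It is only a sketch: `ToyScheduler`, `ToyClient`, `request_work`, and `lost_tasks` are hypothetical names invented for illustration, not BOINC's actual code or API. The assumption is simply that the server records tasks as sent before it knows whether the reply reached the client.

```python
class ToyScheduler:
    """Hypothetical stand-in for a project scheduler (not BOINC code)."""

    def __init__(self):
        self.in_progress = {}  # host_id -> task names the server believes were sent
        self._next_id = 0

    def request_work(self, host_id, count):
        names = set()
        for _ in range(count):
            names.add(f"wu_{self._next_id}")
            self._next_id += 1
        # The assignment is recorded before the reply is known to have arrived.
        self.in_progress.setdefault(host_id, set()).update(names)
        return names

    def lost_tasks(self, host_id, reported):
        """Tasks recorded as sent that the client does not report having."""
        return self.in_progress.get(host_id, set()) - set(reported)


class ToyClient:
    """Hypothetical stand-in for a client host."""

    def __init__(self, host_id):
        self.host_id = host_id
        self.tasks = set()

    def fetch(self, scheduler, count, reply_lost=False):
        reply = scheduler.request_work(self.host_id, count)
        if not reply_lost:  # a lost reply never reaches the client
            self.tasks |= reply


if __name__ == "__main__":
    scheduler = ToyScheduler()
    client = ToyClient(host_id=1)
    client.fetch(scheduler, 38, reply_lost=True)
    ghosts = scheduler.lost_tasks(client.host_id, client.tasks)
    print(len(ghosts), "ghost tasks on the server,", len(client.tasks), "on the client")
```

In this toy model, a resend option would amount to the server comparing its records against what the client reports and handing the difference back, instead of waiting for the deadline to expire.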
mmonnin

Joined: 18 Nov 16
Posts: 14
Credit: 10,059,426
RAC: 162
Message 2236 - Posted: 7 Dec 2023, 21:39:29 UTC

I wouldn't be surprised if that was really triggered by this site being inaccessible at times or the DB being unavailable.
TPCBF

Joined: 11 Oct 23
Posts: 24
Credit: 3,033,494
RAC: 10,676
Message 2238 - Posted: 12 Dec 2023, 16:54:09 UTC - in response to Message 2236.  

I don't think that this was related to the time of the DB maintenance/stats update. It has happened only once so far, off by a couple of hours from that, and I haven't noticed anything similar on any other BOINC project in the roughly 15 years that I have been crunching on several other projects. It might have been a problem 10+ years ago on Rosetta, and possibly one of the reasons why I switched away from it 13 years ago, mainly to WCG (which for 2+ years now has been a completely different can of worms :( ).
I am fairly new to DENIS, about two months now, as an alternative to WCG, and I am still trying to get used to the project-specific peculiarities here. So I am not sure if something like those ghost WUs is a more common issue.
We will see if it happens again, so far everything works rather smoothly when a new wave of WUs is available...

Ralf
cuphi

Joined: 9 Aug 22
Posts: 10
Credit: 1,854,604
RAC: 14,220
Message 2257 - Posted: 15 Feb 2024, 23:39:46 UTC - in response to Message 2235.  

this is an issue with BOINC RPC protocol
if certain packets get lost in transit, the server has sent the tasks to the client and records that fact, the client never got them so they don't show up, and the protocol issue is that there is no acknowledgement to cope with the data loss
i believe there is a server option to resend lost tasks, but that it has been known to be awkward, so it isn't widely used
the "workaround" is just to wait until the server expires them and resends the tasks



This sounds plausible. Remote Procedure Calls (RPC) often use the User Datagram Protocol (UDP), which sends packets without checking whether they get to where they should go. It's a well-known tradeoff in the IP portion of TCP/IP.
TPCBF

Joined: 11 Oct 23
Posts: 24
Credit: 3,033,494
RAC: 10,676
Message 2258 - Posted: 16 Feb 2024, 5:20:26 UTC - in response to Message 2257.  

this is an issue with BOINC RPC protocol
if certain packets get lost in transit, the server has sent the tasks to the client and records that fact, the client never got them so they don't show up, and the protocol issue is that there is no acknowledgement to cope with the data loss
i believe there is a server option to resend lost tasks, but that it has been known to be awkward, so it isn't widely used
the "workaround" is just to wait until the server expires them and resends the tasks



This sounds plausible. Remote Procedure Calls (RPC) often use the User Datagram Protocol (UDP), which sends packets without checking whether they get to where they should go. It's a well-known tradeoff in the IP portion of TCP/IP.

It's just rather weird that this happens with only one host out of about a dozen, in the same location, with a very high-speed Internet connection.
I haven't seen it since either, though I have been too busy to keep a close eye on those machines. I prefer "fire and forget" ... ;-)

Ralf
