Very long wus

Message boards : Number crunching : Very long wus

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 108
Credit: 403,294
RAC: 0
Message 908 - Posted: 15 Jul 2016, 19:27:42 UTC

On my mobile i5-5200, the WUs are at 7% after 2 hours.
Is that normal?

Col323
Joined: 5 Oct 15
Posts: 17
Credit: 1,335,501
RAC: 0
Message 909 - Posted: 15 Jul 2016, 19:46:29 UTC

You're not alone, at least. On my i5-5300 (mobile), currently being heavily used, the WU known as GD_jcarro_20160714202312000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1513.xml_0 is at 4.55% after 1 hour 30 minutes.

This laptop has also crunched through 4 other WUs today, and averaged about 2 hours 30 minutes for each. Those were all "SteadyState6XX" units.
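
For a rough sense of scale, here is a small sketch that extrapolates the two progress reports above to a full run time. It assumes the progress percentage grows linearly with run time, which may well not hold for these units; the function name is just made up for illustration.

```python
def estimated_total_hours(elapsed_hours, percent_done):
    """Linear extrapolation: total time = elapsed time / fraction done."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return elapsed_hours / (percent_done / 100.0)


# Figures reported in this thread: (elapsed hours, percent complete)
reports = {
    "i5-5200 mobile": (2.0, 7.0),
    "i5-5300 mobile": (1.5, 4.55),
}

for host, (elapsed, percent) in reports.items():
    total = estimated_total_hours(elapsed, percent)
    print(f"{host}: about {total:.1f} h in total if progress is linear")
```

Under that assumption both laptops land somewhere above a full day per unit, compared with about 2 hours 30 minutes for the "SteadyState6XX" units.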

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 108
Credit: 403,294
RAC: 0
Message 910 - Posted: 15 Jul 2016, 21:13:07 UTC

And there seems to be no checkpointing... :-(

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 108
Credit: 403,294
RAC: 0
Message 911 - Posted: 15 Jul 2016, 21:14:32 UTC - in response to Message 909.

GD_jcarro_20160714202312000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1513.xml_0 is at 4.55% after 1 hour 30 minutes.

This laptop has also crunched through 4 other WUs today, and averaged about 2 hours 30 minutes for each. Those were all "SteadyState6XX" units.


Same here with the 8000 version.

Dayle Diamond
Joined: 28 Apr 15
Posts: 12
Credit: 274,008
RAC: 0
Message 912 - Posted: 16 Jul 2016, 3:29:06 UTC

On my 4930K, I've got a bunch of WUs, some in their 10th hour and only a third complete. All SteadyState8000.

rjs5
Joined: 3 Nov 15
Posts: 18
Credit: 473,952
RAC: 0
Message 913 - Posted: 16 Jul 2016, 3:52:19 UTC - in response to Message 911.

GD_jcarro_20160714202312000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1513.xml_0 is at 4.55% after 1 hour 30 minutes.

This laptop has also crunched through 4 other WUs today, and averaged about 2 hours 30 minutes for each. Those were all "SteadyState6XX" units.


Same here with the 8000 version.



8-)

Once they are convinced that the algorithm is stable, they should be able to double or triple the performance.

The Windows code I looked at is using a ton of x87 FP math (ugh). They should at least be explicitly enabling the compiler's SSE2 options.

The jobs are spending 60% of the execution time in the "exp" function, and the bulk of that time is spent on the "F2XM1" instruction, which replaces the contents of "st0" with 2^st0 - 1 ... microcode cycles like crazy.
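
To show where an instruction like that fits in, here is a rough Python sketch of the textbook x87-style route to exp: rescale the exponent to base 2, split it into an integer and a fraction, compute 2^fraction - 1, then scale by 2^integer. This is only an illustration of the general technique, not the DENIS code or any particular runtime library's exp; the function name is made up, and math.expm1 merely stands in for the hardware instruction.

```python
import math


def exp_via_pow2(x):
    """Compute e**x as 2**(x * log2(e)), split into integer and fractional parts."""
    t = x * math.log2(math.e)   # rescale to a base-2 exponent
    n = round(t)                # integer part of the exponent
    f = t - n                   # fractional part, |f| <= 0.5
    # Stand-in for the x87 step that turns f into 2**f - 1
    # (the hardware instruction only accepts inputs between -1 and +1).
    two_f_minus_1 = math.expm1(f * math.log(2.0))
    # Multiply by 2**n, which is what a scale-by-power-of-two step does.
    return math.ldexp(1.0 + two_f_minus_1, n)


for x in (-5.0, 0.0, 0.5, 1.0, 10.0):
    print(x, exp_via_pow2(x), math.exp(x))
```

The split matters because the 2^f - 1 step is the expensive, microcoded part, so any code that calls exp in a tight loop pays for it on every call.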

I doubt that DENIS will make the source code available this time.

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 108
Credit: 403,294
RAC: 0
Message 915 - Posted: 16 Jul 2016, 7:30:38 UTC - in response to Message 913.

I doubt that DENIS will make the source code available this time.


? Why?

Dayle Diamond
Joined: 28 Apr 15
Posts: 12
Credit: 274,008
RAC: 0
Message 917 - Posted: 16 Jul 2016, 14:06:30 UTC - in response to Message 915.

Okay, mine are averaging 16 hours each at this point and haven't finished. There's another thread showing somebody getting 'CPU time exceeded' with 24 hours of work.

Admins, do I cancel or what?

At this point, 183 CPU hours of work are up for grabs.

rjs5
Joined: 3 Nov 15
Posts: 18
Credit: 473,952
RAC: 0
Message 919 - Posted: 16 Jul 2016, 15:03:38 UTC - in response to Message 915.

I doubt that DENIS will make the source code available this time.


? Why?


Just my opinion ...

Having multiple copies of the applications floating around adds workload that the project team does not have the capacity for. They are running very thin on resources already.

When an "optimized" application generates a result, how do the project members using the answer ... know whether the results are "correct" or not?

If the computed "answer" is what they expected ... is it really correct?
If the computed "answer" is not what they expected ... is it really wrong?

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 108
Credit: 403,294
RAC: 0
Message 920 - Posted: 16 Jul 2016, 15:12:38 UTC
Last modified: 16 Jul 2016, 15:15:34 UTC

At the end of calculation, the wus restarted from 0% with this message:

16/07/2016 16:17:08 | DENIS@Home | Task GD_jcarro_20160714202301000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1210.xml_1 exited with zero status but no 'finished' file
16/07/2016 16:17:08 | DENIS@Home | If this happens repeatedly you may need to reset the project.

I killed this WU.

Maybe the problem is this:
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x00007ffad63d2d52

Engaging BOINC Windows Runtime Debugger...
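
As far as I understand it (this is an assumption, not something checked against the BOINC client source), the client only treats an exit status of 0 as a real finish if the application also left a marker file in its slot directory; otherwise it starts the task again. Here is a toy model of that decision. The marker file name and the function are illustrative guesses, not the client's actual code.

```python
import os
import tempfile

# Assumed name of the marker file; treat it as a placeholder.
FINISH_FILE = "boinc_finish_called"


def classify_exit(slot_dir, exit_status):
    """Toy version of the decision the client makes when a task process exits."""
    finished = os.path.exists(os.path.join(slot_dir, FINISH_FILE))
    if exit_status == 0 and finished:
        return "done"
    if exit_status == 0:
        # The case in the log: "exited with zero status but no 'finished' file"
        return "restart"
    return "error"


with tempfile.TemporaryDirectory() as slot:
    print(classify_exit(slot, 0))   # no marker file yet
    open(os.path.join(slot, FINISH_FILE), "w").close()
    print(classify_exit(slot, 0))   # marker file present
```

If that model is right, a task that crashes or is killed near the end without writing the marker would go back to the start, which would match the restart from 0%.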

vaughan
Joined: 9 Apr 15
Posts: 5
Credit: 7,050,653
RAC: 0
Message 921 - Posted: 16 Jul 2016, 15:38:44 UTC

I'm not liking these stupidly long running tasks.

Project Admin please do something to make them shorter ASAP.

rjs5
Joined: 3 Nov 15
Posts: 18
Credit: 473,952
RAC: 0
Message 922 - Posted: 16 Jul 2016, 15:49:36 UTC - in response to Message 917.

Okay, mine are averaging 16 hours each at this point and haven't finished. There's another thread showing somebody getting 'CPU time exceeded' with 24 hours of work.

Admins, do I cancel or what?

At this point, 183 CPU hours of work are up for grabs.



I am running on a Haswell i7-5930K CPU @ 3.50GHz (computer #52542), and so far I am seeing the results fall into 3 general time buckets: 5,500, 11,000 and 24,000 seconds. I installed Windows updates and rebooted while they were running, and the application "restart" times looked rather funny.



State: All (50) · In progress (27) · Validation pending (9) · Validation inconclusive (7) · Valid (4) · Invalid (3) · Error (0)
http://denis.usj.es/denisathome/results.php?hostid=52542

Membran
Joined: 25 Apr 15
Posts: 2
Credit: 57,962
RAC: 0
Message 923 - Posted: 16 Jul 2016, 15:57:30 UTC

These WUs are too long and have no checkpointing! I'm stopping them all!!

[VENETO] boboviz
Joined: 9 Apr 15
Posts: 108
Credit: 403,294
RAC: 0
Message 924 - Posted: 16 Jul 2016, 16:02:21 UTC - in response to Message 919.

When an "optimized" application generates a result, how do the project members using the answer ... know whether the results are "correct" or not?

If the computed "answer" is what they expected ... is it really correct?
If the computed "answer" is not what they expected ... is it really wrong?


Last time, the "combination" of admins + volunteers worked well.
I don't know whether Sesef and the others would have to rewrite large parts of the code to optimize it, or whether small tweaks would be enough without interfering with the scientific part of the project. But I do know that open source lets people spot problems, bugs, etc. in the code when the admin team is small and short on resources.

newman
Joined: 10 Apr 15
Posts: 2
Credit: 167,274
RAC: 0
Message 925 - Posted: 16 Jul 2016, 16:47:26 UTC

Hm, my WUs were also supposed to run for more than a day, but the task properties did show checkpointing. I shut down my computer after 4.5 h of runtime. When I restarted it, all the WUs jumped to 100% and were uploaded. Let's see if they will be validated.

Marcus

Dayle Diamond
Joined: 28 Apr 15
Posts: 12
Credit: 274,008
RAC: 0
Message 928 - Posted: 17 Jul 2016, 3:10:05 UTC

I went ahead and let a lot of them just run to their natural completion.

Got home, and the ones that finished just restarted from the beginning.

They are checkpointing, but they now have a much longer ETA. The longest is 628 days.

I'm going to have to cancel a LOT of work before it gets back to the server. Between the ones that just finished and rolled over and the ones I'm canceling because they're in the same batch and probably about to roll over, I've wasted 12 days, 16 hours and 14 minutes of CPU time.

Admins - if you are going to release long unstable work units, the very least you can do is communicate with your volunteers. This is insulting.

aybiss
Joined: 30 Apr 15
Posts: 6
Credit: 551,507
RAC: 0
Message 929 - Posted: 17 Jul 2016, 5:52:04 UTC

Yeah the WUs are just looping when they reach 100%. I'm cancelling mine.

Rickjb
Joined: 12 Jun 15
Posts: 5
Credit: 446,005
RAC: 0
Message 930 - Posted: 17 Jul 2016, 6:31:38 UTC - in response to Message 908.

Like others who've reported in this thread, I got some long-running 8000-series WUs, and I have now aborted them.
I have stopped accepting new work from DENIS on all but 1 computer until the problems with these WUs are fixed.
On the computer that is still accepting work from DENIS, I will abort any new 8000-series WUs even before they start running, but will run DENIS WUs that are not 8000-series.

Before I left my computers for a break of about 9h, the 8000-series WUs were showing over 80% complete after more than 20h run time each.
When I returned, they were still running and were showing only 23% and 0.07% complete, so I aborted them.

Aborted after 34h elapsed time (31h CPU time) showing as only 23% complete:
GD_jcarro_20160714202309000000_ThirdSimulations_SteadyState8000Schmidt98_conf_1441.xml_1
O/S: Windows 7-x64, CPU: Q9650 @ 3.6GHz

Aborted after 25h elapsed time (23h CPU time) showing as only 0.07% complete:
GD_jcarro_20160714202301000000_ThirdSimulations_SteadyState8000Schmidt98_conf_123.xml_0
O/S: Windows 7-x64, CPU: i7-3770K @ 4.0GHz, HT on.

It would seem that these WUs did what boboviz reported above in Post #920: "exited with zero status but no 'finished' file", and restarted from 0%.
In my experience on World Community Grid, this error happens when something prevents a WU from sending a heartbeat message to the supervising BOINC client.
I can't recall ever seeing this with a WCG WU that had hung internally (e.g. in an infinite loop), and I have completed 1,525,973 WUs there, but it does happen if the O/S (Windows) pauses execution of a WU for too long due to some high-priority system event, such as waiting for slow disc activity.

It's time that we heard from the scientists on this problem, but I guess they are taking a break over the weekend.
But so soon after putting a new version of software and/or a new batch of WUs out here in DC-land, the scientists or techs should log in via the Net a few times over the weekend from wherever they are to check their project. Please, guys!

Col323
Joined: 5 Oct 15
Posts: 17
Credit: 1,335,501
RAC: 0
Message 931 - Posted: 17 Jul 2016, 11:26:59 UTC

My laptop had an 8000-series unit that ran for 22 hours and was at 45% complete. Out of curiosity, I restarted BOINC. Sure enough, progress went back to 0% even though the runtime was cumulative. This laptop can be shut down multiple times in a 24-hour period, making 8000-series units a non-starter.

However, another machine cranked through two 8000 units at 32 hours each. They hit 100% and reported successfully. They are now awaiting validation. (Fingers crossed!)

I would still like to contribute to DENIS. Is it possible to have any of the following?

1) Select the type of work you would like via profile? (e.g. default = short, home = short, medium, long)
2) Optimizations?
3) Bonus points for working a long WU?

Jacob Klein
Joined: 8 Apr 15
Posts: 20
Credit: 50,254
RAC: 0
Message 932 - Posted: 17 Jul 2016, 12:52:46 UTC
Last modified: 17 Jul 2016, 12:53:44 UTC

From reading the stderr.txt files in the slots folders of some of my in-progress tasks... It seems that "resuming from checkpoint" is actually "restarting from scratch", on my 8000-series tasks.

Devs: Is that a bug?

I'm setting No New Tasks for now, as it is nearly impossible for me to guarantee nearly 24 hours of continuous runtime for the tasks on my heavily-hyperthreaded PC.

I need "resuming from checkpoint" to be reliable.
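
To be clear about what I mean by reliable, here is a minimal sketch of the pattern in Python. It is not the DENIS application (which I haven't seen) and does not use the BOINC API; the file layout, function names and the toy "simulation" are all made up for illustration. The two points are that the checkpoint is written to a temporary file and swapped into place atomically, and that a restart loads the saved state instead of starting from step 0.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file first, then atomically replace the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no usable checkpoint exists."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {"step": 0, "total": 0.0}


def run(path, n_steps, stop_after=None, every=100):
    """Toy simulation loop; stop_after simulates the machine being shut down."""
    state = load_checkpoint(path)
    start = state["step"]
    for step in range(start, n_steps):
        if stop_after is not None and step - start >= stop_after:
            return state  # interrupted: work since the last checkpoint is lost
        state["total"] += step * 0.5
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "checkpoint.json")
run(ckpt, 1000, stop_after=250)          # interrupted partway through
print(load_checkpoint(ckpt)["step"])     # last step that was safely saved
print(run(ckpt, 1000)["step"])           # resumes from there and finishes
```

With something like this, a restart only loses the work done since the last checkpoint rather than the whole run.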


Copyright © 2019 Universidad San Jorge