Very long wus

Message boards : Number crunching : Very long wus

Jacob Klein

Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 0
Message 959 - Posted: 20 Jul 2016, 14:42:48 UTC
Last modified: 20 Jul 2016, 14:45:12 UTC

Maybe all the SteadyState ones are affected??
Make sure to test SteadyState tasks.

Here's a SteadyState1000, that also didn't resume from checkpoint correctly:
http://denis.usj.es/denisathome/result.php?resultid=38373900

Here's a SteadyState2000, that also didn't resume from checkpoint correctly:
http://denis.usj.es/denisathome/result.php?resultid=38322366

Here's a SteadyState3000, that also didn't resume from checkpoint correctly:
http://denis.usj.es/denisathome/result.php?resultid=38325794
ID: 959 · Rating: 0
Col323

Joined: 5 Oct 15
Posts: 17
Credit: 1,335,501
RAC: 0
Message 960 - Posted: 20 Jul 2016, 15:56:24 UTC
Last modified: 20 Jul 2016, 15:57:44 UTC

Well, I've picked up GD_jcarro_20160714202359000000_ThirdSimulations_SteadyState8000Schmidt98_conf_916.xml_7 (Yes, _7.) Units 0-6 are a mixed crew of 1 pending valid after 147,000 seconds, 2 Errors while computing, and 4 aborted by user. _6 was actually aborted by mm67 after 43,863.05 seconds, so I'm guessing it's another self-restarted/failed checkpoint restart WU. Since I don't think this machine is going to reboot in the next 36 hours, I'll see what happens.

(For what it's worth, it's running Win 7.)
ID: 960 · Rating: 0
Dayle Diamond

Joined: 28 Apr 15
Posts: 18
Credit: 2,027,862
RAC: 9,881
Message 961 - Posted: 20 Jul 2016, 16:09:53 UTC

And my tasks were on BOTH Windows 10 and Windows Vista (don't ask ><).
ID: 961 · Rating: 0
jcastro

Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 963 - Posted: 20 Jul 2016, 17:25:59 UTC

The app does nothing special beyond storing the values of the current simulation. We had to change the type of one parameter to double, because the longest simulations overflowed its previous size.

I think the problem could be related to reading the file. When the app gets rubbish from the checkpoint file (because it is corrupted, for example), it shows an it==XXX different from zero. But if the app can't read the file, or the file doesn't exist, it starts from zero.
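
A minimal sketch of the kind of check that could separate the two cases just described (rubbish in the checkpoint file versus no readable file at all). The file layout (a single number stored as text), the function name `load_checkpoint_iteration`, and the `max_iterations` bound are all assumptions made for illustration; this is not the DENIS code.

```python
import math
from pathlib import Path


def load_checkpoint_iteration(path, max_iterations):
    """Return the iteration to resume from, or 0 if the checkpoint is unusable.

    Hypothetical layout: the file holds one number, the last completed iteration.
    """
    try:
        # OSError covers a missing or unreadable file; ValueError covers
        # text that is not a number (and undecodable bytes).
        value = float(Path(path).read_text().strip())
    except (OSError, ValueError):
        return 0
    # Reject NaN, infinities, negatives, and values beyond the simulation length,
    # so a corrupted file is treated like a missing one instead of being trusted.
    if not math.isfinite(value) or value < 0 or value > max_iterations:
        return 0
    return int(value)
```

With a check like this, a corrupted file and a missing file both lead to a clean start from zero, and only a plausible value is used to resume.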

We will run some tests using the parameters of the SteadyState8000 simulations, which seem to fail most often, to see if we can reproduce the problem.

Thanks for the information you are giving us; it is incredibly important in a project like DENIS@Home.

Best regards, Joel.
ID: 963 · Rating: 0
Col323

Joined: 5 Oct 15
Posts: 17
Credit: 1,335,501
RAC: 0
Message 965 - Posted: 20 Jul 2016, 18:57:57 UTC - in response to Message 963.  

I appreciate you looking into this and your responsiveness. Please don't take my multiple postings about errors as complaining; I am hoping to be helpful. I understand that failed experiments are part of science. I will keep my machines attached and hope that we can iron out all bugs. Then hopefully we can get more people on board and do more science!
ID: 965 · Rating: 0
Jesús Carro
Project administrator
Project developer
Project scientist

Joined: 18 Mar 15
Posts: 269
Credit: 494,175
RAC: 78
Message 966 - Posted: 20 Jul 2016, 20:09:53 UTC - in response to Message 965.  

Thank you very much, Col323!!
Jesús Carro
Universidad San Jorge
@InSilicoHeart
ID: 966 · Rating: 0
Jacob Klein

Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 0
Message 967 - Posted: 21 Jul 2016, 3:43:29 UTC
Last modified: 21 Jul 2016, 3:44:59 UTC

Here's another v1.08 task that didn't function correctly.
http://denis.usj.es/denisathome/result.php?resultid=38374146

It looks like it:
- Checkpointed up to iteration: 1484766329
- I (presume that I) exited BOINC
- Loaded from checkpoint iteration: 1484766329
- Checkpointed up to iteration: 8456372551
- Called boinc_finish(0)
- Did not exit gracefully
- Continued by restarting the task all over, from iteration: 0
- Checkpointed up to iteration: 1204232792
- I aborted it.
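
A sequence like the one listed above can be spotted mechanically in stderr.txt. The sketch below is only an illustration: it assumes checkpoint lines look like `Doing CP It:8630444.000000`, the format seen in stderr excerpts elsewhere in this thread, and the function name `find_restarts` is made up.

```python
import re

# Assumed line format, taken from stderr excerpts posted in this thread:
#   "Doing CP It:8630444.000000"
CP_LINE = re.compile(r"Doing CP It:(-?\d+(?:\.\d+)?)")


def find_restarts(stderr_text):
    """Return (previous, current) pairs where the checkpoint iteration went backwards."""
    restarts = []
    previous = None
    for line in stderr_text.splitlines():
        match = CP_LINE.search(line)
        if match is None:
            continue  # not a checkpoint line
        current = float(match.group(1))
        # A lower iteration than the last checkpoint means the task started over.
        if previous is not None and current < previous:
            restarts.append((previous, current))
        previous = current
    return restarts


sample = "\n".join([
    "Doing CP It:1484766329.000000",
    "Doing CP It:8456372551.000000",
    "Doing CP It:1204232792.000000",
])
print(find_restarts(sample))  # [(8456372551.0, 1204232792.0)]
```

Any pair in the result marks a point where the task fell back to an earlier iteration instead of finishing.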

I have again set "No New Tasks" for this project, since otherwise it's wasting my CPU resources. I hope you can do extra testing, to get a better handle on solving your issues.

For your reference, I do use the option "Leave non-GPU tasks in memory while suspended", and generally I use Activity->Suspend, before exiting BOINC. The PC that this task was aborted on, is on Windows 10 Insider Build 14393, with BOINC installed as a service. I'm attached to 50+ projects, get work from about 10-15 of them, and yours is the only one not working.

Good luck fixing it. I'll be monitoring this thread.

Thanks,
Jacob Klein
ID: 967 · Rating: 0
Jacob Klein

Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 0
Message 969 - Posted: 21 Jul 2016, 23:36:33 UTC
Last modified: 21 Jul 2016, 23:36:54 UTC

Here's another:
http://denis.usj.es/denisathome/result.php?resultid=38376733

After a couple "unnecessary full restarts", I let it run for a very long time continuously. It eventually called "boinc_finish(0)", and I watched it go from 100% to 0%, and keep wasting my CPU. I then aborted it.

Very frustrating. I hope you can figure it out.
ID: 969 · Rating: 0
Col323

Joined: 5 Oct 15
Posts: 17
Credit: 1,335,501
RAC: 0
Message 973 - Posted: 22 Jul 2016, 2:38:33 UTC - in response to Message 960.  

Well, I've picked up GD_jcarro_20160714202359000000_ThirdSimulations_SteadyState8000Schmidt98_conf_916.xml_7 (Yes, _7.) Units 0-6 are a mixed crew of 1 pending valid after 147,000 seconds, 2 Errors while computing, and 4 aborted by user. _6 was actually aborted by mm67 after 43,863.05 seconds, so I'm guessing it's another self-restarted/failed checkpoint restart WU. Since I don't think this machine is going to reboot in the next 36 hours, I'll see what happens.

(For what it's worth, it's running Win 7.)


Well, after 34+ hours, it reset back to 0%. This WU will be aborted.
ID: 973 · Rating: 0
jcastro

Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 974 - Posted: 22 Jul 2016, 10:21:23 UTC

Hello,

We know how frustrating it is. We have stopped new simulations on CRLP2011, and we are going to do some work on the BETA to test whether the problem is solved. We are also going to see if we can modify the credit system to give you credit for computation time and not just for results.

Your work as volunteers is great, and we will find a way to reward your effort.

Best regards, Joel.
ID: 974 · Rating: 0
Jacob Klein

Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 0
Message 975 - Posted: 22 Jul 2016, 11:02:23 UTC

Thanks, Joel. You are making the right call, in regards to both suspending work and granting credits. Please note that I don't care at all about the credits. I care about my CPU cycles not being wasted. Once you have something for us to test, be sure to let us know in this thread, and I'd be more than happy to unset "No New Tasks" to help test!

- Jacob
ID: 975 · Rating: 0
Col323

Joined: 5 Oct 15
Posts: 17
Credit: 1,335,501
RAC: 0
Message 976 - Posted: 22 Jul 2016, 11:50:18 UTC
Last modified: 22 Jul 2016, 11:54:18 UTC

I also just aborted WU GD_jcarro_20160714201506000000_ThirdSimulations_SteadyState3000Schmidt98_conf_1473.xml_5 which restarted from 0% after 20+ hours. This is the same machine where I aborted the 8000 WU above. Here are the system specs and other info from Boinc startup:

7/22/2016 7:33:04 AM | | Starting BOINC client version 7.2.47 for windows_x86_64
7/22/2016 7:33:04 AM | | log flags: file_xfer, sched_ops, task
7/22/2016 7:33:04 AM | | Libraries: libcurl/7.25.0 OpenSSL/1.0.1 zlib/1.2.6
7/22/2016 7:33:04 AM | | Data directory: C:\ProgramData\BOINC
7/22/2016 7:33:04 AM | | OpenCL: Intel GPU 0: Intel(R) HD Graphics 5500 (driver version 10.18.14.4029, device version OpenCL 2.0, 1298MB, 1298MB available, 58 GFLOPS peak)
7/22/2016 7:33:04 AM | | OpenCL CPU: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz (OpenCL driver vendor: Intel(R) Corporation, driver version 4.2.0.130, device version OpenCL 2.0 (Build 130))
7/22/2016 7:33:04 AM | | Processor: 4 GenuineIntel Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz [Family 6 Model 61 Stepping 4]
7/22/2016 7:33:04 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes nx lm vmx smx tm2 pbe
7/22/2016 7:33:04 AM | | OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00)
7/22/2016 7:33:04 AM | | Memory: 7.69 GB physical, 15.37 GB virtual
7/22/2016 7:33:04 AM | | Disk: 238.47 GB total, 126.87 GB free
7/22/2016 7:33:04 AM | DENIS@Home | URL http://denis.usj.es/denisathome/; Computer ID 62844; resource share 0
7/22/2016 7:33:04 AM | WUProp@Home | URL http://wuprop.boinc-af.org/; Computer ID 89286; resource share 100
7/22/2016 7:33:04 AM | malariacontrol.net | URL http://www.malariacontrol.net/; Computer ID 1667973; resource share 0
7/22/2016 7:33:04 AM | World Community Grid | URL http://www.worldcommunitygrid.org/; Computer ID 3454626; resource share 0
7/22/2016 7:33:04 AM | World Community Grid | General prefs: from World Community Grid (last modified 04-Dec-2015 22:34:24)
7/22/2016 7:33:04 AM | World Community Grid | Computer location: school
7/22/2016 7:33:04 AM | | General prefs: using separate prefs for school
7/22/2016 7:33:04 AM | | Preferences:
7/22/2016 7:33:04 AM | | max memory usage when active: 5904.19MB
7/22/2016 7:33:04 AM | | max memory usage when idle: 7085.03MB
7/22/2016 7:33:04 AM | | max disk usage: 100.00GB
7/22/2016 7:33:04 AM | | (to change preferences, visit a project web site or select Preferences in the Manager)
7/22/2016 7:33:04 AM | | Not using a proxy


In good news, this machine has successfully completed and validated two WUs in the last 24 hours:
GD_jcarro_20160714201352000000_ThirdSimulations_SteadyState2000Schmidt98_conf_1329.xml_4
GD_jcarro_20160714202241000000_ThirdSimulations_SteadyState750Schmidt98_conf_697.xml_3

Not all is futile! And thank you for looking into credit for runtime. Like Jacob, points are not my main motivation for this, but it's nice to know that the project admins feel our pain. I will keep the gates open and look forward to more Beta units.
ID: 976 · Rating: 0
[VENETO] boboviz

Joined: 9 Apr 15
Posts: 171
Credit: 1,371,098
RAC: 1,159
Message 977 - Posted: 22 Jul 2016, 12:18:00 UTC - in response to Message 974.  

We have stopped new simulations on CRLP2011, and we are going to do some work on the BETA.


Is the beta public? Can we help you?
ID: 977 · Rating: 0
Robert Miles

Joined: 6 Nov 15
Posts: 16
Credit: 289,114
RAC: 286
Message 978 - Posted: 22 Jul 2016, 21:23:42 UTC

I have two tasks in progress that currently look like they will take around 30 hours each, even though they were much faster for wingmates.

http://denis.usj.es/denisathome/result.php?resultid=38327994
http://denis.usj.es/denisathome/result.php?resultid=38327983
ID: 978 · Rating: 0
Jacob Klein

Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 0
Message 979 - Posted: 22 Jul 2016, 21:30:06 UTC

Robert:

Take a peek at the stderr.txt files in the slots folders for each of the tasks. Does it look like it started over on its own yet -- Called boinc_finish(0), then restarted all over, from iteration: 0 ?

If so, then the task already looped on you.
ID: 979 · Rating: 0
Robert Miles

Joined: 6 Nov 15
Posts: 16
Credit: 289,114
RAC: 286
Message 980 - Posted: 23 Jul 2016, 1:44:07 UTC - in response to Message 979.  
Last modified: 23 Jul 2016, 1:46:01 UTC

Robert:

Take a peek at the stderr.txt files in the slots folders for each of the tasks. Does it look like it started over on its own yet -- Called boinc_finish(0), then restarted all over, from iteration: 0 ?

If so, then the task already looped on you.


The one that's run the longest (1d 03:36:13 so far) has started from iteration 0 three times, with boinc_finish(0) only for the second time. Appears to have written many checkpoints, but never even tried to restart from any of them.

The other one (08:00:33 so far) had a LoadFromCP: line and many Doing CP It: lines (neither appeared for the other task); it might have started over once with this line:

Doing CP It:1141721694.000000MName:CRLP2011_EPI

The LoadFromCP: line and many Doing CP It: lines appeared only after this line. For example:

CONFIG END
LoadFromCP: it = -61108926450336822563372527118520459954869824153477921108861775942136659934702181719857337284340330335870334963566997252564405702640166564316983857457378844645439069586582998790074838639993323531108678700927868872073358569423293926430060195765021812562838859450354659788764728319443543953744766973355744401009139690042089794374295624386777749770714289398535345414755664712678492880463041987347541668164899656326894573868202976584286750990908534129114056367713914500312506741151236126195573371302493363829376204094012654639576890223915156503899311615033451102374977530949397316493556599647142572678041690981098707287230407648613369922802765808986522957804374255006325707294334423775666314119504340962225474113864588111231918121117143329838396238782059885395750629099224708554121600849788192418829027182617165303882639006072411526342327594615695173430489834120703394266214703107141679310975965168902714464170986440160346608225740042875941626912218973489427210475016641501622287708682557061003680340407964354288286390388074237727984029360654232634834114476488979415416278100123860445074726466111002084059948449630761025313957478722992120956766249317649971955192963197153532208515951869536832636438247754964879638077461186535930605110745765615280251037062323016792468240565262084603506405704871994461857796033083575216604090376837230639444532482950218590251456454487934647574790077878198040299126715513247803145796108548775209423624712758000990822581387119142040017719324535995689259193401576725610999800184712340302901843230111821245365685851957893391215330550378845413793955431030176659864236211686538439282869838040631380559582935377499626523727908375320679226529623582116900513259393812350255681227579323551655286889719880724507783420905760202071257109555957618298368507974416932853568579118149323608596823796094809748749358509212862461105925712222161161336277904449336685538832499059138313780940191585399089618994664892385692241835090032397921850681784132307516824381241672678605583
36731748539133663243651862058957234962281361100085052373855738773432972529285124458453311576088997489353933995968566904939637411159037884497920.000000
Doing CP It:-6110892645033682256337252711852045995486982415347792110886177594213665993470218171985733728434033033587033496356699725256440570264016656431698385745737884464543906958658299879007483863999332353110867870092786887207335856942329392643006019576502181256283885945035465978876472831944354395374476697335574440100913969004208979437429562438677774977071428939853534541475566471267849288046304198734754166816489965632689457386820297658428675099090853412911405636771391450031250674115123612619557337130249336382937620409401265463957689022391515650389931161503345110237497753094939731649355659964714257267804169098109870728723040764861336992280276580898652295780437425500632570729433442377566631411950434096222547411386458811123191812111714332983839623878205988539575062909922470855412160084978819241882902718261716530388263900607241152634232759461569517343048983412070339426621470310714167931097596516890271446417098644016034660822574004287594162691221897348942721047501664150162228770868255706100368034040796435428828639038807423772798402936065423263483411447648897941541627810012386044507472646611100208405994844963076102531395747872299212095676624931764997195519296319715353220851595186953683263643824775496487963807746118653593060511074576561528025103706232301679246824056526208460350640570487199446185779603308357521660409037683723063944453248295021859025145645448793464757479007787819804029912671551324780314579610854877520942362471275800099082258138711914204001771932453599568925919340157672561099980018471234030290184323011182124536568585195789339121533055037884541379395543103017665986423621168653843928286983804063138055958293537749962652372790837532067922652962358211690051325939381235025568122757932355165528688971988072450778342090576020207125710955595761829836850797441693285356857911814932360859682379609480974874935850921286246110592571222216116133627790444933668553883249905913831378094019158539908961899466489238569224183509003239792185068178413230751682438124167267860558336731
748539133663243651862058957234962281361100085052373855738773432972529285124458453311576088997489353933995968566904939637411159037884497920.000000
Doing CP It:-6110892645033682256337252711852045995486982415347792110886177594213665993470218171985733728434033033587033496356699725256440570264016656431698385745737884464543906958658299879007483863999332353110867870092786887207335856942329392643006019576502181256283885945035465978876472831944354395374476697335574440100913969004208979437429562438677774977071428939853534541475566471267849288046304198734754166816489965632689457386820297658428675099090853412911405636771391450031250674115123612619557337130249336382937620409401265463957689022391515650389931161503345110237497753094939731649355659964714257267804169098109870728723040764861336992280276580898652295780437425500632570729433442377566631411950434096222547411386458811123191812111714332983839623878205988539575062909922470855412160084978819241882902718261716530388263900607241152634232759461569517343048983412070339426621470310714167931097596516890271446417098644016034660822574004287594162691221897348942721047501664150162228770868255706100368034040796435428828639038807423772798402936065423263483411447648897941541627810012386044507472646611100208405994844963076102531395747872299212095676624931764997195519296319715353220851595186953683263643824775496487963807746118653593060511074576561528025103706232301679246824056526208460350640570487199446185779603308357521660409037683723063944453248295021859025145645448793464757479007787819804029912671551324780314579610854877520942362471275800099082258138711914204001771932453599568925919340157672561099980018471234030290184323011182124536568585195789339121533055037884541379395543103017665986423621168653843928286983804063138055958293537749962652372790837532067922652962358211690051325939381235025568122757932355165528688971988072450778342090576020207125710955595761829836850797441693285356857911814932360859682379609480974874935850921286246110592571222216116133627790444933668553883249905913831378094019158539908961899466489238569224183509003239792185068178413230751682438124167267860558336731
748539133663243651862058957234962281361100085052373855738773432972529285124458453311576088997489353933995968566904939637411159037884497920.000000

I'm unable to tell whether those lines indicate valid iteration numbers or not.

Neither task has anything I recognize as a checkpoint file.
ID: 980 · Rating: 0
Jacob Klein

Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 0
Message 981 - Posted: 23 Jul 2016, 2:03:26 UTC - in response to Message 980.  
Last modified: 23 Jul 2016, 2:05:16 UTC

:) That one with the large negative iteration numbers ... looks like a v1.07 task. That version had a problem where the iteration number was stored in a variable that was too small (hence the large negative numbers).
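
To illustrate the general effect of a too-small variable, here is a small sketch using an iteration count reported earlier in this thread. Whether v1.07 actually used a 32-bit signed integer is an assumption on my part; the point is only that such a value cannot fit in one, while a double can hold it exactly.

```python
import ctypes

INT32_MAX = 2**31 - 1      # 2147483647, largest 32-bit signed value
iteration = 8456372551     # iteration count reported earlier in this thread

# ctypes does no overflow checking: only the low 32 bits are kept,
# so the stored value wraps around and comes out negative.
wrapped = ctypes.c_int32(iteration).value

# A double represents every integer up to 2**53 exactly.
as_double = float(iteration)

print(iteration > INT32_MAX)   # True
print(wrapped)                 # -133562041
print(as_double == iteration)  # True
```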

You need to cancel your v1.07 tasks, and get some v1.08 tasks.

But, warning, when you get the v1.08 tasks, I'm betting you'll possibly encounter the same issues some of us are seeing -- 1) not resuming from checkpoints, and 2) looping instead of completing.

I've set "No New Tasks". Devs are trying to solve those problems, listed throughout this thread.
ID: 981 · Rating: 0
jcastro

Joined: 16 Mar 15
Posts: 219
Credit: 14,859
RAC: 0
Message 982 - Posted: 23 Jul 2016, 9:05:28 UTC - in response to Message 977.  

We have stopped new simulations on CRLP2011, and we are going to do some work on the BETA.


Is the beta public? Can we help you?


To help us with the beta, you have to enable "Run test applications?" under "DENIS@Home Preferences" in your profile.

We use this separate app to launch a few simulations with unusual parameters or new features that we want to test before adding them to the "stable/official" DENIS simulator.

Best regards, Joel.
ID: 982 · Rating: 0
Jacob Klein

Joined: 8 Apr 15
Posts: 20
Credit: 64,498
RAC: 0
Message 983 - Posted: 23 Jul 2016, 11:03:29 UTC
Last modified: 23 Jul 2016, 11:09:30 UTC

I have changed my web setting, to allow Beta, so I can help test.
It downloaded some "Beta v1.03" tasks. Is this a new executable (with potential fixes), or an old one (from a prior beta test)?

Also, you should consider putting release notes in the News section, for each public and each beta release.

Edit:
Checkpointing still is not working, on the Beta, per below. Dang.
Doing CP It:8630444.000000
Doing CP It:17270947.000000
Doing CP It:25836125.000000
Doing CP It:34510980.000000

... restarted BOINC ...
Checkpoint file not found
Doing CP It:8609762.000000
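
For what it's worth, one common way to make sure a checkpoint file is always present and complete after a restart is to write it to a temporary file and then rename it into place. The sketch below shows that general pattern only; the file layout and the names `save_checkpoint` and `load_checkpoint` are hypothetical, and I am not claiming this is how the DENIS app writes its checkpoints or that this is the cause of the log above.

```python
import os
import tempfile


def save_checkpoint(path, iteration):
    """Write the iteration to a temp file, then rename it over the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{iteration}\n")
        handle.flush()
        os.fsync(handle.fileno())  # push the data to disk before the rename
    # os.replace swaps the file in one step, so a reader sees the old
    # checkpoint or the new one, never a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved iteration, or None if no checkpoint file exists."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except FileNotFoundError:
        return None
```

Saving with `save_checkpoint("state.cp", 34510980)` and then calling `load_checkpoint("state.cp")` after a restart would return the saved iteration, and `None` only if no checkpoint was ever written.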
ID: 983 · Rating: 0
Dr. Merkwürdigliebe

Joined: 23 Dec 15
Posts: 6
Credit: 105,476
RAC: 0
Message 984 - Posted: 23 Jul 2016, 11:24:59 UTC

Beta v1.03 tasks: 20h runtime, not bad...
ID: 984 · Rating: 0