Rechenkraft.net e.V.


All times are UTC + 1 hour [ DST ]




 Post subject: Long running work unit
Posted: 27.02.10 17:59
Idle-Sammler

Joined: 27.02.10 17:48
Posts: 6
http://www.rnaworld.de/rnaworld/workuni ... uid=157852

One of my machines picked up a rather long-running work unit. The machine is a 2.4 GHz Q6600 with 4 GB of RAM running 64-bit Linux, so it should be powerful enough.

It is currently at 6 days of compute time and went high priority yesterday. The estimated time to complete continues to increase and, if it is correct, the work unit will no longer complete before the deadline. Two people have already aborted Workunit 157852, two people have had errors, and 9 of us are still grinding away at it.

I read the thread that says a 2 GHz Athlon will finish all work in a couple of hours. If this is true, then shouldn't the work automatically abort after a day and not tie up a core for a week when you guys give us buggy work?

Are all work units that have run more than 2 hours buggy and need to be aborted? It seems wasteful to tie up resources for so long when they would be more productive doing something else.


Posted: 27.02.10 19:41
Vereinsvorstand

Joined: 07.01.02 01:00
Posts: 16511
Location: anne Lahn
fractal wrote:
I read the thread that says a 2 GHz Athlon will finish all work in a couple of hours. If this is true, then shouldn't the work automatically abort after a day and not tie up a core for a week when you guys give us buggy work?

Well, which thread are you actually talking about? :o I ask because this information is completely incorrect. There are CMC work units out that require more than a week of full-time computing, even on a machine as powerful as yours, just as stated in our FAQ. :wink: Of course, I cannot guarantee that it will finish in time, because you did not give any information on the deadline, but normally your machine should be able to complete this WU; that much I can tell you. :D The progress bar is useful but not strictly reliable: with the very long WUs we do observe a steady increase in the run time estimate while the machine continues to compute. Later on, the remaining run time should start decreasing, or the WU just suddenly finishes validly. Please check in your Linux system monitor whether that WU is a zombie task or not. If not, I would keep it running. I hope this helps a bit.
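For anyone who prefers a script to the system monitor, here is a minimal sketch of the zombie check. It relies on the Linux convention that the third field of `/proc/<pid>/stat` is the process state, with `Z` meaning zombie. The function names and the sample lines are made up for illustration; they are not BOINC or project tooling.

```python
def process_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the LAST ')' before reading the state.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_zombie(stat_line: str) -> bool:
    """True if the stat line describes a zombie (state 'Z')."""
    return process_state(stat_line) == "Z"


# Hypothetical, shortened sample lines (not copied from a real machine).
running = "4321 (cmcalibrate) R 4300 4321 4300 0 -1 4202496"
defunct = "4322 (cmcalibrate) Z 4300 4322 4300 0 -1 4227084"

print(is_zombie(running), is_zombie(defunct))
```

On a real machine you would read the line from `/proc/<pid>/stat` for the PID of the science application instead of using the sample strings.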

Michael.

P.S.: We do not send out any buggy WUs, and no WU requires a manual abort. There are also only very few, if any, reports of errors with this project. Quite frequently, people just do not have enough patience to wait for proper completion. If a WU seems too long to you, it helps overall project performance to abort it early, because we can then send it out again quickly. Anything else will just delay the entire project's progress.

_________________
Cooperation rather than competition is nature's most powerful principle. Or why do you think are eukaryotes the crown of evolution on earth?

Posted: 27.02.10 20:50
Idle-Sammler

Joined: 27.02.10 17:48
Posts: 6
I was referring to viewtopic.php?f=75&t=10592 . Re-reading it, I see you said 47 hrs on a 2 GHz AMD.

It is currently at 145 hrs of processing, and the time to complete has fluctuated between 10 and 12 hrs for the past two days. The report deadline is 5 hrs from now. You might want to take a look at that unit since, as I reported earlier, nobody has finished it, and 64-bit Linux cores that could be used to clean up the backlog of pending work are tied up on it.


Posted: 27.02.10 23:45
Mikrocruncher

Joined: 28.07.09 18:49
Posts: 25
Location: 5335N 00745E
I also have two work units (example) that could not finish within the deadline.

Both take approx. 670 hrs on a 2.5 GHz Phenom II X4 905e, which isn't exactly a slow CPU.
x86_64 Linux reports normal CPU usage, and the task is running normally.

As the deadline has already passed, should I abort them or wait another ~340 hrs for them to complete?
(i.e. will the server still accept them?)

-- edit --

Restarting them seems to have reset their figures. I guess I'll keep an eye on how they proceed (?)

_________________
Scientific Network : 44800 MHz - 77824 MB - 1970 GB


Posted: 28.02.10 01:22
Vereinsvorstand

Joined: 07.01.02 01:00
Posts: 16511
Location: anne Lahn
fractal wrote:
I was referring to viewtopic.php?f=75&t=10592 . Re-reading it, I see you said 47 hrs on a 2 GHz AMD.

No, you are referring to something unrelated. Here we have a CMC WU and not a CMS set of the CSP type which is the topic of the thread you link to.

fractal wrote:
It is currently at 145 hrs of processing, and the time to complete has fluctuated between 10 and 12 hrs for the past two days. The report deadline is 5 hrs from now. You might want to take a look at that unit since, as I reported earlier, nobody has finished it, and 64-bit Linux cores that could be used to clean up the backlog of pending work are tied up on it.

Hmmm, that means the WU must have sat in your queue for quite a long time without being worked on, because the CMC WUs usually have a 14-day deadline. So, if you have invested 145 hrs and the final deadline is in 5 hrs, well, you can do the calculation yourself. I do not know whether it will be completed in time, but I can do a test run on a 955 BE to see what run time is expected. Maybe just keep it running if you do not mind. It might really help us fix a problem.
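To spell out that calculation, here is a small sketch. It assumes the 14-day deadline mentioned above and, as a simplification, that one hour of reported compute time corresponds to one hour of wall-clock time on a dedicated core. The function name is made up for illustration.

```python
DEADLINE_HOURS = 14 * 24  # 14-day deadline, i.e. 336 hours


def idle_hours(compute_hours, hours_to_deadline, deadline_hours=DEADLINE_HOURS):
    """Hours of the deadline window during which the WU was not computed.

    Assumes compute time equals wall-clock time while the task runs.
    """
    elapsed_wall = deadline_hours - hours_to_deadline  # time since download
    return elapsed_wall - compute_hours


# 145 hrs of processing, report deadline 5 hrs away
print(idle_hours(145, 5))  # 186
```

Under these assumptions the WU spent roughly 186 of its 336 hours waiting rather than computing.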

Michael.

[edit]: Run time estimate on an AMD 955 BE (3.0 GHz quad): 85 hrs. I had this WU before; it had taken more than 145 hrs when my machine was accidentally detached from BOINC, which forced a restart. :worry: I noticed that building this WU takes 2.2 GB of RAM! Could you please check whether your machine is swapping RAM to disk due to memory limits? Is it possible that this causes the delay? In that case, we would need to assign higher RAM requirements than we presently do. I checked that the forecast runs of this WU already use 800 MB of memory, so the original run will most likely require more. The problem is that the RAM requirement of the CMC WUs cannot be precomputed. We are therefore in contact with the developers to get a RAM forecast function in a future version.

Posted: 28.02.10 02:12
Vereinsvorstand

Joined: 07.01.02 01:00
Posts: 16511
Location: anne Lahn
FalconFly wrote:
I also have two work units (example) that could not finish within the deadline.

Both take approx. 670 hrs on a 2.5 GHz Phenom II X4 905e, which isn't exactly a slow CPU.
x86_64 Linux reports normal CPU usage, and the task is running normally.

As the deadline has already passed, should I abort them or wait another ~340 hrs for them to complete?
(i.e. will the server still accept them?)

-- edit --

Restarting them seems to have reset their figures. I guess I'll keep an eye on how they proceed (?)

That is strange. My 955 BE estimates a run time of 244 hrs for this WU (will most likely be a bit more) with a RAM usage of up to 1.2 GB (might be higher, will increase and even change up and down during computation). Could you please also check for swapping of RAM to HD?

Michael.

Posted: 28.02.10 05:32
Idle-Sammler

Joined: 27.02.10 17:48
Posts: 6
I think you got it.

Quote:
boinc@g31mx-1:~$ free
             total       used       free     shared    buffers     cached
Mem:       4038988    3770556     268432          0     109184     357996
-/+ buffers/cache:    3303376     735612
Swap:      1253028     916248     336780


BOINC ran some units from another project and then switched to other units of the same project, leaving the earlier ones in memory. This eventually ate all of the memory. I have suspended all other projects until this unit completes.

It is weird. I have told BOINC not to switch between applications, but it just won't listen. It just loads them all into memory and alternates between them. Silly program.
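For reference, here is a rough sketch of how the `free` output quoted above can be turned into a swap-usage figure. It only handles the older `free` layout shown in this post (with the `-/+ buffers/cache` row); the values are in KiB, the default unit of `free`. The function name `parse_free` is made up for illustration.

```python
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:       4038988    3770556     268432          0     109184     357996
-/+ buffers/cache:    3303376     735612
Swap:      1253028     916248     336780
"""


def parse_free(text):
    """Return {'Mem': (total, used, free), 'Swap': (total, used, free)} in KiB."""
    rows = {}
    for line in text.splitlines():
        label, _, rest = line.partition(":")
        # Only the Mem and Swap rows carry total/used/free in that order.
        if label in ("Mem", "Swap"):
            total, used, free = (int(x) for x in rest.split()[:3])
            rows[label] = (total, used, free)
    return rows


rows = parse_free(FREE_OUTPUT)
swap_total, swap_used, _ = rows["Swap"]
swap_pct = 100 * swap_used / swap_total
print(f"swap in use: {swap_pct:.0f}%")
```

With about 73% of swap in use and only about 260 MB of RAM reported free, the box is clearly paging, which fits the slowdown.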


Posted: 28.02.10 09:30
Mikrocruncher

Joined: 28.07.09 18:49
Posts: 25
Location: 5335N 00745E
That wasn't the case for me, I think (at least not yesterday, when I checked on both machines).

cmcalibrate uses only ~700 MB of RAM and is currently running alongside WCG (only 30 MB per task), leaving some 2.8 GB of RAM free and the 2 GB of swap unused.

Even if cmcalibrate went up to 3.5 GB, there would still be no impact on the systems (4 GB RAM).

Since the next attempt is way past the deadline (a failed checkpoint set them back from 340 h of runtime to about 12 min), I think I'll just abort them like all the other users had to.

Posted: 28.02.10 12:48
Vereinsvorstand

Joined: 07.01.02 01:00
Posts: 16511
Location: anne Lahn
No, please keep it running. If the machine is not swapping and no zombie task is detectable, I do not see a reason why the WU should be dead. It should start counting down soon. I just had that situation on my box.

Michael.

Posted: 28.02.10 18:44
Mikrocruncher

Joined: 28.07.09 18:49
Posts: 25
Location: 5335N 00745E
Okidok, I'll let them run.
It'll be at least 4 weeks before they can finish, however - 19h done, ~650h to go :roll:

(I wonder what kind of calibration is done that needs such enormous runtimes?)

Posted: 11.03.10 19:41
Fingerzähler

Joined: 11.03.10 19:27
Posts: 2
And another very long unit (513689, cms_6S6[e]_Monodelphis-domestica-(gray-short-tailed-opossum)_CM000370.lin.EMBL_f_1268060823_33_0), running on a Q6600/2.4GHz, but only 2GB RAM.

It's currently at 63¾ hours @ 9.1%, so ~637 hours/26½ days to go at the current rate; the progress bar is clicking up 0.001% per tick. The deadline is 18/3 (just under 7 days); it's showing 125 hours to go, so BOINC hasn't put it on high priority yet.

Abort (I hate wasting crunching time) or persevere?
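The estimate above is a plain linear extrapolation from elapsed time and progress. A minimal sketch of it follows, assuming progress stays linear (which, going by earlier posts in this thread, is not guaranteed for these WUs). The function name is made up for illustration.

```python
def remaining_hours(elapsed_hours, fraction_done):
    """Linearly extrapolate the remaining run time from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_hours / fraction_done  # projected total run time
    return total - elapsed_hours


# 63.75 hours elapsed at 9.1% progress
rem = remaining_hours(63.75, 0.091)
print(f"{rem:.0f} h, about {rem / 24:.1f} days")
```

That gives roughly 637 hours, far beyond the just-under-7-day deadline, whereas BOINC's own figure of 125 hours would still fit.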


Posted: 11.03.10 20:12
Vereinsvorstand

Joined: 17.12.02 14:09
Posts: 6799
Location: Berlin
Al Dente wrote:
And another very long unit (513689, cms_6S6[e]_Monodelphis-domestica-(gray-short-tailed-opossum)_CM000370.lin.EMBL_f_1268060823_33_0), running on a Q6600/2.4GHz, but only 2GB RAM.

It's currently at 63¾ hours @ 9.1%, so ~637 hours/26½ days to go at the current rate; the progress bar is clicking up 0.001% per tick. The deadline is 18/3 (just under 7 days); it's showing 125 hours to go, so BOINC hasn't put it on high priority yet.

Abort (I hate wasting crunching time) or persevere?

How much RAM does the WU consume?
yoyo

_________________
HELP out in the Rechenkraft wiki; here is what needs doing.
Wiki - FAQ - Association - Chat
