Long running work unit

Everything about the project RNA World

Post by fractal » 27.02.2010 17:59

http://www.rnaworld.de/rnaworld/workuni ... uid=157852

One of my machines picked up a rather long-running work unit. The machine is a 2.4 GHz Q6600 with 4 GB of RAM running 64-bit Linux, so it should be powerful enough.

It is currently at 6 days of compute time and went high priority yesterday. The estimated time to complete continues to increase and, if it is correct, the unit will no longer finish before the deadline. Two people have already aborted Workunit 157852, two people have had errors, and 9 of us are still grinding away at it.

I read the thread that says a 2 GHz Athlon will finish all work in a couple of hours. If this is true, then shouldn't the work automatically abort after a day and not tie up a core for a week when you guys give us buggy work?

Are all work units that have run more than 2 hours buggy and need to be aborted? It seems wasteful to tie up resources for so long when they would be more productive doing something else.

Re: Long running work unit

Post by Michael H.W. Weber » 27.02.2010 19:41

fractal wrote: I read the thread that says a 2 GHz Athlon will finish all work in a couple of hours. If this is true, then shouldn't the work automatically abort after a day and not tie up a core for a week when you guys give us buggy work?

Well, which thread are you actually talking about? :o I ask because I need to tell you that this information is completely incorrect. There are CMC work units out that require more than a week of full-time computing, even on a machine as powerful as the one you are lucky to have, just as stated in our FAQ. :wink: Of course, I cannot guarantee that it will finish in time, because you did not give any information on the deadline, but normally your machine should be able to complete this WU; that much I can tell you. :D The progress bar is good but not strictly reliable: with the very long WUs we do observe a steady increase in the run time estimate while the machine continues to compute. Later on, the remaining run time should start decreasing, or the WU just suddenly finishes validly. Please check in your Linux system monitor whether that WU is a zombie task. If it is not, I would keep it running. I hope this helps a bit.

Michael.

P.S.: We do not send out any buggy WUs, and no WU requires a manual abort. There are also only very few, if any, reports of errors with this project. Quite frequently, people just do not have enough patience to wait for proper completion. In that case, it helps overall project performance if you delete a WU early when it seems too long, because then we can send it out again quickly. Anything else will just delay the progress of the entire project.
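
For readers who prefer a script to the system monitor, the zombie check mentioned above can be sketched roughly as follows. This is only an illustration, assuming a Linux host where the state of a process is the field that follows the parenthesised command name in `/proc/<pid>/stat`. The function names and the two sample lines (including their PIDs) are made up for the example and do not come from any real task.

```python
def process_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces, so split only after the LAST closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def is_zombie(stat_line):
    """True if the state field is 'Z' (zombie / defunct)."""
    return process_state(stat_line) == "Z"


# Hypothetical sample lines, shortened; a real stat line has many more fields.
SAMPLE_RUNNING = "4242 (cmcalibrate) R 4200 4242 4200 0 -1 4194304"
SAMPLE_ZOMBIE = "4243 (cmcalibrate) Z 4200 4243 4200 0 -1 4194304"

if __name__ == "__main__":
    for line in (SAMPLE_RUNNING, SAMPLE_ZOMBIE):
        print(line.split()[0], "zombie" if is_zombie(line) else "not a zombie")
```

On a real machine the input line would come from reading `/proc/<pid>/stat` for the PID of the science application. A state of `R` or `S` together with steadily growing CPU time suggests the task is still alive.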
Support, don't demand. Cooperate instead of competing. Build instead of consuming.


Re: Long running work unit

Post by fractal » 27.02.2010 20:50

I was referring to viewtopic.php?f=75&t=10592 . Re-reading it, I see you said 47 hrs on a 2 GHz AMD.

It is currently at 145 hrs of processing, and the time to complete has fluctuated between 10 and 12 hrs for the past two days. The report deadline is 5 hrs from now. You might want to take a look at that unit since, as I reported earlier, nobody has finished it, and 64-bit Linux cores that could be used to clear the backlog of pending work are tied up on it.

Re: Long running work unit

Post by FalconFly » 27.02.2010 23:45

I also have two WorkUnits ( example ) that could not finish within the deadline.

Both take approx. 670 hrs on a 2.5 GHz Phenom II X4 905e, which isn't exactly a slow CPU.
x86_64 Linux reports normal CPU usage, and the task is running normally.

As the deadline has already passed, should I abort them or wait another ~340 hrs for them to complete?
(i.e. will the server still accept them?)

-- edit --

Restarting them seems to have reset their figures. I guess I'll keep an eye on how they proceed (?)
Scientific Network : 44800 MHz - 77824 MB - 1970 GB

Re: Long running work unit

Post by Michael H.W. Weber » 28.02.2010 01:22

fractal wrote: I was referring to viewtopic.php?f=75&t=10592 . Re-reading it, I see you said 47 hrs on a 2 GHz AMD.

No, you are referring to something unrelated. Here we have a CMC WU, not a CMS set of the CSP type, which is the topic of the thread you link to.

fractal wrote: It is currently at 145 hrs of processing, and the time to complete has fluctuated between 10 and 12 hrs for the past two days. The report deadline is 5 hrs from now. You might want to take a look at that unit since, as I reported earlier, nobody has finished it, and 64-bit Linux cores that could be used to clear the backlog of pending work are tied up on it.

Hmmm, that means you must have had that WU in your queue for quite a while without working on it, because the CMC WUs usually have a 14-day deadline. So, if you have invested 145 hrs and the final deadline is in 5 hrs, well, you can do the calculation yourself. I do not know whether it will be completed in time, but I can do a test run on a 955 BE to see what run time to expect. Maybe just keep it running if you do not mind. It might really help us fix a problem.

Michael.

[edit]: Run time estimate on an AMD 955 BE (3.0 GHz quad-core): 85 hrs. I had this WU before; it had taken more than 145 hrs when my machine was accidentally detached from BOINC, so it had to restart. :worry: I noticed that building this WU takes 2.2 GB of RAM! Could you please check whether your machine is swapping RAM to disk due to memory limits? Is it possible that this causes the delay? In that case, we would need to assign higher RAM requirements than we presently use. I checked that the forecast runs of this WU already use 800 MB of memory, so the original run will most likely require more. The problem is that the RAM requirement of the CMC WUs cannot be precomputed. We are therefore in contact with the developers to get a RAM forecast function in a future version.

Re: Long running work unit

Post by Michael H.W. Weber » 28.02.2010 02:12

FalconFly wrote: I also have two WorkUnits ( example ) that could not finish within the deadline.

Both take approx. 670 hrs on a 2.5 GHz Phenom II X4 905e, which isn't exactly a slow CPU.
x86_64 Linux reports normal CPU usage, and the task is running normally.

As the deadline has already passed, should I abort them or wait another ~340 hrs for them to complete?
(i.e. will the server still accept them?)

-- edit --

Restarting them seems to have reset their figures. I guess I'll keep an eye on how they proceed (?)

That is strange. My 955 BE estimates a run time of 244 hrs for this WU (will most likely be a bit more) with a RAM usage of up to 1.2 GB (might be higher, will increase and even change up and down during computation). Could you please also check for swapping of RAM to HD?

Michael.

Re: Long running work unit

Post by fractal » 28.02.2010 05:32

I think you got it.

boinc@g31mx-1:~$ free
             total       used       free     shared    buffers     cached
Mem:       4038988    3770556     268432          0     109184     357996
-/+ buffers/cache:    3303376     735612
Swap:      1253028     916248     336780


BOINC ran some units from another project and switched to other units of the same project, leaving them all in memory. This eventually ate all of the memory. I have suspended all other projects until this unit completes.

It is weird. I have told BOINC not to switch between applications, but it just won't listen. It just loads them all up in memory and alternates between them. Silly program.
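
As a rough illustration of how the `free` figures above can be read, here is a small sketch that parses the classic `free` layout (values in KiB, with the `-/+ buffers/cache` line) and reports how much swap and application memory is in use. The function names and dictionary keys are invented for this example, and the parser assumes exactly the older output format shown in this post; newer versions of `free` print different columns.

```python
SAMPLE = """\
total used free shared buffers cached
Mem: 4038988 3770556 268432 0 109184 357996
-/+ buffers/cache: 3303376 735612
Swap: 1253028 916248 336780
"""


def parse_free(text):
    """Parse classic `free` output (KiB) into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "Mem:":
            stats["mem_total"] = int(parts[1])
        elif parts[0] == "-/+":
            # Layout: "-/+ buffers/cache: <used> <free>"
            stats["app_used"] = int(parts[2])
            stats["app_free"] = int(parts[3])
        elif parts[0] == "Swap:":
            stats["swap_total"] = int(parts[1])
            stats["swap_used"] = int(parts[2])
    return stats


def swap_used_fraction(stats):
    """Fraction of swap space in use (0.0 if no swap is configured)."""
    if stats["swap_total"] == 0:
        return 0.0
    return stats["swap_used"] / stats["swap_total"]


def app_used_fraction(stats):
    """Fraction of RAM used by applications, excluding buffers and cache."""
    return stats["app_used"] / stats["mem_total"]


if __name__ == "__main__":
    s = parse_free(SAMPLE)
    print(f"swap in use:        {swap_used_fraction(s):.0%}")
    print(f"RAM used by apps:   {app_used_fraction(s):.0%}")
```

With the numbers from this post, roughly three quarters of the swap space is occupied while applications hold about four fifths of physical RAM, which is consistent with the machine swapping.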

Re: Long running work unit

Post by FalconFly » 28.02.2010 09:30

I don't think that was the case for me (at least not yesterday, when I checked both machines).

cmcalibrate uses only ~700 MB of RAM; it is currently running alongside WCG (only 30 MB per task), leaving some 2.8 GB of RAM free and 2 GB of swap unused.

Even if cmcalibrate went up to 3.5 GB, there would still be no impact on the systems (4 GB RAM).

Since the next attempt is way past the deadline (a failed checkpoint set them back from 340 h of run time to about 12 min), I think I'll just abort them like all the other users had to.

Re: Long running work unit

Post by Michael H.W. Weber » 28.02.2010 12:48

No, please keep it running. If the machine is not swapping and no zombie task is detectable, I do not see a reason why the WU should be dead. It should start counting down soon. I just had that situation on my box.

Michael.

Re: Long running work unit

Post by FalconFly » 28.02.2010 18:44

Okidok, I'll let them run.
It'll be at least 4 weeks before they can finish, however - 19h done, ~650h to go :roll:

(I wonder what kind of calibration is done that needs such enormous runtimes?)

Re: Long running work unit

Post by Al Dente » 11.03.2010 19:41

And another very long unit (513689, cms_6S6[e]_Monodelphis-domestica-(gray-short-tailed-opossum)_CM000370.lin.EMBL_f_1268060823_33_0), running on a Q6600/2.4GHz, but only 2GB RAM.

It's currently at 63¾ hours @ 9.1%, so ~637 hours/26½ days to go at the current rate; the progress bar is clicking up 0.001% per tick. The deadline is 18/3 (just under 7 days); it's showing 125 hours to go, so BOINC hasn't put it on high priority yet.

Abort (I hate wasting crunching time) or persevere?
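
The extrapolation in this post can be reproduced with a few lines. This is only a sketch that assumes progress stays linear, which the thread itself notes is not strictly reliable for very long WUs. The function names are made up for the example, and the 168-hour figure is simply 7 days used as an upper bound for "just under 7 days".

```python
def remaining_hours(elapsed_hours, fraction_done):
    """Linear extrapolation of the remaining run time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1 - fraction_done) / fraction_done


def fits_deadline(elapsed_hours, fraction_done, hours_until_deadline):
    """True if the linear estimate finishes before the deadline."""
    return remaining_hours(elapsed_hours, fraction_done) <= hours_until_deadline


if __name__ == "__main__":
    # Figures from the post: 63.75 hours elapsed at 9.1% progress.
    left = remaining_hours(63.75, 0.091)
    print(f"about {left:.0f} hours ({left / 24:.1f} days) to go")
    # Assumption: 168 hours (7 days) as an upper bound for the deadline.
    print("fits deadline:", fits_deadline(63.75, 0.091, 168))
```

This matches the ~637 hours and 26½ days quoted above, far beyond both the deadline and the 125 hours that BOINC is showing.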

Re: Long running work unit

Post by yoyo » 11.03.2010 20:12

Al Dente wrote: And another very long unit (513689, cms_6S6[e]_Monodelphis-domestica-(gray-short-tailed-opossum)_CM000370.lin.EMBL_f_1268060823_33_0), running on a Q6600/2.4GHz, but only 2GB RAM.

It's currently at 63¾ hours @ 9.1%, so ~637 hours/26½ days to go at the current rate; the progress bar is clicking up 0.001% per tick. The deadline is 18/3 (just under 7 days); it's showing 125 hours to go, so BOINC hasn't put it on high priority yet.

Abort (I hate wasting crunching time) or persevere?

How much RAM does the WU consume?
yoyo
