Long running work unit

Everything about the project RNA World
fractal
Idle-Sammler
Posts: 6
Joined: 27.02.2010 17:48

#1 Post by fractal » 27.02.2010 17:59

http://www.rnaworld.de/rnaworld/workuni ... uid=157852

One of my machines picked up a rather long running work unit. The machine is a 2.4 GHz Q6600 with 4 GB of RAM running 64-bit Linux, so it should be powerful enough.

It is currently at 6 days of compute time and went high priority yesterday. The estimated time to complete continues to increase and, if it is correct, the unit will no longer complete before the deadline. Two people have already aborted Workunit 157852, two people have had errors, and 9 of us are still grinding away at it.

I read the thread that says a 2 GHz Athlon will finish all work in a couple of hours. If this is true, then shouldn't the work automatically abort after a day and not tie up a core for a week when you guys give us buggy work?

Are all work units that have run more than 2 hours buggy and need to be aborted? It seems wasteful to tie up resources for so long when they would be more productive doing something else.

Michael H.W. Weber
Vereinsvorstand (association board)
Posts: 22414
Joined: 07.01.2002 01:00
Location: Marpurk

#2 Post by Michael H.W. Weber » 27.02.2010 19:41

fractal wrote: I read the thread that says a 2 GHz Athlon will finish all work in a couple of hours. If this is true, then shouldn't the work automatically abort after a day and not tie up a core for a week when you guys give us buggy work?
Well, which thread are you actually talking about? :o I ask because I need to inform you that this information is completely incorrect. There are CMC work units out there that require more than a week of full-time computing, even on a machine as powerful as the one you are lucky to have, just as stated in our FAQ. :wink: Of course, I cannot guarantee that it will finish in time, because you did not give any information on the deadline, but normally your machine should be able to complete this WU; that much I can tell you. :D The progress bar is good but not strictly reliable: with the very long WUs we do observe a steady increase in the estimated run time while the machine continues to compute. Later on, the remaining run time should start decreasing, or the WU just suddenly finishes validly. Please check in your Linux system monitor whether that WU is a zombie task. If it is not, I would keep it running. I hope this helps a bit.
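For anyone who prefers a script to the system monitor, the zombie check could be sketched roughly as follows. This is only an illustration and not something provided by the project: it assumes a Linux /proc filesystem, the helper names are made up, and the process names in the sample lines are placeholders.

```python
from pathlib import Path


def state_from_stat(stat_line: str) -> str:
    """Extract the one-letter process state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces or
    # ')', so split at the LAST ')' and take the first field after it.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_zombie(pid: int) -> bool:
    """True if the process with this PID is in state 'Z' (zombie). Linux only."""
    return state_from_stat(Path(f"/proc/{pid}/stat").read_text()) == "Z"


# Hypothetical, shortened sample lines; real lines carry many more fields.
running = "4242 (some_task) R 1 4242 4242 0"
defunct = "4243 (odd (name)) Z 1 4243 4243 0"
print(state_from_stat(running), state_from_stat(defunct))
```

Calling is_zombie() with the PID that ps or top shows for the task answers the question directly; a state of R or S means the task is still alive.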

Michael.

P.S.: We do not send out any buggy WUs, and no WU requires a manual abort. There are also very few, if any, error reports for this project. Quite frequently, people simply do not have enough patience to wait for proper completion. In that case, it helps overall project performance if a WU that seems too long is aborted early, because we can then send it out again quickly. Anything else just delays the progress of the entire project.
Promote, cooperate and construct instead of demand, compete and consume.

#3 Post by fractal » 27.02.2010 20:50

I was referring to http://www.rechenkraft.net/phpBB/viewto ... 75&t=10592 . Re-reading it, I see you said 47 hrs on a 2 GHz AMD.

It is currently at 145 hrs of processing, and the time to complete has fluctuated between 10 and 12 hrs for the past two days. The report deadline is 5 hrs from now. You might want to take a look at that unit since, as I reported earlier, nobody has finished it, and 64-bit Linux cores that could be used to clean up the backlog of pending work are tied up on it.

FalconFly
Mikrocruncher
Posts: 25
Joined: 28.07.2009 18:49
Location: 5335N 00745E

#4 Post by FalconFly » 27.02.2010 23:45

I also have two WorkUnits (example) that could not finish within the deadline.

Both take approx. 670 hrs on a 2.5 GHz Phenom II X4 905e, which isn't exactly a slow CPU.
x86_64 Linux reports normal CPU usage and the task is running normally.

As the deadline is already blown, should I abort those or wait another ~340 hrs for them to complete?
(i.e. will the server still accept them?)

-- edit --

Restarting them seems to have reset their figures. I guess I'll keep an eye on how they proceed (?)
Scientific Network : 44800 MHz - 77824 MB - 1970 GB

#5 Post by Michael H.W. Weber » 28.02.2010 01:22

fractal wrote: I was referring to http://www.rechenkraft.net/phpBB/viewto ... 75&t=10592 . Re-reading it, I see you said 47 hrs on a 2 GHz AMD.
No, you are referring to something unrelated. Here we have a CMC WU, not a CMS set of the CSP type, which is the topic of the thread you link to.
fractal wrote: It is currently at 145 hrs of processing, and the time to complete has fluctuated between 10 and 12 hrs for the past two days. The report deadline is 5 hrs from now. You might want to take a look at that unit since, as I reported earlier, nobody has finished it, and 64-bit Linux cores that could be used to clean up the backlog of pending work are tied up on it.
Hmmm, that means you must have had that WU in your queue for quite a while without working on it, because the CMC WUs usually have a 14-day deadline. So, if you have invested 145 hrs and the final deadline is in 5 hrs - well, you can do the math yourself. I do not know whether it will be completed in time, but I can do a test run on a 955 BE to see what run time is expected. Maybe just keep it running if you do not mind. It might really help us fix a problem.
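Spelled out, the calculation hinted at above might look like the sketch below. It assumes the WU was issued with the full 14-day deadline and that the 145 hrs of compute time correspond roughly to wall-clock time on one core; the function name is made up for illustration.

```python
def hours_not_computing(deadline_days: float,
                        compute_hours: float,
                        hours_until_deadline: float) -> float:
    """Wall-clock hours since issue during which the WU was NOT being computed.

    Assumes the full deadline window started when the WU was downloaded and
    that compute time advances at about one hour per wall-clock hour.
    """
    hours_since_issue = deadline_days * 24 - hours_until_deadline
    return hours_since_issue - compute_hours


# Figures from this thread: 14-day deadline, 145 hrs done, 5 hrs left.
idle = hours_not_computing(14, 145, 5)
print(f"{idle:.0f} hrs (~{idle / 24:.1f} days) in the queue without progress")
```

With these numbers, 336 hrs were available, 331 have passed, and only 145 of them went into the WU, which leaves 186 hrs during which it sat in the queue or was preempted.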

Michael.

[edit]: Run time estimate on an AMD 955 BE (3.0 GHz quad): 85 hrs. I had this WU before; it had run for more than 145 hrs when my machine was accidentally detached from BOINC, so it had to restart. :worry: I noticed that building this WU takes 2.2 GB of RAM! Could you please check whether your machine is swapping RAM to disk due to memory limits? Is it possible that this causes the delay? In that case, we would need to assign higher RAM requirements than we presently do. I checked that the forecast runs of this WU already use 800 MB of memory, so the original run will most likely require more. The problem is that the RAM requirement of the CMC WUs cannot be precomputed. We are therefore in contact with the developers to get a RAM forecast function in a future version.

#6 Post by Michael H.W. Weber » 28.02.2010 02:12

FalconFly wrote: I also have two WorkUnits (example) that could not finish within the deadline.

Both take approx. 670 hrs on a 2.5 GHz Phenom II X4 905e, which isn't exactly a slow CPU.
x86_64 Linux reports normal CPU usage and the task is running normally.

As the deadline is already blown, should I abort those or wait another ~340 hrs for them to complete?
(i.e. will the server still accept them?)

-- edit --

Restarting them seems to have reset their figures. I guess I'll keep an eye on how they proceed (?)
That is strange. My 955 BE estimates a run time of 244 hrs for this WU (it will most likely be a bit more) with a RAM usage of up to 1.2 GB (it might be higher, and it will rise and fall during computation). Could you please also check whether RAM is being swapped to disk?

Michael.

#7 Post by fractal » 28.02.2010 05:32

I think you got it.
boinc@g31mx-1:~$ free
             total       used       free     shared    buffers     cached
Mem:       4038988    3770556     268432          0     109184     357996
-/+ buffers/cache:    3303376     735612
Swap:      1253028     916248     336780
BOINC ran some units from another project and then switched to other units of the same project, leaving the earlier ones in memory. This eventually ate all of the memory. I have suspended all other projects until this unit completes.

It is weird. I have told BOINC not to switch between applications, but it just won't listen. It just loads them all into memory and alternates between them. Silly program.
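For others who want to run the same check, here is a rough sketch that reads the swap figures out of `free` output. It is an illustration only: the function names are made up, and the sample text is simply the output posted above (values in KiB).

```python
def parse_free(output: str) -> dict:
    """Return {'Mem': (total, used, free), 'Swap': (...)} from `free` output."""
    stats = {}
    for line in output.splitlines():
        fields = line.split()
        # Only the 'Mem:' and 'Swap:' rows are of interest; the
        # '-/+ buffers/cache:' row of older `free` versions is skipped.
        if fields and fields[0] in ("Mem:", "Swap:"):
            total, used, free = (int(x) for x in fields[1:4])
            stats[fields[0].rstrip(":")] = (total, used, free)
    return stats


def swap_used_fraction(output: str) -> float:
    """Fraction of swap space in use (0.0 if no swap is configured)."""
    total, used, _ = parse_free(output)["Swap"]
    return used / total if total else 0.0


SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:       4038988    3770556     268432          0     109184     357996
-/+ buffers/cache:    3303376     735612
Swap:      1253028     916248     336780
"""

print(f"swap in use: {swap_used_fraction(SAMPLE):.0%}")
```

On the figures above, roughly three quarters of the swap space is occupied, which fits the slowdown described in this thread. A large amount of swap in use is only a hint, though; watching whether it keeps changing while the task runs is the more telling sign.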

#8 Post by FalconFly » 28.02.2010 09:30

I don't think that was the case for me (at least not yesterday, when I checked on both machines).

cmcalibrate uses only ~700 MB of RAM. It is currently running alongside WCG (only 30 MB per task), leaving some 2.8 GB of RAM free and 2 GB of swap unused.

Even if cmcalibrate went up to 3.5 GB, there still would be no impact on the systems (4 GB RAM).

Since the next attempt would finish way past the deadline (a failed checkpoint set them back from 340 h of runtime to about 12 min), I think I'll just abort them like all the other users had to.

#9 Post by Michael H.W. Weber » 28.02.2010 12:48

No, please keep it running. If the machine is not swapping and no zombie task is detectable, I do not see a reason why the WU should be dead. It should start counting down soon. I just had that situation on my box.

Michael.

#10 Post by FalconFly » 28.02.2010 18:44

Okidok, I'll let them run.
It'll be at least 4 weeks before they can finish, however - 19h done, ~650h to go :roll:

(I wonder what kind of calibration is done that needs such enormous runtimes?)

Al Dente
Fingerzähler
Posts: 2
Joined: 11.03.2010 19:27

#11 Post by Al Dente » 11.03.2010 19:41

And another very long unit (513689, cms_6S6[e]_Monodelphis-domestica-(gray-short-tailed-opossum)_CM000370.lin.EMBL_f_1268060823_33_0), running on a Q6600/2.4GHz, but only 2GB RAM.

It's currently at 63¾ hours @ 9.1%, so ~637 hours/26½ days to go at the current rate; the progress bar is clicking up 0.001% per tick. The deadline is 18/3 (just under 7 days); it's showing 125 hours to go, so BOINC hasn't put it on high priority yet.
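The estimate above is a plain linear extrapolation, which can be written out as the sketch below. It assumes progress stays proportional to elapsed time, which the replies in this thread suggest is not reliable for these WUs; the function name is made up.

```python
def remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation: hours still needed if progress stays constant."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


# Figures from the post: 63.75 hours elapsed at 9.1% progress.
left = remaining_hours(63.75, 0.091)
print(f"~{left:.0f} hours (~{left / 24:.1f} days) to go")
```

That reproduces the roughly 637 hours, or 26½ days, quoted above, far more than the just under 7 days left until the deadline.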

Abort (I hate wasting crunching time) or persevere?

yoyo
Vereinsvorstand (association board)
Posts: 8043
Joined: 17.12.2002 14:09
Location: Berlin

#12 Post by yoyo » 11.03.2010 20:12

Al Dente wrote: And another very long unit (513689, cms_6S6[e]_Monodelphis-domestica-(gray-short-tailed-opossum)_CM000370.lin.EMBL_f_1268060823_33_0), running on a Q6600/2.4GHz, but only 2GB RAM.

It's currently at 63¾ hours @ 9.1%, so ~637 hours/26½ days to go at the current rate; the progress bar is clicking up 0.001% per tick. The deadline is 18/3 (just under 7 days); it's showing 125 hours to go, so BOINC hasn't put it on high priority yet.

Abort (I hate wasting crunching time) or persevere?
How much RAM does the WU consume?
yoyo
HELP out in the Rechenkraft wiki; here is what needs doing.