Long running work unit

Everything about the project RNA World
Bold_Seeker
Idle-Sammler
Posts: 3
Registered: 20.09.2011 17:56

Re: Long running work unit

#289 Post by Bold_Seeker » 20.09.2011 18:00

Hello,

Can you please extend the deadlines for the following two WUs?

task id
13108396
13082277

These giants need another 300 hrs of crunching time before
BOINC "thinks" they are ready.

Thanks in advance!

yoyo
Vereinsvorstand
Posts: 8047
Registered: 17.12.2002 14:09
Location: Berlin

Re: Long running work unit

#290 Post by yoyo » 20.09.2011 22:24

done
HELP out on the Rechenkraft wiki; here is what needs doing.

Beyond
Prozessor-Polier
Posts: 111
Registered: 02.02.2008 01:48
Location: Rum River watershed, MN, USA

Re: Long running work units

#291 Post by Beyond » 22.09.2011 17:56

Beyond wrote: Also as requested I've left this task running (on a 5th machine BTW):
Beyond wrote: In addition I have one WU that has been sitting at 100% completion for the last few days. It's still using a full core of CPU and is at 179 hours. What to do with it?:

cms_GA[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_AP008216.lin.EMBL_RF00028_Intron_gpI_1307967123_54908_18
It's now at 323 hours and still showing 100%. Looking at the WU link shows it has been handed out 21 times with no successful completions. Keep running it or abort? If keep running, for how long?
Thanks/Beyond
The machine restarted due to a power glitch at a little over 350 hours runtime and then ran another 27+ hours before I aborted it. It's been handed out 22 times and has no completions. Now it says "too many errors" but has just been sent to a new victim:

http://www.rnaworld.de/rnaworld/workuni ... id=5286987

That brings the wasted time to ~1200 hours for the 5 WUs on 5 different machines that I've been tracking. Four failures for unknown reasons and one due to a power outage. That's just for my few boxes in the last week. How much CPU time is being wasted project-wide?

Regards/Beyond

Ananas
WU-Schieber
Posts: 1184
Registered: 27.04.2008 18:37
Location: Nordlichter Köln

Re: Long running work unit

#292 Post by Ananas » 23.09.2011 00:31

The "rice" WUs are hard. This is the "worst" one I have had so far:

http://www.rnaworld.de/rnaworld/workuni ... id=5286011

None of the hosts that tried your result has come even close to that calculation time so far. There are those timeouts, though, and the hosts behind them are probably still working on that monster.

Edit: This one is an ancient BOINC server-side bug, not this project's fault. The Berkeley guys refuse to fix it :-(
Beyond wrote: ... Now it says "too many errors" but has just been sent to a new victim ...
vi BOINC/checkin_notes
:1,$s/bug/feature/g
:wq!

Do biologists actually tell each other small-RNA jokes?

phoenicis
Fingerzähler
Posts: 2
Registered: 31.08.2011 20:29

Re: Long running work unit

#293 Post by phoenicis » 25.09.2011 10:17

Looks like I need another extension on Task ID 12905790. It's still crunching away after almost 39 days.

yoyo
Vereinsvorstand
Posts: 8047
Registered: 17.12.2002 14:09
Location: Berlin

Re: Long running work unit

#294 Post by yoyo » 25.09.2011 10:25

phoenicis wrote: Looks like I need another extension on Task ID 12905790. It's still crunching away after almost 39 days.
30 days added.
yoyo

Michael H.W. Weber
Vereinsvorstand
Posts: 22431
Registered: 07.01.2002 01:00
Location: Marpurk

Re: Long running work units

#295 Post by Michael H.W. Weber » 26.09.2011 20:58

Beyond wrote: ... All 4 of the above WUs have since restarted with the same message in stderr.txt: "No heartbeat from core client for 30 sec - exiting". That's well over 800 hours of wasted CPU time. These WUs are on 4 different machines. All are Win7-64, BOINC 6.12.34, 3 are X6, 1 is an X4. All 4 machines show over a month uptime since last reboot. BOINC task duration in Process Lasso shows that it has not restarted. System times are set to NIST automatically via Win7 but never vary more than a second between updates. No problems whatsoever with any other projects. These huge tasks with no checkpointing are unacceptable IMO.
A few notes on that: first, this heartbeat problem has been known for a long time and is an intrinsic property of BOINC which, unfortunately, we cannot fix. There have been plenty of complaints about this issue, and the BOINC developers declared it an important feature to be kept in the code, if I remember correctly. Ananas here on this forum will know more details.
So I am really sorry that we can't be more helpful with this one, but I can assure you that you are wrong if you assume this issue occurs exclusively with the RNA World project.

Regarding the checkpointing, I must confess I am a bit tired of repeating over and over again what has long been communicated and is also part of the FAQ (although here we could surely incorporate some more of the details), but let me say it once again: RNA World does support checkpointing on Linux machines running a 32-bit system for which memory randomization has been disabled. For all other types of machines, no checkpointing is available, and there will be none until checkpointing has become a STANDARD FEATURE OF BOINC, which I actually consider a minimum requirement for such an infrastructure; at present, however, we have no choice but to use what is available.
To alleviate that lack of checkpointing, given that we are not running uppercase here but complex computations with really, really serious hardware and runtime requirements (RNA World is probably the "hardest" DC project on the planet), we have at least managed to take the time to establish a dual-application approach in which the very demanding tasks are sent exclusively to systems that opted in to XXL CMSEARCH, while all other devices receive the much less demanding S application.
So, in conclusion, all you need to do is allow only S tasks if you are not willing to have the demanding tasks sent to your machine.
These are the facts and the measures to take, and you can believe me when I tell you that I am happy about every single person who allows the hot stuff to run on his/her machine. We realize that it is disappointing when tasks fail to complete, and we take that risk ourselves.
Beyond wrote: ...Oryza-sativa-Japonica-Group_AP008216.lin.EMBL_RF00028_Intron_gpI... is now at 323 hours and still showing 100%.
This is normal behaviour and nothing to worry about. What you should worry about is a situation in which the task manager shows no more CPU activity for a task. The task will probably run over 1000 hrs but I can't tell you exactly because if I could, it would show up in your progress bar correctly. Here again: it is impossible to calculate an exact runtime estimate due to intrinsic properties of the CMSEARCH code which includes stochastic elements.

Michael.
Support, cooperate and construct instead of demand, compete and consume.


Beyond
Prozessor-Polier
Posts: 111
Registered: 02.02.2008 01:48
Location: Rum River watershed, MN, USA

Re: Long running work units

#296 Post by Beyond » 29.09.2011 15:55

Michael H.W. Weber wrote: we have at least managed to take the time to establish a dual-application approach in which the very demanding tasks are sent exclusively to systems that opted in to XXL CMSEARCH, while all other devices receive the much less demanding S application.
There was a considerable period when only XXL WUs were available and no S WUs were being sent out. Since this is a Vault project we pretty much have to keep running it. Suffice it to say that many team members became very irritated and quit running RNA altogether. Realize that for every user running your project and commenting when they have a problem, there are 10 who simply quit. If I were you I'd be happy when someone asks for fixes, even if they've been requested many times before. It tells you what's important to those donating to your project. It's costing us a lot of money in equipment and power bills to run the project, and being polite and accommodating to people donating expensive resources will more likely motivate them to stay.

I'll relate a story about a different and perhaps useful way. Some of us were being sponsored by a local equipment manufacturer. After about a year everyone else was dropped from the program even though some of them were considerably faster. The company CEO later told me that it was because I was the only one telling them what was wrong with their product. That's what they wanted to know: what they needed to improve. Probably the attitude that made them successful.

Michael H.W. Weber
Vereinsvorstand
Posts: 22431
Registered: 07.01.2002 01:00
Location: Marpurk

Re: Long running work unit

#297 Post by Michael H.W. Weber » 29.09.2011 20:44

I really appreciate your and others' criticism at any time, and I hope, as you can also see above, that I am doing my best to respond to it and answer your questions, as I (and others from our team) am simultaneously doing in several external team forums. It appears to me that you perceived my wording above as somewhat impolite (I was actually more "uneasy" with the way BOINC is managed with respect to the "heartbeat issue" (it really annoys us over and over again!) than with your postings)? If that is the case, I can assure you that this was definitely not my intention and is probably just the result of someone writing in a foreign language (the written word always differs from the spoken one). :good: Still, I hope that I have clarified the issues exhaustively, at least in a technical way.

Michael.

P.S.: Please note that I myself have contributed to a broad set of DC projects for about 10 years. I therefore believe that I know the perspective of the donor fairly well. In other words: you are certainly not dealing with someone who just wants his stuff done without caring about anything else.
To me, however, the question sometimes arises whether the donors can also imagine the perspective of a project developer. For example, the simple fact that funding and manpower are limited? The fact that WUs are not handed out for the sake of having WUs available for people to pile up virtual credits but rather to answer a (scientific) question? The fact that DC project developers are not BOINC developers but just use it as a general infrastructure to achieve their goal? The fact that using a DC approach does not necessarily mean that the project lead is an IT specialist? Just to mention a few aspects...
You can't imagine how many nice things would be possible with this project if only there were enough time (I have a job unrelated to this project and would wish to make this project part of my daily work, and at least the former is true for all my colleagues on board here) and manpower (which unfortunately requires funding) to implement all the nice ideas.
I wish I could press a button and checkpointing would suddenly be there and RAM demands would be reduced 10-fold. But that is not reality at present. :roll:
At least, not yet. :P

ftpd

Re: Long running work unit

#298 Post by ftpd » 30.09.2011 21:00

ftpd wrote: Hej YoYo,

I have one intron WU, task-id = 13121683, host-id = 9125.
The deadline is 24/09 and it has been running for more than 320 hrs.

Please extend the deadline.

Thx,

Ton

PS Good to have small-wu's again. Thx!
Hej YoYo,

Job finished in 585 hrs. I have a new one saying 2974 hrs to go. Correct? Same machine! Finish date 18/10/2011

Thx - Ton

Beyond
Prozessor-Polier
Posts: 111
Registered: 02.02.2008 01:48
Location: Rum River watershed, MN, USA

Re: Long running work units

#299 Post by Beyond » 05.10.2011 15:43

Michael H.W. Weber wrote: To alleviate that lack of checkpointing, given that we are not running uppercase here but complex computations with really, really serious hardware and runtime requirements (RNA World is probably the "hardest" DC project on the planet), we have at least managed to take the time to establish a dual-application approach in which the very demanding tasks are sent exclusively to systems that opted in to XXL CMSEARCH, while all other devices receive the much less demanding S application.
Again, no cmsearch S WUs for a while. How about releasing some of the shorter cmsearch XXL WUs as cmsearch M? That would allow those of us who are running into the BOINC heartbeat "bug" and unreliable power companies to at least help out the project to a greater extent than we are currently able to.

Edit: How about several different WU levels: S, M, L, XL & XXL? I'm sure the project would see a big increase in WUs completed.

Michael H.W. Weber
Vereinsvorstand
Posts: 22431
Registered: 07.01.2002 01:00
Location: Marpurk

Re: Long running work unit

#300 Post by Michael H.W. Weber » 07.10.2011 18:54

Well, there should be some S WUs again and more should pour in, soon. :D
Instead of implementing additional M, XL, etc. applications (which might then also be a bit confusing for users to keep track of), it might be an idea to dynamically adjust what is fed into the S or XXL applications under special circumstances, such that, e.g., when very few short WUs are available, the shortest ones of the XXL application would be redirected into S.
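
A minimal sketch of how such a redirection rule could look, purely as an illustration and not the project's actual feeder code. It assumes each WU carries at least a rough runtime estimate (which, as noted earlier in this thread, cannot be computed exactly for CMSEARCH). The names `WorkUnit`, `est_hours`, `assign_queues`, `s_limit_hours` and `min_s_count` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnit:
    name: str
    est_hours: float  # hypothetical rough runtime estimate, not an exact value


def assign_queues(work_units, s_limit_hours=24.0, min_s_count=3):
    """Split work units into an S queue and an XXL queue.

    Units estimated at or below s_limit_hours go to S. If fewer than
    min_s_count units qualify, the shortest XXL units are redirected
    into S until the S queue holds min_s_count units or XXL is empty.
    """
    ordered = sorted(work_units, key=lambda wu: wu.est_hours)
    s_queue = [wu for wu in ordered if wu.est_hours <= s_limit_hours]
    xxl_queue = [wu for wu in ordered if wu.est_hours > s_limit_hours]

    # Number of units S is missing; zero when enough short WUs exist.
    shortfall = max(0, min_s_count - len(s_queue))
    s_queue.extend(xxl_queue[:shortfall])
    xxl_queue = xxl_queue[shortfall:]
    return s_queue, xxl_queue


# Example with made-up work units: only one is short enough for S by default.
units = [
    WorkUnit("wu_a", 5.0),
    WorkUnit("wu_b", 40.0),
    WorkUnit("wu_c", 300.0),
    WorkUnit("wu_d", 1000.0),
]
s_queue, xxl_queue = assign_queues(units)
print("S:  ", [wu.name for wu in s_queue])
print("XXL:", [wu.name for wu in xxl_queue])
```

With these made-up numbers, the two shortest XXL units move into S because only one unit falls under the limit. When enough short WUs are available, the shortfall is zero and nothing is redirected, so the user-facing choice stays limited to S and XXL.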

Michael.
