Waste of resources.

Everything about the project RNA World
adrianxw

Waste of resources.

#1 Post by adrianxw » 31.12.2012 14:13

This WU made me raise an eyebrow. It ran for a good amount of time (42,637 seconds, nearly 12 hours in my case), then junked out. Normally I'd shrug my shoulders and say fair enough, dodgy WU, but then I looked to see whether it had been sent to anyone who had finished it. I was really surprised by what I saw. That WU has cost a HUGE number of people a HUGE number of crunching hours that could have gone to useful work.

This MUST be addressed or people will dump the project.

yoyo
Association board member
Posts: 7805
Joined: 17.12.2002 14:09
Location: Berlin

Re: Waste of resources.

#2 Post by yoyo » 31.12.2012 15:56

I will run this wu in a separate box.
yoyo
HELP out in the Rechenkraft wiki, here is what needs doing.
Wiki - FAQ - Association - Chat


Re: Waste of resources.

#3 Post by yoyo » 01.01.2013 15:49

The WU is still running on my system and has now consumed nearly 24 h of CPU time. As you can see, it uses 1.9 GB of RAM.
So it has now run longer than on any of the systems where it was aborted. I think it aborted because those systems had too little memory. All of them have less than 4 GB RAM.

My running cmsearch:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
24463 boincadm  20   0 1915m 1.9g  916 R  100 15.9  1437:37 cmsearch
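
For anyone who wants to check the "nearly 24 h" figure against the TIME+ column above, here is a minimal Python sketch. It assumes the value has the minutes:seconds form shown in this output (top can switch to other formats for very large values), and the function name is made up for illustration.

```python
def top_time_to_hours(time_plus: str) -> float:
    """Convert a top TIME+ value of the form minutes:seconds[.hundredths] to hours.

    Assumption: exactly one colon, as in the "1437:37" value quoted above.
    """
    minutes, seconds = time_plus.split(":")
    total_seconds = int(minutes) * 60 + float(seconds)
    return total_seconds / 3600.0


if __name__ == "__main__":
    # 1437 minutes and 37 seconds of CPU time, taken from the top line above
    print(f"{top_time_to_hours('1437:37'):.2f} h")
```

The result comes out just under 24 hours, which matches the CPU time mentioned in the post.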

yoyo

Ananas
WU pusher
Posts: 1184
Joined: 27.04.2008 18:37
Location: Nordlichter Köln

Re: Waste of resources.

#4 Post by Ananas » 01.01.2013 22:33

Why does it have two applications?

16 Dec 2012 and before : cmsearch XXL (long) 1.0.2 v0.31

22 Dec 2012 and after : cmsearch S (short) 1.0.2 v0.31
vi BOINC/checkin_notes
:1,$s/bug/feature/g
:wq!

Do biologists actually tell each other small-RNA jokes?


Re: Waste of resources.

#5 Post by yoyo » 02.01.2013 09:56

Because I changed some unfinished short XXL WUs to S.
yoyo

adrianxw

Re: Waste of resources.

#6 Post by adrianxw » 02.01.2013 16:44

This one looks like it is going the same way as well...

Conan
XBOX360-Installer
Posts: 79
Joined: 11.02.2010 11:09

Re: Waste of resources.

#7 Post by Conan » 19.02.2013 23:04

This work unit had been running fine until I got up this morning and found that it had restarted for no apparent reason.

I am using 64-bit Linux and the system did not restart. It has 4 GB RAM, and there was no error to suggest that memory was the problem.

The WU had 111.51 hours on it at midnight and made a "Sending Scheduler Request: Requested By Project" at 12:11.
At an estimated time of 12:21 another volunteer returned a result that made quorum.
My computer sent another "Sending Scheduler Request: Requested By Project" at 05:11.
My WU restarted itself at 05:14.
At this point it would have had over 118 hours clocked against this WU and was about 30% complete.

Looking at the timeline, it appears that on the second contact with the project the server informed my client that quorum had been met, and that either the project then tried to abort the WU or a signal sent on server contact made the WU restart.

Is this a possible scenario?

It seems a big coincidence that, after quorum was made and my computer contacted the project server, my WU failed and restarted with no error messages at my end: just contact server, then restart.

Losing 118 hours peeved me greatly, and after finding that quorum had been met I have now aborted that WU, as there is no point running it any more.

The other thing is that when a WU restarts you lose all record of what you have already done.
In this case the WU would report 2.43 hours of run time, not the 118 hours I actually did. A number of other projects do this as well, not just RNA World. Usually a BOINC restart or a computer reboot triggers it, and the lack of checkpointing loses the information.

Why can't the run time or CPU time record be retained on restart, even if the WU does start from the beginning again? The restart information kept in the WU when it reports would also show the project where WUs are not running smoothly, and might enable updates that fix problems before volunteers start complaining.
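
To make the request concrete, here is a minimal Python sketch of keeping a cumulative run time across restarts, separate from the task's own progress. It is not how BOINC or RNA World account for time; the file name, the JSON layout and all function names are assumptions made up for illustration.

```python
import json
import os
import tempfile


def load_elapsed(path: str) -> float:
    """Return the run time (in seconds) saved by earlier attempts, or 0.0 if none."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return float(json.load(fh)["elapsed_seconds"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        return 0.0


def save_elapsed(path: str, elapsed_seconds: float) -> None:
    """Write the total to a temporary file, then swap it in with one rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"elapsed_seconds": elapsed_seconds}, fh)
    os.replace(tmp_path, path)


def add_session(path: str, session_seconds: float) -> float:
    """Add one session's run time to the stored total and return the new total."""
    total = load_elapsed(path) + session_seconds
    save_elapsed(path, total)
    return total
```

With something like this, a task that starts its computation from zero after a restart could still report the time spent on earlier attempts, because the total lives in its own small file rather than in the task state.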

WUProp@Home records information about various applications such as run time, memory usage, application name and Project name.
It has started a Badge system that tracks run time per application, you would need 20 applications clocking up 100 hours run time each to get your first badge (Bronze).
I have found that even though my failed WU on RNA World only reported 2.43 Hours to the project and lost all other time spent on that work unit, WUProp@Home has recorded all 118 Hours (my time jumped from 81 hours to 200 hours of reported work spent on RNA World).

This is why I ask whether that information can be retained in the WU to show how much work you have put into a project's applications. It would give a more accurate picture of the resources invested in a project.

Conan

Michael H.W. Weber
Association board member
Posts: 20463
Joined: 07.01.2002 01:00
Location: Marpurk

Re: Waste of resources.

#8 Post by Michael H.W. Weber » 20.02.2013 09:17

When did you last reset the project? To me it sounds a bit like the well-known "Berkeley-caused heart beat bug" which we have fixed ourselves only recently - at least in the RNA World code. But maybe Yoyo and/or Ananas have another idea?

Michael.
Support, cooperate and construct instead of demanding, competing and consuming.

http://signature.statseb.fr I: Broken page A
http://signature.statseb.fr II: Broken page B


Re: Waste of resources.

#9 Post by Conan » 20.02.2013 12:20

Michael H.W. Weber wrote: When did you last reset the project? To me it sounds a bit like the well-known "Berkeley-caused heart beat bug" which we have fixed ourselves only recently - at least in the RNA World code. But maybe Yoyo and/or Ananas have another idea?

Michael.

G'Day Michael,

I have not reset this project on any computer that I own. If this is the suggested fix I can reset no problem as I don't have any work units from RNA World at the moment (on that computer at least), so I will give it a go.

Conan


Re: Waste of resources.

#10 Post by Conan » 21.02.2013 05:00

I have had this WU run up to about 68 hours, then restart, run back up to about 68 hours (about 2% further in the processing: around 20% the first time, then around 22% the second time), and start again.

Each time it has been due to a computer reboot.
My other computers have not rebooted, so it has not been a power problem. My other Windows machine has not rebooted either, so the problem lies with this machine.
From what I can find, each time BOINC may have caused the computer to reboot due to an MD5 checksum issue.
It has to do with fmah (fightmalaria@home): the MD5 checksum is run against every file in the fmah directory (of which there are dozens), and apparently it fails.
This seems to cause BOINC to give up and somehow reboot the computer.
Why it is doing this I don't know, as I am not even running fightmalaria.
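
For readers unfamiliar with this kind of check, here is a minimal Python sketch of verifying the MD5 checksum of every file in a directory against a list of expected values. It only illustrates the general idea and is not the BOINC client's actual code; the function names and the `expected` mapping are assumptions made up for this example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory: Path, expected: dict[str, str]) -> list[str]:
    """Return the names of files that are missing or whose MD5 differs.

    `expected` maps a file name to the hex digest it should have
    (a hypothetical stand-in for whatever list the project server provides).
    """
    bad = []
    for name, want in sorted(expected.items()):
        file_path = directory / name
        if not file_path.is_file() or md5_of_file(file_path) != want:
            bad.append(name)
    return bad
```

A single corrupt or truncated file would show up in the returned list, which is one way a check over dozens of files could fail on one machine and pass on another.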

I am not going to run this WU a third time only to have it fail again, so I have aborted it.
The system only says I have 1,100 seconds of work on this WU, not the 136 hours I have invested.
I have also detached from fightmalaria to see if it stops this issue.

My other Windows machine is also Win XP 32-bit and also attached to fightmalaria, but it is not having the issue and has progressed to over 80 hours on its current work unit.
So I don't know why this machine won't play the game. Possibly a corrupt file that only pops up on long-running work units that don't checkpoint during a BOINC security check?
My longest work unit prior to this was 25 hours on an ECM YOYO work unit.

Anyway, I don't know if anyone else has struck this problem.
I have removed fightmalaria for now and may have to update the BOINC client if this issue arises again.

I will try another WU and see what happens.

Conan


Re: Waste of resources.

#11 Post by Michael H.W. Weber » 22.02.2013 12:38

Ok, but please first reset the RNA World project once to ensure you are really getting the latest code in which the heart beat bug has been fixed. :wink:

Michael.

barblovesroses
Idle collector
Posts: 3
Joined: 22.04.2013 02:40

Re: Waste of resources.

#12 Post by barblovesroses » 22.04.2013 04:08

I seem to have the same type of problem that Conan has been discussing here. Tonight I aborted a job that had no chance of finishing before the deadline. It had restarted at least twice. It was approximately 120 hours into the job but still had at least 2000 hours to go, and since I read in one of the threads here that there would be no credit anyway if the deadline came and went, it just wasn't worth considering.

I am not sure that I have ever been able to complete any of the longer jobs (anything running over 500 hours), because there is no checkpointing and the jobs start over from 0 if you have to reboot your computer for any reason. It's highly frustrating to me that there isn't some way for these jobs to be saved on our systems or to have periodic checkpoints.
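
To illustrate what periodic checkpointing means in general, here is a minimal Python sketch of a long loop that saves its progress at intervals and resumes from the last saved state after a restart. It is not how cmsearch or BOINC work internally; the state file layout, the summing workload and the function name are assumptions made up for this example.

```python
import json
import os


def run_with_checkpoints(items, state_path, every=1000):
    """Sum `items`, saving progress every `every` items so a restart can resume.

    The sum stands in for real work; only the save/resume pattern matters here.
    """
    state = {"next_index": 0, "partial_sum": 0}
    if os.path.exists(state_path):
        # Resume from the last checkpoint instead of starting at 0.
        with open(state_path, "r", encoding="utf-8") as fh:
            state = json.load(fh)

    for i in range(state["next_index"], len(items)):
        state["partial_sum"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            tmp_path = state_path + ".tmp"
            with open(tmp_path, "w", encoding="utf-8") as fh:
                json.dump(state, fh)
            # Swap the finished file into place in a single rename.
            os.replace(tmp_path, state_path)
    return state["partial_sum"]
```

After a reboot, at most the work done since the last checkpoint is lost, rather than everything since the start of the job.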

For the time being I have set my computer not to take on any further jobs, even though the research that you are doing is vital. If I can't finish the work, why should I spend hundreds of hours crunching? It's a waste of my computer's time and of the energy it uses.

If you can change your reporting so that some of these issues get fixed, I'm more than willing to do work for this project again. I'm sure others would love to as well, since you have probably lost other crunchers from this project for these same reasons.
