Long running work unit

Everything about the project RNA World
Jacob Klein
Brain-Bug
Posts: 564
Joined: 26.07.2013 15:41

Re: Long running work unit

#841 Post by Jacob Klein » 15.06.2017 15:04

robertmiles wrote:
I have a task with an estimated remaining time over 90 days, months beyond the deadline.

http://www.rnaworld.de/rnaworld/result. ... d=14953356

[snip]
Progress rate suggests that it will run for about 150 days total.

Size of the *.vdi file may be a problem for users running on an SSD drive.

Having only one *.vdi file (which appears to be the checkpoint file) could make the task fail if some interruption occurs while the next checkpoint is being written. Most BOINC projects keep two checkpoint files, so that their applications can still resume if writing the latest checkpoint file was interrupted and therefore incomplete.

Installing the latest Windows 10 update (Creator's Edition) interferes with reading the *.vdi file, since Windows 10 then claims that no application is listed as able to open *.vdi files. Workaround - tell VirtualBox to ignore this problem when resuming from a checkpoint.

Progress rate, and estimation of completion, are completely useless for these tasks. The only way to even remotely estimate how long they'll take to complete is to compare against a wingman's completion time and account for the difference in processor speeds/capabilities... and even then, you don't know whether they were hyperthreading or not. Rough estimates of task completion times, based on my EXPERIENCE, are: 150-300 days on a fast processor, 250-550 days on a slow processor.

A checkpoint consists of both a .vdi file (which may be tiny) and a .sav file (which is huge). I also find it a bit strange that BOINC's implementation of the VBoxWrapper, only keeps 1 checkpoint for these tasks... but I also understand it. If each checkpoint consumes 700 MB of disk space, then (to some degree) it makes sense to keep 1 checkpoint and not 2.

Windows 10 Creator's Update doesn't interfere with anything, from my testing... except that it requires certain minimum versions of VirtualBox, that's all. In fact, why are you even trying to use VirtualBox Manager while a BOINC task is trying to complete? THAT can easily disturb the BOINC task (the VM can get locked, and VBoxWrapper might freak out). I'd use extreme caution before using VirtualBox Manager, for anything, while BOINC is processing VM tasks.

Michael H.W. Weber
Association board member
Posts: 22419
Joined: 07.01.2002 01:00
Location: Marpurk

Re: Long running work unit

#842 Post by Michael H.W. Weber » 16.06.2017 10:50

Actually, the only failures in RNA World currently known to me relate to a system crash or premature system shutdown while the checkpoint snapshot is being (over)written. So keeping the previous snapshot would probably indeed be helpful...
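
A minimal sketch of the "keep the previous snapshot" idea, assuming a checkpoint that is a single file. This is not how VBoxWrapper or VirtualBox handle snapshots; the helper names `write_checkpoint` and `load_checkpoint` and the file names are hypothetical and only illustrate the technique of writing to a temporary file, keeping the last good copy, and swapping atomically.

```python
import os
import tempfile


def write_checkpoint(directory: str, data: bytes, name: str = "checkpoint.bin") -> str:
    """Write a new checkpoint while keeping the previous one as a fallback.

    Hypothetical helper: file names are assumptions for illustration only.
    """
    current = os.path.join(directory, name)
    previous = current + ".prev"

    # Write to a temporary file in the same directory first, so an
    # interruption never leaves a half-written "current" checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the bytes to disk
        # Keep the last good checkpoint as the fallback copy.
        if os.path.exists(current):
            os.replace(current, previous)
        # Move the complete new file into place in one step.
        os.replace(tmp_path, current)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return current


def load_checkpoint(directory: str, name: str = "checkpoint.bin") -> bytes:
    """Return the newest checkpoint, falling back to the previous one."""
    current = os.path.join(directory, name)
    for path in (current, current + ".prev"):
        if os.path.exists(path):
            with open(path, "rb") as fh:
                return fh.read()
    raise FileNotFoundError("no checkpoint found")
```

The price of this approach is the one raised earlier in the thread: two checkpoints of roughly 700 MB each need about twice the disk space.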

Michael.
Promote, cooperate, and construct instead of demanding, competing, and consuming.


robertmiles
XBOX360-Installer
Posts: 86
Joined: 23.02.2010 18:43
Location: northern Alabama, US

Re: Long running work unit

#843 Post by robertmiles » 17.06.2017 01:58

Jacob Klein wrote:
robertmiles wrote:
I have a task with an estimated remaining time over 90 days, months beyond the deadline.

http://www.rnaworld.de/rnaworld/result. ... d=14953356

[snip]
Size of the *.vdi file may be a problem for users running on an SSD drive.

Having only one *.vdi file (which appears to be the checkpoint file) could make the task fail if some interruption occurs while the next checkpoint is being written. Most BOINC projects keep two checkpoint files, so that their applications can still resume if writing the latest checkpoint file was interrupted and therefore incomplete.
I've now noticed that there are two *.vdi files within the slot. The one I noticed previously is about 850 MB. The one with the *.sav file is much smaller.

Installing the latest Windows 10 update (Creator's Edition) interferes with reading the *.vdi file, since Windows 10 then claims that no application is listed as able to open *.vdi files. Workaround - tell VirtualBox to ignore this problem when resuming from a checkpoint.
[snip]

A checkpoint consists of both a .vdi file (which may be tiny) and a .sav file (which is huge). I also find it a bit strange that BOINC's implementation of the VBoxWrapper, only keeps 1 checkpoint for these tasks... but I also understand it. If each checkpoint consumes 700 MB of disk space, then (to some degree) it makes sense to keep 1 checkpoint and not 2.

Windows 10 Creator's Update doesn't interfere with anything, from my testing... except that it requires certain minimum versions of VirtualBox, that's all. In fact, why are you even trying to use VirtualBox Manager while a BOINC task is trying to complete? THAT can easily disturb the BOINC task (the VM can get locked, and VBoxWrapper might freak out). I'd use extreme caution before using VirtualBox Manager, for anything, while BOINC is processing VM tasks.
I had to use VirtualBox Manager for this - the task wouldn't resume otherwise.

No wingman had completed this workunit successfully, so I couldn't use a wingman's completion time for anything.

Re: Long running work unit

#844 Post by Jacob Klein » 17.06.2017 02:18

Well, be sure to IGNORE all of the following:
- the progress percentage
- the estimated time remaining
- the work unit's estimated runtime on reference system

All those above are, from my experience, essentially worthless.

If the work unit is using CPU, and progress.txt is either increasing or is at 0.98765 but still being touched every 30 seconds... then the task is likely okay.
With regard to completion estimation, only the wingman completion time matters, and even that involves a lot of interpolation just to arrive at a super rough guess.
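
The "still being touched" check can be sketched roughly as follows. This is only an illustration: the path to progress.txt and the 120-second tolerance are assumptions (the post only says the file is touched about every 30 seconds), `task_looks_alive` is a hypothetical helper, and the sketch does not check CPU usage.

```python
import os
import time
from typing import Optional


def task_looks_alive(progress_path: str,
                     max_age_seconds: float = 120.0,
                     now: Optional[float] = None) -> bool:
    """Return True if the progress file was modified recently.

    max_age_seconds is an assumed tolerance, a few multiples of the
    roughly 30-second touch interval mentioned in the thread.
    """
    if not os.path.exists(progress_path):
        return False
    if now is None:
        now = time.time()
    age = now - os.path.getmtime(progress_path)
    return age <= max_age_seconds
```

Usage would be something like `task_looks_alive("/path/to/slot/shared/progress.txt")`, with the path adjusted to wherever the file lives on the machine in question.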

There truly are only monsters left here. I have 18 of 151.

Re: Long running work unit

#845 Post by Michael H.W. Weber » 17.06.2017 12:44

Jacob Klein wrote:
There truly are only monsters left here. I have 18 of 151.
...well, I just don't get any of these, although my machines repeatedly ask for "stress". :D

Regarding the total run time estimates:
It is NOT the case that these are numbers totally out of the blue. If you run the tasks manually on the command line, the estimate is quite OK. The problem, however, is that the estimates participants get are generated on the RNA World server. They are then adapted to the specs of the actual client machine, and this conversion apparently isn't that good yet.
Another point is that although the estimates are OK for default tasks, they may be worse for parameterized tasks, which are all those extremely long-running ones that we pack into VirtualBox. I cannot check these by hand on the command line because of their extreme duration.
From my own experience, a good rule of thumb for AMD machines (Ryzen CPUs not included so far) is that the true run time is about double what the server initially assigns (and the client shows).

The progress bar indeed has little reliability.

Michael.

Re: Long running work unit

#846 Post by Jacob Klein » 17.06.2017 13:29

Michael H.W. Weber wrote:
Jacob Klein wrote:
There truly are only monsters left here. I have 18 of 151.
...well, I just don't get any of these, although my machines repeatedly ask for "stress". :D
It is likely that I have configured my machines to ask for work 20+ times more frequently than yours. If you want details, PM me.
Michael H.W. Weber wrote:
Regarding the total run time estimates:
It is NOT the case that these are numbers totally out of the blue.
[snip]
I never said they were pulled out of the blue. But I can shed a little more light on this, I think.

- Infernal (the program used to do the work) does support giving a "forecast" (estimate) of how long the current work item's workload will take on the PC, running at its current load. This may be reliable for small forecasts, but it is definitely quite unreliable for large ones.
- The first thing the VM tasks do is run the forecast operations, before doing the computation operations. You can see this in Christian's "boinc_app" script file (copy it from a task's shared folder to some other folder, then look at it). It runs the forecast first; see where it populates cpu_forecast.
- Christian also pads the forecast a bit, due to its unreliability. There's a section in the script that does: #if forecast is over 250h add another 25% to be sure not to underestimate... But clearly it still underestimates by a TON, since, according to my logs, the biggest tasks usually sit at 98.765% for about 40-75% of their total duration.
- When you click "Show VM Console" and see "forecast: x sec", I believe x is the "after-padded" forecast value.
- If you click on a work unit on the web and see an "estimated runtime on reference system", I believe that is the "after-padded" forecast value, as run/forecast on the "reference system", which has a hyperlink you can click to see that it is an "AMD Athlon(tm) II X2 250 Processor".
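
The padding rule quoted from the script comment can be sketched like this. It models only that single rule (over 250 h, add 25%); the real "boinc_app" script is not reproduced here, and the function name and parameters are hypothetical.

```python
def pad_forecast(forecast_hours: float,
                 threshold_hours: float = 250.0,
                 extra_fraction: float = 0.25) -> float:
    """Add 25% to forecasts above 250 hours, as the quoted comment describes.

    Only this one rule is modeled; any other padding the real script
    applies is unknown and left out.
    """
    if forecast_hours > threshold_hours:
        return forecast_hours * (1.0 + extra_fraction)
    return forecast_hours
```

For example, a raw forecast of 400 hours would become 500 hours, while a 100-hour forecast would pass through unchanged.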

Long story short:
Don't rely on any of this forecasting, nor progress percentage, nor estimated time remaining. You can ONLY (sort of, barely) rely on wingman completion times, with a bit of interpolation (differing CPU abilities) and guesswork (were systems hyperthreaded?). Fun.


Regards,
Jacob

Re: Long running work unit

#847 Post by Michael H.W. Weber » 17.06.2017 14:18

Yes, the longer the run, the less reliable the forecast. That is in line with the practical observation I stated above, that the VM tasks in particular are the most wildly underestimated.

I was actually wondering whether it would be possible to systematically extract data from WUProp in order to develop a better, empirical estimate function. What do you think about this?

Michael.

Re: Long running work unit

#848 Post by Jacob Klein » 17.06.2017 14:24

I don't think that's a great idea, as any "general estimate" that you derive from WUProp won't be based on the actual work unit that you're working on. No offense.

I continue to think that the best approach to estimation is to record the wingman's completion time and CPU type, then load the Excel "estimator" spreadsheet that I created, located here:
viewtopic.php?f=75&t=16160#p164693
... and plug in the correct values for wingman time, wingman CPU, and your CPU.

Even then, it's a rough estimate (that may not be applicable if the wingman's CPU is from a different manufacturer)... but at least it's based on the actual work unit you have that a wingman completed, which is the best we can do.
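
As a rough illustration of the scaling idea (not the formula in the spreadsheet, which is not shown in this thread), one could assume that runtime scales inversely with some per-core speed score. The function name and the speed scores are hypothetical; any consistent benchmark figure for both CPUs would do.

```python
def estimate_from_wingman(wingman_days: float,
                          wingman_speed: float,
                          my_speed: float) -> float:
    """Scale a wingman's completion time by the ratio of CPU speed scores.

    Assumption: runtime is inversely proportional to the speed score.
    Hyperthreading, other load, and CPU architecture are ignored.
    """
    if wingman_days <= 0 or wingman_speed <= 0 or my_speed <= 0:
        raise ValueError("all inputs must be positive")
    return wingman_days * wingman_speed / my_speed
```

With a wingman who needed 300 days and a local CPU assumed to be 1.5 times as fast per core, this gives about 200 days, which is still only a very rough guess.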

Also, I think Christian should have edited the script, to not lock it at 98.765, and instead allow it to progress. Even progressing at a rate of 0.001 per hour after 90%, would help the user understand that the task is still running and recording progress, and allow for noticeable progression 416 days out. My longest runner was longer, though, at 555.9 days CPU time. So, maybe use 0.001 per hour after 80%, allowing for 830 days out. :) But alas, he hasn't built that yet. :(
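
The day counts above follow from simple arithmetic if the rate is read as 0.001 percentage points per hour, which is the reading that reproduces the 416 and roughly 830 day figures. A small sketch, with a hypothetical function name:

```python
def days_of_headroom(start_percent: float,
                     rate_percent_per_hour: float = 0.001) -> float:
    """Days until 100% when progress creeps at a fixed rate from start_percent.

    The rate is taken as percentage points per hour (an interpretation
    that matches the numbers in the post).
    """
    remaining_percent = 100.0 - start_percent
    hours = remaining_percent / rate_percent_per_hour
    return hours / 24.0


print(round(days_of_headroom(90.0), 1))  # slow phase starting at 90%
print(round(days_of_headroom(80.0), 1))  # slow phase starting at 80%
```

Starting the slow phase at 90% leaves 10,000 hours of headroom, and starting at 80% leaves 20,000 hours, which would cover the 555.9-day runner mentioned above.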

Re: Long running work unit

#849 Post by Jacob Klein » 06.07.2017 22:59

Racer-X just completed another long-runner, his longest successful task so far!
318.8d CPU Time
:smoking:

http://www.rnaworld.de/rnaworld/result. ... d=14952076

gemini8
Association board member
Posts: 5898
Joined: 31.05.2011 10:30
Location: Hannover

Re: Long running work unit

#850 Post by gemini8 » 07.07.2017 05:47

Omedeto! (Congratulations!)
Regards, Jens
- - - - - -
Low-end user and part-time cruncher


Re: Long running work unit

#851 Post by robertmiles » 16.07.2017 13:53

robertmiles wrote:
Jacob Klein wrote:
[snip]

Installing the latest Windows 10 update (Creator's Edition) interferes with reading the *.vdi file, since Windows 10 then claims that no application is listed as able to open *.vdi files. Workaround - tell VirtualBox to ignore this problem when resuming from a checkpoint.
[snip]

A checkpoint consists of both a .vdi file (which may be tiny) and a .sav file (which is huge). I also find it a bit strange that BOINC's implementation of the VBoxWrapper, only keeps 1 checkpoint for these tasks... but I also understand it. If each checkpoint consumes 700 MB of disk space, then (to some degree) it makes sense to keep 1 checkpoint and not 2.

[snip]
I had to use VirtualBox Manager for this - the task wouldn't resume otherwise.

No wingman had completed this workunit successfully, so I couldn't use a wingman's completion time for anything.
Now up to 27.730% progress, 32d 08:42:48 elapsed, 65d 12:05:23 estimated remaining.

It still looks like these numbers are based on an assumption that the total runtime will be about 120 days, but there is nothing definite about the accuracy of the numbers the application feeds into these calculations.
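
The figures above can be cross-checked with a short sketch. It assumes progress is linear in time, which the rest of this thread suggests is not reliable for these tasks, so the result is only a sanity check on the displayed numbers. The function name is hypothetical.

```python
from datetime import timedelta


def linear_total(elapsed: timedelta, progress_percent: float) -> timedelta:
    """Extrapolate total runtime, assuming progress is linear in time."""
    return elapsed / (progress_percent / 100.0)


elapsed = timedelta(days=32, hours=8, minutes=42, seconds=48)
remaining = timedelta(days=65, hours=12, minutes=5, seconds=23)

# Total implied by a straight-line extrapolation of the progress value.
extrapolated_days = linear_total(elapsed, 27.730).total_seconds() / 86400

# Total implied by the client's own "elapsed + remaining" display.
displayed_days = (elapsed + remaining).total_seconds() / 86400

print(f"linear extrapolation: {extrapolated_days:.1f} days")
print(f"elapsed + remaining:  {displayed_days:.1f} days")
```

The straight-line extrapolation comes out at roughly 117 days, close to the "about 120 days" mentioned, while elapsed plus displayed remaining time adds up to roughly 98 days, so the two displayed values are not consistent with each other.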

Note that this application does not run at full speed on a computer that also runs the multithreaded application from Cosmology@Home, since that application grabs all the CPU cores that are not already reserved for something other than BOINC CPU applications.

Re: Long running work unit

#852 Post by Michael H.W. Weber » 16.07.2017 16:21

You had better not combine those...

Michael.