Long running work unit

Everything about the project RNA World
Peter Hucker
Mikrocruncher
Posts: 23
Joined: 19.08.2017 13:56

Re: Long running work unit

#865 Post by Peter Hucker » 19.08.2017 20:30

Gah! Windows 10 messes up again! I think I'll turn off automatic updates like I have on my main machine.

So have I destroyed the task? Or has what I've done so far been recorded on the server?

There seems to be a big problem with these long tasks: if you look at their history, they're constantly being aborted or ending in computation errors. I would hope the computation done so far is of some use?

Jacob Klein
Brain-Bug
Posts: 512
Joined: 26.07.2013 15:41

#866 Post by Jacob Klein » 19.08.2017 20:54

Peter Hucker wrote: Gah! Windows 10 messes up again! I think I'll turn off automatic updates like I have on my main machine.

So have I destroyed the task? Or has what I've done so far been recorded on the server?

There seems to be a big problem with these long tasks: if you look at their history, they're constantly being aborted or ending in computation errors. I would hope the computation done so far is of some use?
If your task is restarting at 0%, then the work you've done is likely lost. Sorry.

For Windows 10, I set the Windows Update "Active Hours" setting to "8am to 2am", then monitor the news for cumulative updates. On the day one is released, I exit BOINC, install the update, then restart. A good page to monitor might be:
https://support.microsoft.com/en-us/help/4018124

Another option might be to use the Windows Update setting to "Pause" updates for a week, then check updates manually once a week, while BOINC isn't running.

Keep in mind, however, that the VBoxWrapper log indicates something catastrophic happened, like a BSOD, so I'm not sure you can blame this one on an update. Event Viewer might have more info, such as "BugCheck" BSOD details.

From my experience, there is nothing wrong with the RNA World tasks themselves. These tasks are fragile, because BOINC's interaction with VirtualBox is fragile. And one hiccup could take out a year's worth of work. There are a few things the devs could do to fix this (to make VBoxWrapper more robust!), but right now they're understaffed.

Peter Hucker
Mikrocruncher
Posts: 23
Joined: 19.08.2017 13:56

#867 Post by Peter Hucker » 19.08.2017 21:10

I just disable the Windows Update service, so no updates ever occur until I check manually.

Well it (and the other machines) updated this morning, so I'm assuming the update caused the mess.

But I don't see why the checkpoints can't be used. The log you looked at of my task showed that it had saved successfully several times previously, so where has that data gone? You can't have a 100 day task and expect it to run for that length of time without problems, that's absurd! No wonder their tasks are never completing.

Jacob Klein
Brain-Bug
Posts: 512
Joined: 26.07.2013 15:41

#868 Post by Jacob Klein » 19.08.2017 21:26

If you'd like me to connect to your machine using TeamViewer, and take a look (to see if anything is reusable), I'd be willing. If so, send me a private message.

Peter Hucker
Mikrocruncher
Posts: 23
Joined: 19.08.2017 13:56

#869 Post by Peter Hucker » 19.08.2017 21:28

Can't be bothered sorting the firewall permissions on the router etc. If you want me to email you a file I could do that....

Jacob Klein
Brain-Bug
Posts: 512
Joined: 26.07.2013 15:41

#870 Post by Jacob Klein » 19.08.2017 21:36

I sent you a PM (private message).

Jacob Klein
Brain-Bug
Posts: 512
Joined: 26.07.2013 15:41

#871 Post by Jacob Klein » 19.08.2017 22:36

I took a quick look (Thanks Peter)... but it hadn't occurred to me that there'd be nothing to recover, as the task has been aborted. Sorry. Better luck next time.

Michael H.W. Weber
Association board
Posts: 20887
Joined: 07.01.2002 01:00
Location: Marpurk

#872 Post by Michael H.W. Weber » 19.08.2017 22:48

Peter Hucker wrote: But I don't see why the checkpoints can't be used.
It is because once a new checkpoint is written, it overwrites the previous one. If your machine crashes during this writing process, then everything is lost. It seems exactly this has happened in your case. It is quite improbable, but unfortunately it may happen...
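
For readers wondering what the alternative to overwriting in place usually looks like, here is a minimal, generic sketch of the write-then-rename pattern at the level of a single file. It is only an illustration under stated assumptions: the RNA World checkpoints are taken through the virtual machine, not through a plain file like this, and the names `save_checkpoint`, `load_checkpoint` and `checkpoint.json` are hypothetical and not taken from VBoxWrapper or BOINC.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state to path without ever leaving a half-written checkpoint.

    Hypothetical helper: a generic file-level pattern, not the mechanism
    used by VBoxWrapper or the RNA World application.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new checkpoint to a temporary file in the same directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to disk
        # Swap the new file into place only after it is completely written.
        os.replace(tmp_path, path)
    except BaseException:
        # Something went wrong: the previous checkpoint is still untouched,
        # so just discard the partial temporary file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read back a checkpoint written by save_checkpoint."""
    with open(path) as f:
        return json.load(f)
```

With this pattern a crash during the write leaves the old checkpoint readable, because the old file is only replaced once the new one is complete. Whether the same idea carries over cheaply to multi-gigabyte virtual machine snapshots is a separate question that this sketch does not answer.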

Regarding the many other "failures": most of these are tasks aborted by less patient people who are not aware that selecting the VM tasks places quite a demand on their machines. Also, we previously had no checkpoints at all, even for these long tasks, because the science application is written for HPC systems. Since this is quite ridiculous, we wrapped the virtual machine approach around that app to enable checkpoints, albeit in a somewhat awkward way. Technically, however, these long tasks now do very well. But indeed, automatic Windows updates should be disabled...

Michael.
Promote, cooperate and construct instead of demanding, competing and consuming.


Peter Hucker
Mikrocruncher
Posts: 23
Joined: 19.08.2017 13:56

#873 Post by Peter Hucker » 19.08.2017 22:58

Is there no way to make it finish saving the new checkpoint before deleting the previous one?

Also, why are the tasks so huge? Is there no way to split them up and hand parts to many users? That way you could get them completed much faster, and not run so much risk of losing them.

Also, when I said "why can't the checkpoints be used", what I actually meant was doesn't the server have a copy of where I got up to, so the unit can be handed to the next user half done?

Michael H.W. Weber
Association board
Posts: 20887
Joined: 07.01.2002 01:00
Location: Marpurk

#874 Post by Michael H.W. Weber » 19.08.2017 23:21

Peter Hucker wrote: Is there no way to make it finish saving the new checkpoint before deleting the previous one?
Well, apparently that is not that trivial...
Peter Hucker wrote: Also, why are the tasks so huge?
Because these tasks search exhaustively, meaning that the approximations that usually provide a tremendous speed-up are omitted.
Peter Hucker wrote: Is there no way to split them up and hand parts to many users?
Yes, with a few clever tricks, there is. But implementing this still requires some effort (it is on the to-do list).
Peter Hucker wrote: Also, when I said "why can't the checkpoints be used", what I actually meant was doesn't the server have a copy of where I got up to, so the unit can be handed to the next user half done?
No, in this respect the server does not actually know anything about what the clients are doing. It only receives a trickle-up message reporting how far the individual client has progressed, which it uses to auto-adjust the task's deadline on the server side (run-time prediction is actually very difficult for these types of calculations).
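
To make the deadline adjustment a little more concrete, here is a rough sketch of how a server could extrapolate a deadline from a reported progress fraction. This is purely an assumption for illustration: the thread does not describe the actual server-side formula, the function name `projected_deadline` and the `safety_factor` parameter are hypothetical, and the linear extrapolation used here is exactly the kind of simple prediction that tends to be unreliable for these calculations.

```python
from datetime import datetime, timedelta


def projected_deadline(started, now, fraction_done, safety_factor=1.5):
    """Extrapolate a deadline linearly from reported progress.

    Hypothetical helper, not the project's real server logic.
    fraction_done is the progress reported by the client, in (0, 1].
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    elapsed = now - started
    # Assume (optimistically) that progress is linear in time.
    estimated_total = elapsed / fraction_done
    remaining = estimated_total - elapsed
    # Pad the remaining time, since real progress is rarely linear.
    return now + remaining * safety_factor


# Example: 25% done after 10 days, so 30 days are estimated to remain,
# padded by the safety factor to 45 days.
deadline = projected_deadline(
    started=datetime(2017, 1, 1),
    now=datetime(2017, 1, 11),
    fraction_done=0.25,
)
```

Note that only the progress fraction travels to the server in this picture; none of the intermediate results do, which is why a half-finished task cannot simply be handed to another user.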

Michael.

Jacob Klein
Brain-Bug
Posts: 512
Joined: 26.07.2013 15:41

#875 Post by Jacob Klein » 22.08.2017 20:49


gemini8
Association member
Posts: 3908
Joined: 31.05.2011 10:30
Location: Hannover

#876 Post by gemini8 » 22.08.2017 21:06

...and naturally, it is searching for another one? ^^
Regards, Jens
- - - - - -
Low-end user and part-time cruncher
