Long running work unit
-
- Mikrocruncher
- Posts: 30
- Registered: 19.08.2017 13:56
Re: Long running work unit
Gah! Windows 10 messes up again! I think I'll turn off automatic updates like I have on my main machine.
So have I destroyed the task? Or has what I've done so far been recorded on the server?
There seems to be a big problem with these long tasks, if you look at the history of them, they're constantly being aborted or ending in computation errors. I would hope the computation done so far is of some use?
-
- Brain-Bug
- Posts: 564
- Registered: 26.07.2013 15:41
Re: Long running work unit
Peter Hucker wrote: Gah! Windows 10 messes up again! I think I'll turn off automatic updates like I have on my main machine. So have I destroyed the task? Or has what I've done so far been recorded on the server? There seems to be a big problem with these long tasks, if you look at the history of them, they're constantly being aborted or ending in computation errors. I would hope the computation done so far is of some use?
If your task is restarting at 0%, then the work you've done is likely lost. Sorry.
For Windows 10, I set the Windows Update "Active Hours" setting to "8am to 2am" and then monitor the news for cumulative updates. On the day one is released, I exit BOINC, install the update, and then restart. A good page to monitor might be:
https://support.microsoft.com/en-us/help/4018124
Another option might be to use the Windows Update setting to "Pause" updates for a week, then check updates manually once a week, while BOINC isn't running.
Keep in mind, however, that the VBoxWrapper log indicates something catastrophic happened, like a BSOD, so I'm not sure you can blame this one on an update. Event Viewer might have more info, like "BugCheck" BSOD details.
From my experience, there is nothing wrong with the RNA World tasks themselves. These tasks are fragile, because BOINC's interaction with VirtualBox is fragile. And one hiccup could take out a year's worth of work. There are a few things the devs could do to fix this (to make VBoxWrapper more robust!), but right now they're understaffed.
-
- Mikrocruncher
- Posts: 30
- Registered: 19.08.2017 13:56
Re: Long running work unit
I just disable the Windows Update service; then no updates ever occur until I check manually.
Well it (and the other machines) updated this morning, so I'm assuming the update caused the mess.
But I don't see why the checkpoints can't be used. The log you looked at of my task showed that it had saved successfully several times previously, so where has that data gone? You can't have a 100 day task and expect it to run for that length of time without problems, that's absurd! No wonder their tasks are never completing.
-
- Brain-Bug
- Posts: 564
- Registered: 26.07.2013 15:41
Re: Long running work unit
If you'd like me to connect to your machine using TeamViewer, and take a look (to see if anything is reusable), I'd be willing. If so, send me a private message.
-
- Mikrocruncher
- Posts: 30
- Registered: 19.08.2017 13:56
Re: Long running work unit
Can't be bothered sorting the firewall permissions on the router etc. If you want me to email you a file I could do that....
-
- Brain-Bug
- Posts: 564
- Registered: 26.07.2013 15:41
Re: Long running work unit
I sent you a PM (private message).
-
- Brain-Bug
- Posts: 564
- Registered: 26.07.2013 15:41
Re: Long running work unit
I took a quick look (Thanks Peter)... but it hadn't occurred to me that there'd be nothing to recover, as the task has been aborted. Sorry. Better luck next time.
- Michael H.W. Weber
- Association board member
- Posts: 22431
- Registered: 07.01.2002 01:00
- Location: Marpurk
Re: Long running work unit
Peter Hucker wrote: But I don't see why the checkpoints can't be used.
This is because each new checkpoint overwrites the previous one. If your machine crashes while the checkpoint is being written, everything is lost. It seems exactly this has happened in your case. It is quite improbable, but unfortunately it can happen...
Regarding the many other "failures": most of these are tasks aborted by less patient people who are not aware that selecting the VM tasks poses quite a challenge for their machines. Also, we previously had no checkpoints at all, even for these long tasks, simply because the science application is written for HPC systems. Since this is quite ridiculous, we wrapped the virtual machine approach around that app to enable checkpoints, albeit in a somewhat awkward way. Technically, however, these long tasks now do very well. But indeed, automatic Windows updates should be disabled...
Michael.
Support, cooperate and construct instead of demanding, competing and consuming.
-
- Mikrocruncher
- Posts: 30
- Registered: 19.08.2017 13:56
Re: Long running work unit
Is there no way to make it finish saving the new checkpoint before deleting the previous one?
Also, why are the tasks so huge? Is there no way to split them up and hand parts to many users? That way you could get them completed much faster, and not run so much risk of losing them.
Also, when I said "why can't the checkpoints be used", what I actually meant was doesn't the server have a copy of where I got up to, so the unit can be handed to the next user half done?
- Michael H.W. Weber
- Association board member
- Posts: 22431
- Registered: 07.01.2002 01:00
- Location: Marpurk
Re: Long running work unit
Peter Hucker wrote: Is there no way to make it finish saving the new checkpoint before deleting the previous one?
Well, apparently that is not that trivial...
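For a single checkpoint file, the pattern asked about here is usually "write the new copy to a temporary file, flush it to disk, then rename it over the old one". The Python sketch below illustrates only that generic idea. It is not the project's code, and it does not model how VBoxWrapper or VirtualBox store their checkpoints, which is presumably where the non-trivial part lies. The names `save_checkpoint` and `load_checkpoint` and the JSON state are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path so that a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new checkpoint to a temporary file in the same directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the data to disk before the swap
        # Swap the new checkpoint in; the old one is only replaced here.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, discard the partial file; the old checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the last complete checkpoint, or None if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

With this approach a crash during the write leaves at worst a stray temporary file, and the previous checkpoint remains readable. How strong the guarantee is in practice depends on the operating system and filesystem.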
Peter Hucker wrote: Also, why are the tasks so huge?
Because these tasks search exhaustively, meaning that the approximations that usually provide the tremendous acceleration are omitted.
Peter Hucker wrote: Is there no way to split them up and hand parts to many users?
Yes, with a few clever tricks, there is. But implementing this still requires some effort (it is on the to-do list).
Peter Hucker wrote: Also, when I said "why can't the checkpoints be used", what I actually meant was doesn't the server have a copy of where I got up to, so the unit can be handed to the next user half done?
No, in this respect the server does not actually know anything about what the clients are doing. It only receives a trickle-up message reporting how far the individual client has progressed, which it uses to auto-adjust the task's deadline on the server side (run time prediction is actually very difficult for these types of calculations).
Michael.
-
- Brain-Bug
- Posts: 564
- Registered: 26.07.2013 15:41
Re: Long running work unit
...and naturally, it is searching for another one? ^^