Long running work unit

Everything about the project RNA World
mister.marmot
Mikrocruncher
Posts: 27
Joined: 15.07.2016 00:53

Re: Long running work unit

#1081 Post by mister.marmot » 01.11.2020 18:22

Jacob Klein wrote:
01.11.2020 03:40
I strongly advise you to not attempt an OS/Data migration, if you want the work unit to survive.
I've been through many migrations and have installed BOINC on the data drive in portable format, with no dependencies on the Windows registry.
The machine name will not change.
This is my main machine, and the OS and data folders have been steadily migrated since 1998, with apps from Windows 98 that still function because the needed libraries are still available (although I use none of those now).

This laptop's LCD lamp is dying and the screen has gone dark on the right side. I put on an external monitor. The other laptop is down to one cooling fan.
These machines need to be replaced.
I'll put it off as long as possible, but I cannot wait another year.

Jacob Klein
Brain-Bug
Posts: 515
Joined: 26.07.2013 15:41

Re: Long running work unit

#1082 Post by Jacob Klein » 01.11.2020 22:36

Okay. Do whatever you think is correct. These fragile tasks require as few interruptions as possible, especially large disruptive ones! :) Good luck.

Jacob Klein
Brain-Bug
Posts: 515
Joined: 26.07.2013 15:41

Re: Long running work unit

#1083 Post by Jacob Klein » 15.11.2020 23:03

YES!! Nitro has just completed another long-runner!

WU 6330822: https://www.rnaworld.de/rnaworld/workun ... id=6330822
Task 14954587: https://www.rnaworld.de/rnaworld/result ... d=14954587

442.1d CPU Time :smoking:

This project still has 30 more tasks out there - KEEP THEM SAFE WHILE YOU CRUNCH THEM!

Thanks,
Jacob Klein

gemini8
Association member
Posts: 4050
Joined: 31.05.2011 10:30
Location: Hannover

Re: Long running work unit

#1084 Post by gemini8 » 16.11.2020 00:01

Cool!
Keep finishing them off! ;-)
Regards, Jens
- - - - - -
Low-end user and part-time cruncher


Michael H.W. Weber
Association board member
Posts: 20986
Joined: 07.01.2002 01:00
Location: Marpurk

Re: Long running work unit

#1085 Post by Michael H.W. Weber » 16.11.2020 07:52

Very nice. :D

Michael.
Promote, cooperate, and construct instead of demanding, competing, and consuming.

http://signature.statseb.fr I: Broken page A
http://signature.statseb.fr II: Broken page B


mister.marmot
Mikrocruncher
Posts: 27
Joined: 15.07.2016 00:53

Re: Long running work unit

#1086 Post by mister.marmot » 09.12.2020 09:05

My laptop had a blue screen of death from a video driver malfunction.

Task 14954628 (376 days) shows a computation error. http://www.rnaworld.de/rnaworld/result. ... d=14954628

Does anyone know a way to purge the computation error on the WU and start from the last save point?
I should be able to recreate a VM, attach it to the VDI disk in the folder, and then set a save point with the XML in the folder.
Can the dev walk me through this over a chat?

It's horrible to lose 376 days of work when it's maybe 70 days from completion.

I have BOINC paused on this machine so that it does not report the WU until I have a specialist attempt a repair with me.

Jacob Klein
Brain-Bug
Posts: 515
Joined: 26.07.2013 15:41

Re: Long running work unit

#1087 Post by Jacob Klein » 09.12.2020 09:31

Do you have a backup of the VM, either in VirtualBox or as a copy of the slots folder? If not, then I probably can't help. PS: I'm not a dev for this project, but I know a lot about VirtualBox and how this project uses it.

mister.marmot
Mikrocruncher
Posts: 27
Joined: 15.07.2016 00:53

Re: Long running work unit

#1088 Post by mister.marmot » 09.12.2020 10:26

Jacob Klein wrote:
09.12.2020 09:31
Do you have a backup of the VM, either in VirtualBox or as a copy of the slots folder? If not, then I probably can't help. PS: I'm not a dev for this project, but I know a lot about VirtualBox and how this project uses it.

I just used data recovery to grab slot 4, which had files dated 1:44am (a few minutes before the crash):
result_1.dat
progress
check.dat
boinc_lockfile

but there is no VBox *.xml file in that slot.
Is that xml file located somewhere else?
There is a cms_vbox_job_1.17.xml in the project folder, plus slideshow_01 to slideshow_08, timestamped right at the crash time.

Also recovered 2 data files in the rnaworld project directory:
cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Indica-Group_CM000132.lin.EMBL_RF00028_Intron_gpI_1349111823_43932-para.zip
cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Indica-Group_CM000132.lin.EMBL_RF00028_Intron_gpI_1349111823_43932-in.zip

There was a slot 9 that was completely empty after the laptop recovery restart.
Trying a deep data recovery scan to see if that slot can be recovered.
UPDATE: No slot 9 recoverable.

I didn't stop BOINC quickly enough, so another user got the WU.
In case it had not succeeded in reporting the errored WU, I recovered BOINC's client_state.xml from just before the crash.
UPDATE: The recovered client_state.xml files are corrupted. There is no way to restore the BOINC state from before the crash.
I guess all that is left is a manually controlled run, with the results reported manually, if that is possible.

This makes me sad... I'm actually in tears.

Jacob Klein
Brain-Bug
Posts: 515
Joined: 26.07.2013 15:41

Re: Long running work unit

#1089 Post by Jacob Klein » 09.12.2020 17:29

I'm sorry, but it sounds like it's not recoverable. What I typically do is this: every week, close BOINC, make sure that all VirtualBox processes are also closed (either gracefully on their own, or by force), then open VirtualBox to make a Clone of the VM, to save as a backup. Then close VirtualBox, and wait for its processes to close. Then start BOINC again. I also have my BOINC set up to not auto-start with Windows, since it will auto-report failures, even though I might have a chance to restore things offline, test offline, and correct things offline before allowing it to start online.

I know this is information that is ... too late, for your scenario. Again, sorry.

But now might be a good time to familiarize yourself with VirtualBox, and cloning, etc. For the next time that you get a monster!
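
For anyone who would rather script this weekly routine than click through it, here is a minimal Python sketch. It is not what the post describes literally: the post uses the Clone action in the VirtualBox GUI, while this swaps in `VBoxManage clonevm`, its command-line counterpart. The process-name list, the function names, and the dated backup naming scheme are all assumptions made for illustration. BOINC must already be exited before any of this runs.

```python
import subprocess
from datetime import date

# Assumption: typical VirtualBox process names on Windows; verify on your own machine.
VBOX_PROCESSES = {"virtualbox.exe", "virtualboxvm.exe", "vboxheadless.exe", "vboxsvc.exe"}


def lingering_vbox_processes(running):
    """Return the VirtualBox-related names found in a list of running process names."""
    return sorted(name for name in running if name.lower() in VBOX_PROCESSES)


def clone_command(vm_name, today):
    """Build a VBoxManage command that clones the VM under a dated backup name."""
    backup_name = f"{vm_name}-backup-{today.isoformat()}"  # hypothetical naming scheme
    return ["VBoxManage", "clonevm", vm_name, "--name", backup_name, "--register"]


def weekly_backup(vm_name, running, today, run=subprocess.run):
    """Clone the VM only if no VirtualBox process is still running.

    Returns True if the clone command was issued, False if it was skipped.
    """
    if lingering_vbox_processes(running):
        return False  # something is still open: do not touch the VM
    run(clone_command(vm_name, today), check=True)
    return True
```

The `running` list is passed in rather than collected inside the function, so the check can be fed from whatever process listing is at hand and dry-run without touching a real VM.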

mister.marmot
Mikrocruncher
Posts: 27
Joined: 15.07.2016 00:53

Re: Long running work unit

#1090 Post by mister.marmot » 10.12.2020 13:59

Jacob Klein wrote:
09.12.2020 17:29
I'm sorry, but it sounds like it's not recoverable. What I typically do is this: every week, close BOINC, make sure that all VirtualBox processes are also closed (either gracefully on their own, or by force), then open VirtualBox to make a Clone of the VM, to save as a backup. Then close VirtualBox, and wait for its processes to close. Then start BOINC again. I also have my BOINC set up to not auto-start with Windows, since it will auto-report failures, even though I might have a chance to restore things offline, test offline, and correct things offline before allowing it to start online.

I know this is information that is ... too late, for your scenario. Again, sorry.

But now might be a good time to familiarize yourself with VirtualBox, and cloning, etc. For the next time that you get a monster!

Thank you for the information.

I have been creating VMs with VirtualBox for years. My 3 servers are running 8 to 12 four-core, minimalist-OS (Windows 7 or antiX) BOINC installs, which I update from an original and then propagate to the servers. Customers have hired me to transfer some of their older machines, with games or research, to VMs so they can run them from their new workstations.

I did a superb job of keeping this laptop running uninterrupted for long stretches; it had only restarted 5 times during the life of this 376-day WU and was only hibernated for a few hours in power failures during thunderstorms.
It's a 2008 model and was eventually going to have a failure.
For a BOINC project, it is beyond the scope of my responsibilities to clone or back up a work unit (VM). It is the responsibility of the project developers and managers to ensure one of three things: that a WU is a small slice of work, so that the loss of a single WU is insignificant to the user and the project; that the WU is failed if the client computer appears incapable of completing it in a reasonable time (deadlines); or that the WU is properly hardened against loss if it typically takes longer than 24-72 hours to complete.

Honestly, life is way too short, with way too many other joys and responsibilities, to add the burden of protecting WUs to users.

I do not think it is possible to get another of these WUs. If one did arrive, my main computing equipment would not be able to complete it: the machines have no battery backups, and a random power glitch (3 last year: 2 from high winds, 1 from snow falling off the lines) could kill the WU. I always shut them down if storms are coming. Also, because of climate change, the machines are off in the summer months. It's not ethical to use coal-fired plant electricity (35% of our area still is coal) to cool 64 cores in the summer, but they are excellent for home heating the rest of the year.

Will these last 30 WUs that are taking 400+ days still be relevant to the science?
Is a competing research team going to beat the project to paper publication?

Really would like to have completed this WU as a badge of honor and for Free-DC and WUProps badges.

Jacob Klein
Brain-Bug
Posts: 515
Joined: 26.07.2013 15:41

Re: Long running work unit

#1091 Post by Jacob Klein » 13.12.2020 05:32

We do the best we can. They don't have an easy way to break the task up any further, so ... we're lucky in fact to even have resumable VMs for them! (We didn't use to have that!)

Here's an example of me getting ready to delete my oldest weekly backup for a VM.
You can see that I VERY CAREFULLY (i.e., only when BOINC is exited and VirtualBox processes have closed) ... keep 3 weeks of rolling backups per RNA World VM. Fun.
This has allowed me to resume work toward completion, for some that have failed, with permission from the project admins.
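
The rolling scheme described here (keep the three newest weekly backups, delete the oldest) can be written down as a small pruning rule. The Python sketch below only decides which backups are due for deletion. The function name and the idea of tracking backups by date are assumptions for illustration, and the actual deletion is still something to do by hand in VirtualBox, with BOINC exited.

```python
from datetime import date


def backups_to_delete(backup_dates, keep=3):
    """Return the backup dates that fall outside the newest `keep`, oldest first."""
    newest_first = sorted(backup_dates, reverse=True)
    return sorted(newest_first[keep:])


# Example: a fourth weekly backup has just been made, so the oldest one is due.
weekly = [date(2020, 11, 22), date(2020, 11, 29), date(2020, 12, 6), date(2020, 12, 13)]
due = backups_to_delete(weekly)
```

With four weekly backups on hand, `due` holds only the oldest date. With three or fewer, it is empty and nothing is deleted.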
1.png (43.76 KiB), viewed 263 times
"Fun."
