Long running work unit

Everything about the project RNA World
Peter Hucker
Mikrocruncher
Posts: 30
Joined: 19.08.2017 13:56

Re: Long running work unit

#901 Post by Peter Hucker » 31.08.2017 11:34

ChristianB wrote: The thing is, vboxwrapper is supposed to do just what you all described: keep the snapshot around until the next one has been written successfully. Unfortunately, if the snapshot operation itself is disturbed by something, the result is still recognized as the last snapshot (although it is not usable), and vboxwrapper will not try to revert to the snapshot before that. This is hard to reproduce, which is why nobody has investigated this rare occurrence, given the resource constraints at Rechenkraft and BOINC.

Could it not be reproduced by cutting the power to a computer while it's writing?

Jacob Klein
Brain-Bug
Posts: 564
Joined: 26.07.2013 15:41

Re: Long running work unit

#902 Post by Jacob Klein » 31.08.2017 11:48

:) That's what I expected, Christian. It'd be nice if vboxwrapper could see that there are multiple snapshots in the VM and, if the newest doesn't work, keep trying older ones until it finds one that works.

Any chance of a code change to do that magic?
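
To make the idea concrete, here is a minimal sketch of that fallback in Python. It is not vboxwrapper code: `restore_latest_usable`, `try_restore`, and the snapshot names are hypothetical, and it assumes that snapshots can be listed newest first and that a failed restore can be detected.

```python
from typing import Callable, Optional, Sequence


def restore_latest_usable(
    snapshots: Sequence[str],
    try_restore: Callable[[str], bool],
) -> Optional[str]:
    """Return the name of the newest snapshot that restores, or None.

    `snapshots` is assumed to be ordered newest first. `try_restore` is a
    hypothetical callback that returns True when the restore succeeded.
    """
    for name in snapshots:
        if try_restore(name):
            return name
    return None


# Illustration with made-up snapshot names: the newest one is damaged.
damaged = {"snap-3"}
chosen = restore_latest_usable(
    ["snap-3", "snap-2", "snap-1"],
    lambda name: name not in damaged,
)
print(chosen)  # snap-2
```

The hard part in practice would be the callback: deciding reliably that a restore has failed, which is exactly the case that is difficult to reproduce.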

ChristianB
Admin
Posts: 1920
Joined: 23.02.2010 22:12

Re: Long running work unit

#903 Post by ChristianB » 31.08.2017 16:23

Peter Hucker wrote:
ChristianB wrote: The thing is, vboxwrapper is supposed to do just what you all described: keep the snapshot around until the next one has been written successfully. Unfortunately, if the snapshot operation itself is disturbed by something, the result is still recognized as the last snapshot (although it is not usable), and vboxwrapper will not try to revert to the snapshot before that. This is hard to reproduce, which is why nobody has investigated this rare occurrence, given the resource constraints at Rechenkraft and BOINC.
Could it not be reproduced by cutting the power to a computer while it's writing?

Of course. If you find a developer who is willing to do this to his development machine, please let me know. It might be possible to interrupt VBox programmatically while it is writing, to simulate that, but I don't know enough about this to be sure or to investigate.

Jacob: Rom was the only one working on vboxwrapper, and he is not working on BOINC anymore. There is a guy at CERN who does some work on vboxwrapper for LHC@home, but they don't use snapshots, so they don't touch that part. With respect to RNA World, I have more important projects that I want to finish before looking into vboxwrapper.

Jacob Klein
Brain-Bug
Posts: 564
Joined: 26.07.2013 15:41

Re: Long running work unit

#904 Post by Jacob Klein » 31.08.2017 17:45

I understand, Christian. I was asking from a more "academic" "is it even possible to do" perspective :)

Jacob Klein
Brain-Bug
Posts: 564
Joined: 26.07.2013 15:41

Re: Long running work unit

#905 Post by Jacob Klein » 01.09.2017 21:43


gemini8
Association board member
Posts: 5898
Joined: 31.05.2011 10:30
Location: Hannover

Re: Long running work unit

#906 Post by gemini8 » 02.09.2017 22:16

Yeah, bring them down!
Regards, Jens
- - - - - -
Low-end user and part-time cruncher

Jacob Klein
Brain-Bug
Posts: 564
Joined: 26.07.2013 15:41

Re: Long running work unit

#907 Post by Jacob Klein » 20.09.2017 04:15

RacerX, after having a bit of a heat stroke, proved that he loves his new water cooler...
... by COMPLETING his longest-running-task-ever, less than a day after hardware installation!
355.4d :smoking:
http://www.rnaworld.de/rnaworld/result. ... d=14952308

This was his last v1.17 task, so ... he'll be moving to the Fast Ring partition, as soon as I get Zathras figured out.

gemini8
Association board member
Posts: 5898
Joined: 31.05.2011 10:30
Location: Hannover

Re: Long running work unit

#908 Post by gemini8 » 20.09.2017 06:47

Congrats again, Mr. Monster-Hunter! :-D
Regards, Jens
- - - - - -
Low-end user and part-time cruncher

Jacob Klein
Brain-Bug
Posts: 564
Joined: 26.07.2013 15:41

Re: Long running work unit

#909 Post by Jacob Klein » 26.09.2017 13:46

Speed just finished a near-record-breaker for him.
303.7d :smoking:
http://www.rnaworld.de/rnaworld/workuni ... id=6330837

gemini8
Association board member
Posts: 5898
Joined: 31.05.2011 10:30
Location: Hannover

Re: Long running work unit

#910 Post by gemini8 » 26.09.2017 19:14

More than three years for validating. Impressive...
Regards, Jens
- - - - - -
Low-end user and part-time cruncher

robertmiles
XBOX360-Installer
Posts: 86
Joined: 23.02.2010 18:43
Location: northern Alabama, US

Re: Long running work unit

#911 Post by robertmiles » 14.10.2017 06:45

This task completed in 2012, apparently successfully, but is still waiting for validation:

http://www.rnaworld.de/rnaworld/result. ... d=14832293

I suspect that it reached a limit of 50 failures for wingmates.

Could you check if repeating it under a VM would give a useful output to compare it to?


This workunit appears to be stuck at 98.765% progress; the progress has not changed in the last 86 hours. However, a checkpoint was written within the last 10 minutes.

http://www.rnaworld.de/rnaworld/result. ... d=14953356

It was interrupted by what appeared to be a Windows 10 update, followed by running several MT tasks from another BOINC project that required using all CPU cores BOINC is allowed to use.

The estimated remaining time is counting down by one second every second 9 times, then jumping up by 9 seconds in the next second, then repeating that over and over.

Is this normal behavior for a task apparently that close to completion? For example, is the progress not allowed to advance past 98.765% until the VirtualBox portion finishes? Is the VirtualBox portion telling the wrapper enough information that the wrapper can base its progress percentage on something more than the elapsed time as a percentage of the time it estimates the task will run?

Jacob Klein
Brain-Bug
Posts: 564
Joined: 26.07.2013 15:41

Re: Long running work unit

#912 Post by Jacob Klein » 14.10.2017 07:04

robertmiles wrote: This task completed in 2012, apparently successfully, but is still waiting for validation:
http://www.rnaworld.de/rnaworld/result. ... d=14832293
I suspect that it reached a limit of 50 failures for wingmates.
Could you check if repeating it under a VM would give a useful output to compare it to?

Michael may be able to confirm, but it's likely that this work unit was issued using the "cmsearch XXL (long)" app (before VM apps existed). To get a VM wingman for it, they set initial replication to 0 so that no more copies are sent under that app, and your wingman has a different work unit, with an initial replication of 1, using the "cmsearch VM (VirtualBox) 1.0.2" app.

robertmiles wrote: This workunit appears to be stuck at 98.765% progress; the progress has not changed in the last 86 hours.
http://www.rnaworld.de/rnaworld/result. ... d=14953356
It was interrupted by what appeared to be a Windows 10 update, followed by running several MT tasks from another BOINC project that required using all CPU cores BOINC is allowed to use.

The estimated remaining time is counting down by one second every second 9 times, then jumping up by 9 seconds in the next second, then repeating that over and over.

Is this normal behavior for a task apparently that close to completion? For example, is the progress not allowed to advance past 98.765% until the VirtualBox portion finishes? Is the VirtualBox portion telling the wrapper enough information that the wrapper can base its progress percentage on something more than the elapsed time as a percentage of the time it estimates the task will run?

We've explained this before. YES, it is normal for it to be at 98.765% and to remain there. Also, the time remaining is IRRELEVANT for these VM tasks, and the behavior you are seeing (it counts down, then jumps to a larger value) is completely normal. The reason is that it did its best to estimate how long the task would take, but the estimate was way off. I've seen it off by a factor of 3 or 4 before. When the estimate is off, the task will remain at 98.765% until completion. The rule is: if the task is still using CPU, then DO NOT ABORT.
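
As a rough illustration of why the percentage stalls, here is a small Python sketch. It assumes that progress is simply elapsed time divided by the estimated runtime, capped at 98.765% until the task actually finishes. That formula is an assumption made for illustration, not vboxwrapper's actual code, and `reported_progress` and its parameters are made-up names.

```python
PROGRESS_CAP = 0.98765  # the ceiling that the task sits at until it completes


def reported_progress(elapsed_days: float, estimated_days: float, finished: bool) -> float:
    """Time-based progress estimate, capped until the task really finishes."""
    if finished:
        return 1.0
    if estimated_days <= 0:
        return 0.0  # no usable estimate yet
    return min(elapsed_days / estimated_days, PROGRESS_CAP)


# With a (hypothetical) estimate of 100 days:
print(reported_progress(50, 100, finished=False))   # 0.5
print(reported_progress(300, 100, finished=False))  # 0.98765
print(reported_progress(300, 100, finished=True))   # 1.0
```

Under that assumption, a task whose real runtime is three times the estimate spends about two thirds of its life pinned at the cap.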

I have had tasks sit at 98.765% for many, many months before they completed successfully. Completely normal. These are MONSTERS. For your particular task, a very rough estimate might be that it will take 6 to 18 months to complete. A wingman hasn't completed it, so I can't give any better estimate, sorry. But your work unit's "estimated runtime on reference system" value of 14 weeks (also a lousy estimate, FYI) is one of the largest that I've seen.

For your reference, here are some data points on my completed tasks ... indicating when they went to "98.765%", and when they completed:
Reached 98.765% at, Completed at
106.9d, 303.7d
105.9d, 192.4d
123.1d, 216.4d
117.1d, 144.7d
118.3d, 320.3d
126.3d, 189.4d
99.4d, 228.5d
100.9d, 276.1d
177.3d, 355.4d
165.2d, 318.8d
144.6d, 226.9d
225.0d, 353.8d
192.2d, 555.9d

I'm working on a HUGE MONSTER right now, on a slow laptop, where it reached 98.765% at 196.5d and is currently at 540d! Talk about patience!

So, as you can see, you may only be a third of the way to completion when you reach that 98.765% point. MONSTERS.
If they're still using CPU, and you feel like continuing the challenge, then DO NOT ABORT. :)
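
For anyone who wants to check the "a third of the way" figure, here is a short Python sketch that divides the time at which each task reached 98.765% by its total runtime, using the data points listed above. The variable names are just for illustration.

```python
# (days elapsed when progress reached 98.765%, days elapsed at completion)
tasks = [
    (106.9, 303.7), (105.9, 192.4), (123.1, 216.4), (117.1, 144.7),
    (118.3, 320.3), (126.3, 189.4), (99.4, 228.5), (100.9, 276.1),
    (177.3, 355.4), (165.2, 318.8), (144.6, 226.9), (225.0, 353.8),
    (192.2, 555.9),
]

# Fraction of the real runtime that had passed when the display stalled.
fractions = [stalled / done for stalled, done in tasks]

lowest = min(fractions)
highest = max(fractions)
print(f"lowest: {lowest:.2f}, highest: {highest:.2f}")  # lowest: 0.35, highest: 0.81
```

So across these 13 tasks, the point where the display stalled fell anywhere from about 35% to about 81% of the real runtime.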

I'd be careful about the upcoming Fall Creators Update, though. I'm not sure which versions of VirtualBox are compatible with it. If you are using a v5.1.x version, you might want to carefully close BOINC and then upgrade to the LATEST v5.1.x version at some point. But do NOT upgrade to v5.2.x, because your v1.18 task NEEDS v5.1.x.

Regards,
Jacob Klein
