Is this normal for a VM WU?

#1 Post by **Plomos** » 22.10.2013 21:59

I have 2 VM workunits running currently and they both jumped to ~98 % done very quickly but still say they have 22 hours and more than 2 days left respectively to finish. Is it normal for these to go through progress so fast but still have so long to complete? Is there a way I can check and see what it is doing currently?

Thanks

#2 Post by **yoyo** » 23.10.2013 07:44

Can you please share a link to these workunits so we know which ones you are talking about?

#3 Post by **ChristianB** » 23.10.2013 07:47

Yes, this is normal. The progress indicator does not seem to work with those very long workunits and is capped at 98.765%. You can check the timestamp of the progress.txt in the slot/X/shared/ directory. If that file is still being updated, then the VM is still running.
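The timestamp check described above can be scripted. The following is only a minimal sketch: the function names are made up for illustration, and the one-hour threshold is an assumption, since the thread does not say how often progress.txt is rewritten. Pass the full path to progress.txt inside your own slot directory.

```python
import os
import time
from typing import Optional


def seconds_since_update(path: str, now: Optional[float] = None) -> float:
    """Return how many seconds have passed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_alive(path: str, max_age_seconds: float = 3600.0) -> bool:
    """Return True if `path` was modified within the last `max_age_seconds`.

    The default of one hour is an assumed threshold, not a documented value.
    """
    return seconds_since_update(path) <= max_age_seconds
```

If `looks_alive` returns False for a long time, that only suggests the VM may have stalled; it is worth cross-checking against the snapshot messages in stderr.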

#4 Post by **Plomos** » 23.10.2013 23:31

Thanks Christian

@yoyo. This was one of them http://www.rnaworld.de/rnaworld/result. ... d=14921104

It ended up erroring out after 17 hours, and the other one threw an error after 26 hours http://www.rnaworld.de/rnaworld/result. ... d=14921014

#5 Post by **szopler** » 24.10.2013 22:30

Please check the WUs from host 26666. They look good, but the result is "Error while computing".

Other issues are:
1) BOINC didn't stop the VM after it finished the WU
2) BOINC didn't delete the VMs after the WU finished
3) The VM didn't stop after closing the BOINC client...
4) After the first checkpoint the progress bar jumps to 98+%

#6 Post by **ChristianB** » 25.10.2013 13:03

szopler wrote: Please check the WUs from host 26666. They look good, but the result is "Error while computing".

Other issues are:
1) BOINC didn't stop the VM after it finished the WU
2) BOINC didn't delete the VMs after the WU finished
3) The VM didn't stop after closing the BOINC client...
4) After the first checkpoint the progress bar jumps to 98+%

It seems that this is the place where the error comes from:

Code:

2013-10-23 18:38:16 (2604): Creating new snapshot for VM.
2013-10-23 18:46:32 (2604): Deleting stale snapshot.
2013-10-23 18:47:02 (2604): Checkpoint completed.
2013-10-23 18:48:40 (2604): Creating new snapshot for VM.
2013-10-23 21:01:12 (4948): vboxwrapper: starting
2013-10-23 21:01:12 (4948): Feature: Enabling trickle-ups (Interval: 14400.000000)
2013-10-23 21:01:12 (4948): Detected: VirtualBox 4.2.16r86992
2013-10-23 21:02:18 (4948): Restore from previously saved snapshot.
2013-10-23 21:02:44 (4948): Restore completed.
2013-10-23 21:02:44 (4948): Starting VM.
2013-10-23 21:06:01 (4948): Error in start VM for VM: -2135228409
Arguments:
startvm "boinc_64d12757f0347bf9" --type headless
Output:
VBoxManage.exe: error: The machine 'boinc_64d12757f0347bf9' is already locked by a session (or being locked or unlocked)
VBoxManage.exe: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component Machine, interface IMachine, callee IUnknown
VBoxManage.exe: error: Context: "LaunchVMProcess(a->session, sessionType.raw(), env.raw(), progress.asOutParam())" at line 580 of file VBoxManageMisc.cpp

Notes:

Another VirtualBox management application has locked the session for
this VM. BOINC cannot properly monitor this VM
and so this job will be aborted.

So the VM was still running after 18:48. Do you know what you did at that time? Did you kill or stop the client? At other times the VM shutdown/restart seems to work:

Code:

2013-10-23 15:13:01 (304): Checkpoint completed.
2013-10-23 15:15:52 (304): Powering off VM.
2013-10-23 15:46:44 (3180): vboxwrapper: starting
2013-10-23 15:46:44 (3180): Feature: Enabling trickle-ups (Interval: 14400.000000)
2013-10-23 15:46:44 (3180): Detected: VirtualBox 4.2.16r86992
2013-10-23 15:47:50 (3180): Restore from previously saved snapshot.
2013-10-23 15:48:16 (3180): Restore completed.
2013-10-23 15:48:16 (3180): Starting VM.

And it works even for shorter time slices:

Code:

2013-10-23 13:50:20 (4556): Checkpoint completed.
2013-10-23 13:58:44 (4556): Creating new snapshot for VM.
2013-10-23 14:00:33 (4556): Deleting stale snapshot.
2013-10-23 14:01:20 (4556): Checkpoint completed.
2013-10-23 14:02:54 (304): vboxwrapper: starting
2013-10-23 14:02:54 (304): Feature: Enabling trickle-ups (Interval: 14400.000000)
2013-10-23 14:02:54 (304): Detected: VirtualBox 4.2.16r86992
2013-10-23 14:04:00 (304): Restore from previously saved snapshot.
2013-10-23 14:04:22 (304): Restore completed.
2013-10-23 14:04:22 (304): Starting VM.
2013-10-23 14:05:23 (304): Successfully started VM.
2013-10-23 14:05:23 (304): Setting cpu throttle for VM. (100%)

It seems that sometimes it's working and sometimes not. Let's find out what happened in those cases and hopefully find a difference so we can fix it in the vboxwrapper.
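To compare the working and failing cases, it may help to list the last message each wrapper process wrote before the next one started. Here is a rough sketch, assuming the stderr lines keep the `timestamp (PID): message` layout seen in the excerpts above; the names `SessionEnd` and `last_message_per_session` are invented for this example and are not part of vboxwrapper.

```python
import re
from typing import List, NamedTuple

# Matches lines such as "2013-10-23 18:48:40 (2604): Creating new snapshot for VM."
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")


class SessionEnd(NamedTuple):
    pid: int
    timestamp: str
    last_message: str


def last_message_per_session(log_text: str) -> List[SessionEnd]:
    """Return the final log line of each run of consecutive lines sharing a PID.

    Lines without the timestamp/PID prefix (e.g. VBoxManage output) are skipped.
    """
    ends: List[SessionEnd] = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        timestamp, pid, message = match.group(1), int(match.group(2)), match.group(3)
        if ends and ends[-1].pid == pid:
            # Same process is still logging: overwrite its last known line.
            ends[-1] = SessionEnd(pid, timestamp, message)
        else:
            # A new PID means a new wrapper process has started.
            ends.append(SessionEnd(pid, timestamp, message))
    return ends
```

In the excerpts above, the failing restart follows a process whose last line was "Creating new snapshot for VM.", while the working restarts follow "Powering off VM." or "Checkpoint completed."; whether that difference is the actual cause is still open.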

#7 Post by **Plomos** » 26.10.2013 15:26

I stopped BOINC just like I always do before going to bed, and everything seemed fine. I brought the computer out of hibernation this morning and restarted BOINC, and another VM WU errored out.

This one here http://www.rnaworld.de/rnaworld/result. ... d=14921606

#8 Post by **skgiven** » 27.10.2013 09:33

How long does it really take to run one of these VM tasks (on a decent processor)?

cmsvm_GA-p[e30-50MB_Lin64f]_1_Oryzias-latipes-(Japanese-medaka)_DG000014.lin.EMBL_RF00028_Intron_gpI_1330438623_100184_3
http://www.rnaworld.de/rnaworld/result. ... d=14920935

Also watched the task reach 98.765% and then the progress didn't change (explained above). Elapsed time is >112h and remaining time is 26.5h.
The stderr is presently 172MB and the VM download was >800MB - a fairly hefty overhead.

If the GFLOPs figure is anything to go by (26369424) and I compare it to a climate model, it will take 6 times as long to run: 1477h (2 months).
If that's the case it's a LONG WU, as in an XXL (long), just running in a VM. Something you might want to indicate!

My project settings

Run only the selected applications
cmsearch XXL (long) 1.0.2: no
cmcalibrate 1.0.2: yes
cmalign 1.0.2: yes
cmbuild 1.0.2: yes
cmsearch S (short) 1.0.2: yes
cmsearch VM (VirtualBox) 1.0.2: yes
If no work for selected applications is available, accept work from other applications? no

#9 Post by **Beorn** » 27.10.2013 14:36

Hello skgiven,

I am somewhat relieved to see that somebody else is reporting the problem of a growing stderr here.

It is a problem I have been facing for two weeks with all running VM workunits.

1. The first download is > 800 MB since a template of the virtual disk image is downloaded, which is then reused for every new VM / workunit. You can find it in
\projects\www.rnaworld.de_rnaworld\

2. I guess you can use the estimated runtime on the reference system, which you find on the work package page, for a rough estimate. For your work package this is 16w 2d 8h 2m 25s.

3. On my three Win7 PCs the stderr is growing by about 10 MB/hr. I have already deleted several GB of its content (total for all running workunits). I do that by simply copying an empty stderr.txt over the oversized file every two or three days. If the stderr reappears with the same size as before (presumably because it was in use at that very moment), just repeat the procedure.

4. The above will not stop the growth, however. I can stop the excessive growth of stderr by stopping BOINC and the applications, restarting the PC, and manually starting BOINC again. But - and this is a big 'But' - stopping BOINC and restarting the PC has already led to several broken workunits for many users, especially if more than one VM is running. So make sure all VM processes have really stopped whenever you restart BOINC or the PC. After a restart the stderr grows only slightly for about 12 to 20 hrs on my systems and contains only the normal snapshot reports. Then the error reappears and stderr begins to grow again by 10 MB/hr.

5. I'd recommend having at least 6 GB of hard disk space available (for BOINC!) for every VM workunit. This is the current <rsc_disk_bound> limit, at least. I've even increased that to avoid any 'Maximum disk usage exceeded' error.

6. If you want to run more than one VM concurrently, make sure you have sufficient RAM available.
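The disk space rule of thumb from point 5 can be checked with a few lines of Python. This is just a sketch under stated assumptions: the helper names are hypothetical, "6 GB" is interpreted here as 6 GiB, and you have to pass the path of the drive that holds your own BOINC data directory.

```python
import shutil

GIB = 1024 ** 3  # assumption: "6 GB" is treated as 6 GiB


def required_bytes(vm_workunits: int, gib_per_workunit: float = 6.0) -> int:
    """Disk space needed for the given number of VM workunits, in bytes."""
    return int(vm_workunits * gib_per_workunit * GIB)


def has_enough_space(free_bytes: int, vm_workunits: int,
                     gib_per_workunit: float = 6.0) -> bool:
    """True if `free_bytes` covers all workunits at the per-workunit budget."""
    return free_bytes >= required_bytes(vm_workunits, gib_per_workunit)


def free_bytes_at(path: str) -> int:
    """Free bytes on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free
```

For example, `has_enough_space(free_bytes_at("."), 2)` tells you whether the current drive has room for two VM workunits. Note that this looks at free space on the drive, not at the disk limit configured inside BOINC.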

Hope this helps a bit. Good luck with your workunit.

Regards

#10 Post by **Jacob Klein** » 01.11.2013 04:46

I'm in the same boat -- left wondering how long it will really take.

If I open VirtualBox Manager, right-click the VM, Settings -> Display -> Remote Display -> Enable Server -> Assign Port (I used 1122).... Then (while the task is running) use Remote Desktop to connect to localhost:1122 ... I can see the forecast lists 2 "lines" with a "prediction" run time total of 1353:28:25.34. But I also know that the cmsearch forecasts are wildly inaccurate, and cannot be trusted at all, according to previous posts I read.

My task has been running for 110 hours, and BOINC says Remaining (estimated) of 16.5 hours. But I think that's based on estimated fpops, which is wildly inaccurate, and cannot be trusted at all.

So then I open the work unit on the website, and see:
estimated runtime on reference system 8w 4d 6h 35m 5s (5207705.7987105 s)
... but I wager that it, too, is wildly inaccurate, since we have no means to know how much "work" a given task has to do, compared to other tasks. (Well, we have separation between "Small" and "XL", but other than that, I don't think we know.)

Suffice to say: We have no idea how long it will take.
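For comparing the different figures, it can help to convert the website's "8w 4d 6h 35m 5s" notation into seconds or hours. A small sketch follows; the function name is made up, and it assumes the units are weeks, days, hours, minutes and seconds, which agrees with the 5207705 s shown next to that estimate.

```python
import re

# Seconds per unit; assumes w = weeks, d = days, h = hours, m = minutes, s = seconds.
UNIT_SECONDS = {"w": 7 * 24 * 3600, "d": 24 * 3600, "h": 3600, "m": 60, "s": 1}


def duration_to_seconds(text: str) -> int:
    """Convert a string such as '8w 4d 6h 35m 5s' into whole seconds."""
    total = 0
    for amount, unit in re.findall(r"(\d+)\s*([wdhms])", text):
        total += int(amount) * UNIT_SECONDS[unit]
    return total


def duration_to_hours(text: str) -> float:
    """Same conversion, expressed in hours."""
    return duration_to_seconds(text) / 3600.0
```

The result only restates the reference-system estimate in other units; it says nothing about how accurate that estimate is for a particular host.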

Which brings me to my main question:
My task has a deadline of 11/11/2013. What happens if it's not done by then? Will the work be lost? Note: I don't want any manual intervention (like that thread where everyone asks for deadline extensions).... If this task goes over deadline, what will happen?

Thanks,
Jacob Klein

Rechenkraft.net e.V.
