Is this normal for a VM WU?
I currently have 2 VM workunits running. Both jumped to ~98% done very quickly but still say they have 22 hours and more than 2 days left, respectively. Is it normal for these to race through the progress display but still take so long to complete? Is there a way I can check what a task is currently doing?
Thanks
Re: Is this normal for a VM WU?
Can you please share a link to these workunits so we know which ones you are talking about?
-
- Admin
- Posts: 1920
- Registered: 23.02.2010 22:12
Re: Is this normal for a VM WU?
Yes, this is normal. The progress indicator does not seem to work with those very long workunits and is capped at 98.765%. You can check the timestamp of progress.txt in the slot/X/shared/ directory. If this file is still being updated, then the VM is still running.
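The timestamp check described above could be scripted roughly like this. It is only a sketch: the function names, the one-hour threshold and the way the path is built are assumptions for illustration, not anything provided by BOINC or the project. Replace X with the slot number of your own task.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_alive(path, max_age_seconds=3600, now=None):
    """Guess whether the VM is still working.

    The one-hour default is an arbitrary assumption; pick a value that
    fits how often progress.txt changes on your machine.
    """
    return seconds_since_update(path, now) <= max_age_seconds


if __name__ == "__main__":
    # Path as written in the post above, relative to the BOINC data
    # directory (assumption); replace X with your slot number.
    progress_file = os.path.join("slot", "X", "shared", "progress.txt")
    if os.path.exists(progress_file):
        age = seconds_since_update(progress_file)
        print(f"progress.txt last updated {age / 60:.0f} minutes ago")
        print("VM looks alive" if looks_alive(progress_file) else "VM may be stalled")
    else:
        print("progress.txt not found; check the path and slot number")
```

A recent timestamp only suggests that the VM is still writing progress; it says nothing about how much work is left.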
Re: Is this normal for a VM WU?
Thanks Christian
@yoyo. This was one of them http://www.rnaworld.de/rnaworld/result. ... d=14921104
It ended up erroring out after 17 hours, and the other one threw an error after 26 hours: http://www.rnaworld.de/rnaworld/result. ... d=14921014
Re: Is this normal for a VM WU?
Please check the WUs from host 26666. They look good, but they show "Error while computing".
Other issues are:
1) BOINC didn't stop the VM after it finished the WU
2) BOINC didn't delete the VMs after the WU finished
3) The VM didn't stop after closing the BOINC client...
4) After the first checkpoint the progress bar jumps to 98+%
-
- Admin
- Posts: 1920
- Registered: 23.02.2010 22:12
Re: Is this normal for a VM WU?
szopler wrote:
Please check WUs from host 26666. They are looking good but there is "Error while computing"
Another issues are:
1) Boinc didn't stop VM after it finishes WU
2) Boinc didn't delete VMs after WU finish
3) VM didn't stop after closing boinc client...
4) After first checkpoint progress bar jumps to 98+%
It seems that this is where the error comes from:
Code: Select all
2013-10-23 18:38:16 (2604): Creating new snapshot for VM.
2013-10-23 18:46:32 (2604): Deleting stale snapshot.
2013-10-23 18:47:02 (2604): Checkpoint completed.
2013-10-23 18:48:40 (2604): Creating new snapshot for VM.
2013-10-23 21:01:12 (4948): vboxwrapper: starting
2013-10-23 21:01:12 (4948): Feature: Enabling trickle-ups (Interval: 14400.000000)
2013-10-23 21:01:12 (4948): Detected: VirtualBox 4.2.16r86992
2013-10-23 21:02:18 (4948): Restore from previously saved snapshot.
2013-10-23 21:02:44 (4948): Restore completed.
2013-10-23 21:02:44 (4948): Starting VM.
2013-10-23 21:06:01 (4948): Error in start VM for VM: -2135228409
Arguments:
startvm "boinc_64d12757f0347bf9" --type headless
Output:
VBoxManage.exe: error: The machine 'boinc_64d12757f0347bf9' is already locked by a session (or being locked or unlocked)
VBoxManage.exe: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component Machine, interface IMachine, callee IUnknown
VBoxManage.exe: error: Context: "LaunchVMProcess(a->session, sessionType.raw(), env.raw(), progress.asOutParam())" at line 580 of file VBoxManageMisc.cpp
Notes:
Another VirtualBox management application has locked the session for
this VM. BOINC cannot properly monitor this VM
and so this job will be aborted.
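The number in "Error in start VM for VM: -2135228409" and the code 0x80bb0007 (VBOX_E_INVALID_OBJECT_STATE) in the VBoxManage output appear to be the same 32-bit value, printed once as a signed decimal and once as hex. A small sketch to check this; the helper name is made up for illustration:

```python
def to_unsigned_hex32(value):
    """Reinterpret a signed 32-bit integer as unsigned and format it as hex."""
    return f"0x{value & 0xFFFFFFFF:08x}"


# Value taken from the "Error in start VM" line in the log above.
print(to_unsigned_hex32(-2135228409))  # prints 0x80bb0007
```

So both lines report the same "already locked by a session" condition rather than two separate errors.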
Code: Select all
2013-10-23 15:13:01 (304): Checkpoint completed.
2013-10-23 15:15:52 (304): Powering off VM.
2013-10-23 15:46:44 (3180): vboxwrapper: starting
2013-10-23 15:46:44 (3180): Feature: Enabling trickle-ups (Interval: 14400.000000)
2013-10-23 15:46:44 (3180): Detected: VirtualBox 4.2.16r86992
2013-10-23 15:47:50 (3180): Restore from previously saved snapshot.
2013-10-23 15:48:16 (3180): Restore completed.
2013-10-23 15:48:16 (3180): Starting VM.
Code: Select all
2013-10-23 13:50:20 (4556): Checkpoint completed.
2013-10-23 13:58:44 (4556): Creating new snapshot for VM.
2013-10-23 14:00:33 (4556): Deleting stale snapshot.
2013-10-23 14:01:20 (4556): Checkpoint completed.
2013-10-23 14:02:54 (304): vboxwrapper: starting
2013-10-23 14:02:54 (304): Feature: Enabling trickle-ups (Interval: 14400.000000)
2013-10-23 14:02:54 (304): Detected: VirtualBox 4.2.16r86992
2013-10-23 14:04:00 (304): Restore from previously saved snapshot.
2013-10-23 14:04:22 (304): Restore completed.
2013-10-23 14:04:22 (304): Starting VM.
2013-10-23 14:05:23 (304): Successfully started VM.
2013-10-23 14:05:23 (304): Setting cpu throttle for VM. (100%)
Re: Is this normal for a VM WU?
I stopped BOINC just like I always do before going to bed, and everything seemed fine. I brought the computer out of hibernation this morning and restarted BOINC, and another VM WU errored out.
This one here http://www.rnaworld.de/rnaworld/result. ... d=14921606
Re: Is this normal for a VM WU?
How long does it really take to run one of these VM tasks (on a decent processor)?
cmsvm_GA-p[e30-50MB_Lin64f]_1_Oryzias-latipes-(Japanese-medaka)_DG000014.lin.EMBL_RF00028_Intron_gpI_1330438623_100184_3
http://www.rnaworld.de/rnaworld/result. ... d=14920935
Also watched the task reach 98.765% and then the progress didn't change (explained above). Elapsed time is >112h and remaining time is 26.5h.
The stderr is presently 172MB and the VM download was >800MB - a fairly hefty overhead.
If the GFLOPs figure (26369424) is anything to go by and I compare it to a climate model, it will take 6 times as long to run: 1477h (2 months).
If that's the case, it's a LONG WU, as in an XXL (long), just running in a VM. Something you might want to indicate!
My project settings
- Run only the selected applications
cmsearch XXL (long) 1.0.2: no
cmcalibrate 1.0.2: yes
cmalign 1.0.2: yes
cmbuild 1.0.2: yes
cmsearch S (short) 1.0.2: yes
cmsearch VM (VirtualBox) 1.0.2: yes
If no work for selected applications is available, accept work from other applications? no
Re: Is this normal for a VM WU?
Hello skgiven,
I am somewhat glad to see that somebody else is reporting the problem of a growing stderr here. It is a problem I have been facing for two weeks with all running VM workunits.
1. The first download is > 800 MB because a template of the virtual disk image is downloaded, which is then reused for every new VM / workunit. You can find it in
\projects\www.rnaworld.de_rnaworld\
2. I guess you can use the estimated runtime on the reference system, which you find on the work package page, for a rough estimate. For your work package this is 16w 2d 8h 2m 25s.
3. On my three Win7 PCs the stderr grows by about 10 MB/hr. I have already deleted several GB of its content (total for all running workunits). I do that by simply copying an empty stderr.txt over the oversized file every two or three days. If the stderr reappears at the same size as before (presumably because it was in use at that very moment), just repeat the procedure.
4. The above will not stop the growth, however. I can stop the excessive growth of stderr by stopping BOINC and the applications, restarting the PC, and manually starting BOINC again. But, and this is a big 'but', stopping BOINC and restarting the PC has already led to several broken workunits for many users, especially if more than one VM is running. So make sure all VM processes have really stopped whenever you restart BOINC or the PC. After a restart the stderr grows only slightly for about 12 to 20 hrs on my systems, containing only the normal snapshot reports. Then the error reappears and stderr begins to grow again at 10 MB/hr.
5. I'd recommend having at least 6 GB of hard disk space available (for BOINC!) for every VM workunit. This is the current <rsc_disk_bound> limit, at least. I've even increased that to avoid any 'Maximum disk usage exceeded' error.
6. If you want to run more than one VM concurrently, make sure you have sufficient RAM available.
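The manual clean-up from point 3 could be sketched as a small script. This is only an illustration under assumptions: the function name and the 100 MB threshold are made up, and emptying the file in place is used here instead of copying an empty stderr.txt over it. Like the manual method, it does not stop the growth, and it may fail while the wrapper has the file open.

```python
import os


def truncate_if_large(path, limit_bytes=100 * 1024 * 1024):
    """Empty the file at `path` if it is larger than `limit_bytes`.

    Returns True if the file was emptied, False otherwise. The 100 MB
    default is an arbitrary assumption, not a BOINC setting.
    """
    try:
        if os.path.getsize(path) <= limit_bytes:
            return False
        # Opening in "w" mode truncates the file to zero length.
        with open(path, "w"):
            pass
        return True
    except OSError:
        # The file may be missing or locked by the running wrapper; in
        # that case leave it alone and simply try again later.
        return False
```

Calling it with the path of the stderr.txt in the relevant slot directory every day or two would mirror the procedure described above.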
Hope this helps a bit. Good luck with your workunit.
Regards
-
- Brain-Bug
- Posts: 564
- Registered: 26.07.2013 15:41
Re: Is this normal for a VM WU?
I'm in the same boat -- left wondering how long it will really take.
If I open VirtualBox Manager, right-click the VM, and go to Settings -> Display -> Remote Display -> Enable Server -> Assign Port (I used 1122), then (while the task is running) I can use Remote Desktop to connect to localhost:1122. There I can see that the forecast lists 2 "lines" with a "prediction" run time total of 1353:28:25.34. But I also know that the cmsearch forecasts are wildly inaccurate and cannot be trusted at all, according to previous posts I read.
My task has been running for 110 hours, and BOINC shows a Remaining (estimated) time of 16.5 hours. But I think that's based on estimated fpops, which is wildly inaccurate and cannot be trusted at all.
So then I open the work unit on the website, and see:
estimated runtime on reference system 8w 4d 6h 35m 5s (5207705.7987105 s)
... but I wager that it, too, is wildly inaccurate, since we have no means to know how much "work" a given task has to do compared to other tasks. (Well, we have the separation between "Small" and "XL", but other than that, I don't think we know.)
Suffice to say: We have no idea how long it will take.
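As a side note, the two forms of the website's estimate can be checked against each other. A minimal sketch, assuming the "w/d/h/m/s" format shown on the work unit page; the function and constant names are made up for illustration:

```python
import re

# Seconds per unit as used in strings like "8w 4d 6h 35m 5s".
UNIT_SECONDS = {"w": 7 * 24 * 3600, "d": 24 * 3600, "h": 3600, "m": 60, "s": 1}


def duration_to_seconds(text):
    """Convert a string such as '8w 4d 6h 35m 5s' to whole seconds."""
    total = 0
    for amount, unit in re.findall(r"(\d+)\s*([wdhms])", text):
        total += int(amount) * UNIT_SECONDS[unit]
    return total


print(duration_to_seconds("8w 4d 6h 35m 5s"))  # prints 5207705
```

That matches the 5207705.7987105 s shown in brackets, apart from the fraction, so the page is at least consistent with itself. Whether the estimate is realistic is a separate question.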
Which brings me to my main question:
My task has a deadline of 11/11/2013. What happens if it's not done by then? Will the work be lost? Note: I don't want any manual intervention (like that thread where everyone asks for deadline extensions).... If this task goes over deadline, what will happen?
Thanks,
Jacob Klein