Assistance needed - 2 Long-running VMs just failed - Clones

Everything about the project RNA World
Nachricht
Autor
Jacob Klein
Brain-Bug
Brain-Bug
Beiträge: 564
Registriert: 26.07.2013 15:41

Assistance needed - 2 Long-running VMs just failed - Clones

#1 Ungelesener Beitrag von Jacob Klein » 02.07.2014 14:35

Christian,

I just had my 2 big long-running (2000+ hr each, across 4-5 months each) RNA World VM tasks, fail, at nearly the exact same time, while I was away from the machine. Sad day. I do, however, have full VM clones from 2 days ago, for both tasks. The details of the crashes are below, including the Event Log, which I did not see anything suspicious with (except a GPUGrid task completed 14 seconds earlier, but probably did not interfere).

It looks to me like one task said got "virtualbox.exe/vboxheadless.exe is no longer running" at 07:05:32, and then both tasks went into "Poweroff" at 07:05:36.

My questions are:
- Do we have any idea what may have triggered this to happen? Maybe did one of them complete?
- Would you like to attempt anything with my clones? I actually have 17 clones of each task, at various completion levels.
- How can we/I prevent this problem in the future?
- Is there any additional information you might want that I can provide?

----------------------------------------------------------------------------------------------------
Task 1:
http://www.rnaworld.de/rnaworld/result. ... d=14924361
http://www.rnaworld.de/rnaworld/workuni ... id=6330945
Created 6 Feb 2014, 12:47:35 UTC
Name cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12
Outcome Computation error
Client state Compute error
Run time 10,849,056.78
CPU time 10,810,220.00
Application version cmsearch VM (VirtualBox) 1.0.2 v1.12 (vbox64)
stderr:
<core_client_version>7.4.8</core_client_version>
<![CDATA[
<stderr_txt>
....
2014-07-02 07:02:21 (4304): Creating new snapshot for VM.
2014-07-02 07:02:21 (4304): Restoring VM Process priority.
2014-07-02 07:03:02 (4304): Lowering VM Process priority.
2014-07-02 07:03:03 (4304): Deleting stale snapshot.
2014-07-02 07:03:09 (4304): Checkpoint completed.
2014-07-02 07:05:32 (4304): Status Report: virtualbox.exe/vboxheadless.exe is no longer running.
2014-07-02 07:05:36 (4304): VM is no longer is a running state. It is in 'poweroff'.
2014-07-02 07:05:36 (4304): VM state change detected. (old = 'running', new = 'poweroff')
2014-07-02 07:05:36 (4304): Powering off VM.
2014-07-02 07:05:36 (4304): Deregistering VM.
2014-07-02 07:05:36 (4304): Deleting stale snapshot.
2014-07-02 07:05:37 (4304): Removing network bandwidth throttle group from VM.
2014-07-02 07:05:38 (4304): Removing storage controller(s) from VM.
2014-07-02 07:05:38 (4304): Removing VM from VirtualBox.
2014-07-02 07:05:39 (4304): Removing virtual disk drive from VirtualBox.
2014-07-02 07:05:44 (4304): Virtual machine exited.

VM Guest Log:

07:05:44 (4304): called boinc_finish
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12_1</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>

----------------------------------------------------------------------------------------------------
Task 2:
http://www.rnaworld.de/rnaworld/result. ... d=14924755
http://www.rnaworld.de/rnaworld/workuni ... id=6330939
Name cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9
Created 12 Mar 2014, 20:34:38 UTC
Outcome Computation error
Client state Compute error
Run time 8,380,046.32
CPU time 8,345,739.00
Application version cmsearch VM (VirtualBox) 1.0.2 v1.12 (vbox64)
stderr:
<core_client_version>7.4.8</core_client_version>
<![CDATA[
<stderr_txt>
....
2014-07-02 07:02:32 (4800): Creating new snapshot for VM.
2014-07-02 07:02:32 (4800): Restoring VM Process priority.
2014-07-02 07:03:12 (4800): Lowering VM Process priority.
2014-07-02 07:03:13 (4800): Deleting stale snapshot.
2014-07-02 07:03:18 (4800): Checkpoint completed.
2014-07-02 07:05:36 (4800): VM is no longer is a running state. It is in 'poweroff'.
2014-07-02 07:05:36 (4800): VM state change detected. (old = 'running', new = 'poweroff')
2014-07-02 07:05:36 (4800): Powering off VM.
2014-07-02 07:05:36 (4800): Deregistering VM.
2014-07-02 07:05:37 (4800): Deleting stale snapshot.
2014-07-02 07:05:38 (4800): Removing network bandwidth throttle group from VM.
2014-07-02 07:05:38 (4800): Removing storage controller(s) from VM.
2014-07-02 07:05:38 (4800): Removing VM from VirtualBox.
2014-07-02 07:05:39 (4800): Removing virtual disk drive from VirtualBox.
2014-07-02 07:05:44 (4800): Virtual machine exited.

VM Guest Log:

07:05:44 (4800): called boinc_finish
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9_1</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>

----------------------------------------------------------------------------------------------------
Event Log:
7/2/2014 7:05:33 AM | | [work_fetch] Request work fetch: application exited
7/2/2014 7:05:33 AM | GPUGRID | Computation for task 112-NOELIA_2NDMG-0-3-RND1661_0 finished
7/2/2014 7:05:47 AM | | [work_fetch] Request work fetch: application exited
7/2/2014 7:05:47 AM | | [work_fetch] Request work fetch: application exited
7/2/2014 7:05:47 AM | | [work_fetch] Request work fetch: application exited
7/2/2014 7:05:47 AM | RNA World | Computation for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12 finished
7/2/2014 7:05:47 AM | RNA World | Output file cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12_0 for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12 absent
7/2/2014 7:05:47 AM | RNA World | Output file cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12_1 for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12 absent
7/2/2014 7:05:47 AM | RNA World | Computation for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9 finished
7/2/2014 7:05:47 AM | RNA World | Output file cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9_0 for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9 absent
7/2/2014 7:05:47 AM | RNA World | Output file cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9_1 for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9 absent
7/2/2014 7:05:47 AM | Test4Theory@Home | Computation for task wu_1401173805_41511_0 finished
7/2/2014 7:05:47 AM | Test4Theory@Home | Output file wu_1401173805_41511_0_0 for task wu_1401173805_41511_0 absent
7/2/2014 7:05:47 AM | DNA@Home | Requesting new tasks for CPU
7/2/2014 7:06:21 AM | GPUGRID | Started upload of 112-NOELIA_2NDMG-0-3-RND1661_0_0
7/2/2014 7:06:21 AM | GPUGRID | Started upload of 112-NOELIA_2NDMG-0-3-RND1661_0_1
7/2/2014 7:06:21 AM | GPUGRID | Started upload of 112-NOELIA_2NDMG-0-3-RND1661_0_2
7/2/2014 7:06:21 AM | GPUGRID | Started upload of 112-NOELIA_2NDMG-0-3-RND1661_0_3
7/2/2014 7:06:21 AM | GPUGRID | Starting task 4x9x1x13-NOELIA_THROMBIN1-1-3-RND3173_0
7/2/2014 7:06:21 AM | World Community Grid | Starting task E223344_870_K.23.C18FH11N2S2.00957077.2.set1d06_0
7/2/2014 7:06:21 AM | World Community Grid | Starting task E223344_341_K.24.C18FH9N4S.00877499.3.set1d06_0
7/2/2014 7:06:40 AM | GPUGRID | Finished upload of 112-NOELIA_2NDMG-0-3-RND1661_0_0
7/2/2014 7:06:40 AM | GPUGRID | Started upload of 112-NOELIA_2NDMG-0-3-RND1661_0_7
7/2/2014 7:06:42 AM | GPUGRID | Finished upload of 112-NOELIA_2NDMG-0-3-RND1661_0_7
7/2/2014 7:06:42 AM | GPUGRID | Started upload of 112-NOELIA_2NDMG-0-3-RND1661_0_9
...
7/2/2014 7:06:44 AM | GPUGRID | Finished upload of 112-NOELIA_2NDMG-0-3-RND1661_0_3
7/2/2014 7:06:44 AM | GPUGRID | Started upload of 112-NOELIA_2NDMG-0-3-RND1661_0_10
7/2/2014 7:06:47 AM | GPUGRID | Finished upload of 112-NOELIA_2NDMG-0-3-RND1661_0_10
...
7/2/2014 7:07:08 AM | GPUGRID | Finished upload of 112-NOELIA_2NDMG-0-3-RND1661_0_2
...
7/2/2014 7:07:12 AM | GPUGRID | Finished upload of 112-NOELIA_2NDMG-0-3-RND1661_0_1
...
7/2/2014 7:13:53 AM | GPUGRID | Finished upload of 112-NOELIA_2NDMG-0-3-RND1661_0_9
Zuletzt geändert von Jacob Klein am 02.07.2014 21:36, insgesamt 2-mal geändert.

ChristianB
Admin
Admin
Beiträge: 1920
Registriert: 23.02.2010 22:12

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#2 Ungelesener Beitrag von ChristianB » 02.07.2014 14:52

Jacob Klein hat geschrieben:My questions are:
- Do we have any idea what may have triggered this to happen? Maybe did one of them complete?
- Would you like to attempt anything with my clones? I actually have 17 clones of each task, at various completion levels.
- How can we/I prevent this problem in the future?
That's exactly the problem I'm focusing right now. I don't know what's happening here. This also happens with the 1.13 VM app and the logfiles are not helping. It seems that computation is interrupted inside the VM and the VM is shutting down without telling us why.

Can you rerun the latest clone and redirect the shared folder into a separate folder (outside of boinc directory). At the end of the computation there should be two files present inside the shared folder: stdout and out.zip those were missing from the 2 VMs thus the computation error. May guess is that something is "crashing" inside the VM, but it's difficult to find out what.

Jacob Klein
Brain-Bug
Brain-Bug
Beiträge: 564
Registriert: 26.07.2013 15:41

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#3 Ungelesener Beitrag von Jacob Klein » 02.07.2014 14:58

Yes, I will attempt to get you more information on this. It may be a couple days (since last snapshot is 2-3 days old).
Would you like a copy of my 2 clones? If you give me a place to send them, I can see if I can package them and upload them to you somehow.

Jacob Klein
Brain-Bug
Brain-Bug
Beiträge: 564
Registriert: 26.07.2013 15:41

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#4 Ungelesener Beitrag von Jacob Klein » 02.07.2014 15:11

When I attempt to edit the existing shared folder, or remove the existing shared folder, I get the following error.
Can you give me instructions on how to do this properly?

-----------
Failed to remove the shared folder shared (pointing to C:\ProgramData\BOINC\slots\4/shared) from the virtual machine boinc_8b569e4326cc68b6 Clone 17 - RNA World Clone 1.

Please close all programs in the guest OS that may be using this shared folder and try again.



The machine is not mutable (state is Saved).

Result Code: VBOX_E_INVALID_VM_STATE (0x80BB0002)
Component: SessionMachine
Interface: IMachine {480cf695-2d8d-4256-9c7c-cce4184fa048}

Jacob Klein
Brain-Bug
Brain-Bug
Beiträge: 564
Registriert: 26.07.2013 15:41

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#5 Ungelesener Beitrag von Jacob Klein » 02.07.2014 15:15

I figured it out -- I have to ensure the original shared folder exists, then while the VM is running, Edit it to use the new folder location, then save off a snapshot. I'll get both VMs going this way, and let them run, alongside my other BOINC activity, and let you know what I find.

ChristianB
Admin
Admin
Beiträge: 1920
Registriert: 23.02.2010 22:12

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#6 Ungelesener Beitrag von ChristianB » 02.07.2014 21:00

I just noticed that another host turned in two tasks with exit code 0 but no finished files. From the logfiles it seems to me that both tasks where finished at the same time. That's not possible or at least very very unlikely. Can you remember what you where doing at 2014-07-02 07:05:32 ?(I think it's your local time, are you UTC-4?) What I can tell from the logfile is that the vboxheadless.exe was not running anymore and the wrapper found out that the VM powered down (or at least it thought the VM was powered off).

Normally state changes only happen this way:
poweroff to running
running to paused
paused to running

if the state changes from running to poweroff the wrapper assumes that the computation has finished.

Jacob Klein
Brain-Bug
Brain-Bug
Beiträge: 564
Registriert: 26.07.2013 15:41

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#7 Ungelesener Beitrag von Jacob Klein » 02.07.2014 21:31

I was away from the machine. Nothing abnormal was running on the machine. I edited the original post to show the stdoutdae.txt entries that were near the failures, but it gives no clue as to what went wrong.

Jacob Klein
Brain-Bug
Brain-Bug
Beiträge: 564
Registriert: 26.07.2013 15:41

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#8 Ungelesener Beitrag von Jacob Klein » 02.07.2014 21:33

Taking a second look at it... It seems that I had 3 VMs fail at the same time. 2 big RNA World jobs, and 1 Test4Theory@Home job.
Output files were absent for all 3 VMs.

7/2/2014 7:05:47 AM | | [work_fetch] Request work fetch: application exited
7/2/2014 7:05:47 AM | | [work_fetch] Request work fetch: application exited
7/2/2014 7:05:47 AM | | [work_fetch] Request work fetch: application exited

7/2/2014 7:05:47 AM | RNA World | Computation for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12 finished
7/2/2014 7:05:47 AM | RNA World | Output file cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12_0 for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12 absent
7/2/2014 7:05:47 AM | RNA World | Output file cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12_1 for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Oryza-sativa-Japonica-Group_CM000142.lin.EMBL_RF00028_Intron_gpI_1349111823_57652_12 absent

7/2/2014 7:05:47 AM | RNA World | Computation for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9 finished
7/2/2014 7:05:47 AM | RNA World | Output file cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9_0 for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9 absent
7/2/2014 7:05:47 AM | RNA World | Output file cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9_1 for task cmsvm2_GA-p[e20-30MB_Lin64f]_1_Drosophila-melanogaster-(fruit-fly)_AE014297.lin.EMBL_RF00028_Intron_gpI_1349111823_13748_9 absent

7/2/2014 7:05:47 AM | Test4Theory@Home | Computation for task wu_1401173805_41511_0 finished
7/2/2014 7:05:47 AM | Test4Theory@Home | Output file wu_1401173805_41511_0_0 for task wu_1401173805_41511_0 absent

BobCat13
PDA-Benutzer
PDA-Benutzer
Beiträge: 40
Registriert: 17.02.2010 19:33

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#9 Ungelesener Beitrag von BobCat13 » 10.07.2014 04:00

ChristianB hat geschrieben:That's exactly the problem I'm focusing right now. I don't know what's happening here. This also happens with the 1.13 VM app and the logfiles are not helping. It seems that computation is interrupted inside the VM and the VM is shutting down without telling us why.

Can you rerun the latest clone and redirect the shared folder into a separate folder (outside of boinc directory). At the end of the computation there should be two files present inside the shared folder: stdout and out.zip those were missing from the 2 VMs thus the computation error. May guess is that something is "crashing" inside the VM, but it's difficult to find out what.
Don't want to go off topic here, but a question for you. Are the stdout and out.zip files only created upon successful task completion, or should they be present during the entire calculation?

Jacob Klein
Brain-Bug
Brain-Bug
Beiträge: 564
Registriert: 26.07.2013 15:41

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#10 Ungelesener Beitrag von Jacob Klein » 10.07.2014 04:25

So far as I know, once the input files are read in by the VM, then... the VM only writes the "progress.txt" file (every minutes or so?), until completion time, at which point the 2 other files are written.

So... For my 2 BOINC tasks that failed, I'm babysitting them (outside of BOINC) (since I had saved off backup copies using Oracle VM VirtualBox Manager), and am watching them write progress.txt (indicating that they are progressing), and awaiting those 2 files per VM.

I've had to work around a couple VirtualBox bugs (where copying the VM resulted in a loss of some of my backup copies), but... other than that, they're behaving, and I'm patiently waiting on results.

ChristianB
Admin
Admin
Beiträge: 1920
Registriert: 23.02.2010 22:12

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#11 Ungelesener Beitrag von ChristianB » 10.07.2014 08:04

Jacob is right. The stdout and out.zip are only present if the computation is finished. Basically is the computation itself fails there should be a stdout file at least. With the problem described here there is no such file which indicates that the VM is terminated by VB and not from the control script.

It also seems not to be a failure with the specific RNA/DNA combination as Jacobs failed tasks are beyond the point of failure right now.

Jacob Klein
Brain-Bug
Brain-Bug
Beiträge: 564
Registriert: 26.07.2013 15:41

Re: Assistance needed - 2 Long-running VMs just failed - Clo

#12 Ungelesener Beitrag von Jacob Klein » 13.07.2014 23:58

Christian,

I just wanted to chime in to say 2 things.

1) The tasks are still being processed without problem. When I want to use my machine for gaming, for both VMs I just right click the VM, and choose "Close -> Save State"... then later, just right-click and choose "Start". Works quite well. For backup purposes, I occasionally save off snapshots (about every 8 hours), and occasionally clone to a new VM (any time I want to change VirtualBox versions or change BOINC versions).

2) I'm able to sort of see that progress is being made, by monitoring the 2 "progress.txt" files. One of them always says "0.98765", but the other increments slowly and is currently at "0.857074". Since this is my only real means to monitor progress (and, presumably BOINC's too, if the tasks were still in BOINC), ..... I'd like to propose that you still increment beyond 0.98765. Even if it's something crazy like "increment 0.00001 every hour", that would be preferred over "never increment anymore". Because, in a couple weeks, both of my files here will say "0.98765", and I'll have no instrinsic way to know that progress is being made. (For instance, if I restore from a snapshot, how will I know that snapshot's progress level, as compared to what I had before restoring? We should have a granular way to indicate this, like 0.00001 hourly). So, please consider increasing progress even in this situation.

Regards,
Jacob

Zurück zu „RNA World Discussions (english)“