All virtual box wu fail?

Everything about the project RNA World
candido

All virtual box wu fail?

#1 Post by candido » 13.11.2013 21:49

I had 2 RNA VB WUs running for 400 h each with no problems, alongside 1 test4theory and 1 climateathome WU.
Climateathome completed two WUs successfully, and test4theory completed several (25 so far).
Recently I decided to reboot the machine and made sure that all VB WUs had stopped before the reboot.
After the reboot everything restarted fine.
The two RNA WUs ran for one more day, and then one failed.
The other RNA WU failed the next day.
I have done some research, and of all the users I looked at, only one completed VB WUs.
I am not inclined to run any more of these WUs.
I am still running one of those XXL WUs on another computer; they seem to be more reliable, even if they don't checkpoint.
I finished the one I was running, but neither of the 2 VB WUs.


I copied this from the Stderr output of one WU:
196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<message>
Maximum disk usage exceeded
</message>
<stderr_txt>
a\BOINC\slots\8\vm_image.vdi" (cb=42)



And this from the other:
196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<message>
Maximum disk usage exceeded
</message>
<stderr_txt>
<integer> = 0x0000000000000000 (0)

The machine:
i7 2630 M
Win 7 64-bit
BM 7.0.64
VB 4.2.2

Jacob Klein
Brain-Bug
Posts: 564
Registered: 26.07.2013 15:41

Re: All virtual box wu fail?

#2 Post by Jacob Klein » 14.11.2013 04:19

Candido:

Would you mind changing your settings so that your computers are not hidden? That way we can more easily view your tasks/results/problems.

Also, I'd like to share what I know.

The errors "Maximum disk usage exceeded" and "EXIT_DISK_LIMIT_EXCEEDED" are related to hitting a disk space limit. So either
a) you've configured BOINC to limit by disk space and it ran out
or
b) the task's pre-configured <rsc_disk_bound> configuration, which limits it by saying "If the task takes up more than this much disk space, consider the task a failure.", was exceeded.

I looked at my RNA World VM task's configuration for this (by carefully opening client_state.xml, and carefully searching for the rsc_disk_bound entry that matched the RNA World VM task)
Mine is set to:
<rsc_disk_bound>6000000000.000000</rsc_disk_bound>
... which is 6,000,000,000 bytes, or 5.588 GB.
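
As a quick check of that arithmetic, here is a minimal Python sketch. It assumes the "GB" figure used in this thread is the binary unit (2**30 bytes, often written GiB), which is how Windows reports sizes; the function names are made up for illustration.

```python
# Convert the <rsc_disk_bound> value (in bytes) into the units used in this thread.
RSC_DISK_BOUND = 6000000000.0  # value quoted above from client_state.xml


def bytes_to_gib(n_bytes: float) -> float:
    """Return n_bytes in binary gigabytes (GiB, 2**30 bytes each)."""
    return n_bytes / 2**30


def bytes_to_gb(n_bytes: float) -> float:
    """Return n_bytes in decimal gigabytes (GB, 10**9 bytes each)."""
    return n_bytes / 10**9


print(f"{bytes_to_gib(RSC_DISK_BOUND):.3f} GiB")  # binary units, as Windows shows them
print(f"{bytes_to_gb(RSC_DISK_BOUND):.3f} GB")    # decimal units
```

So the limit is exactly 6 decimal GB, which shows up as roughly 5.588 when a tool counts in binary units.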

Also, I've noticed that the stderr.txt file (in the task's slot directory) can grow quite large (possibly unbounded?) for these VM tasks. I read another post about it (here: viewtopic.php?f=75&t=13127#p143557), and have also noticed that, after 272 hours, the stderr.txt file for my task is 220 MB.

I took a peek at the file and here's what I found:
Sometimes (not all the time, just sometimes, maybe after being paused/resumed?), every 10 minutes when a snapshot is taken, I see *3* "CFGM dump" blocks that look like:
15:05:02.811139 ************************* CFGM dump *************************
...
... 50 lines of stuff
...
15:05:02.811180 ********************* End of CFGM dump **********************

So, that's something like 150-160 lines, every 10 minutes, for a lot of the time within the log file.
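
To put a number on how much of a log those blocks take up, something like the following Python sketch could be used. It is only an illustration: the two marker patterns are guessed from the two lines quoted above, and the sample lines are invented, not taken from a real stderr.txt.

```python
import re

# Marker patterns inferred from the two lines quoted above (an assumption,
# not taken from VirtualBox or vboxwrapper documentation).
START = re.compile(r"\*+ CFGM dump \*+")
END = re.compile(r"\*+ End of CFGM dump \*+")


def cfgm_dump_stats(lines):
    """Return (number of dump blocks, lines inside blocks including markers)."""
    blocks = 0
    dump_lines = 0
    inside = False
    for line in lines:
        if START.search(line):
            inside = True
            blocks += 1
        if inside:
            dump_lines += 1
        if END.search(line):
            inside = False
    return blocks, dump_lines


# Invented sample: one dump block of four lines between two ordinary lines.
sample = [
    "15:05:01.000000 some ordinary log line",
    "15:05:02.811139 ***** CFGM dump *****",
    "15:05:02.811150 first line of stuff",
    "15:05:02.811160 second line of stuff",
    "15:05:02.811180 ***** End of CFGM dump *****",
    "15:05:03.000000 another ordinary log line",
]
print(cfgm_dump_stats(sample))
```

For a real file you would pass an open file object instead of the list, for example `open(path, errors="replace")`, where `path` points at the stderr.txt in your slot directory.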

My task (http://www.rnaworld.de/rnaworld/result. ... d=14921181)
... is being processed using an older version of the application: cmsearch VM (VirtualBox) 1.0.2 v1.03 (vbox64)
... and I think Christian said he may have found something which MIGHT fix this problem for tasks that are worked in the new application: v1.06 (vbox64)

I'm a little concerned for my task, and will be monitoring to make sure that the size of the slot directory doesn't approach that 5.588 GB limit.
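
One way to do that monitoring is sketched below in Python. It simply adds up the file sizes under a directory and compares the total to the 6,000,000,000-byte bound; how BOINC itself measures a task's disk usage may differ, and the 80% warning threshold and the function names are arbitrary choices for illustration.

```python
import os

LIMIT_BYTES = 6_000_000_000  # rsc_disk_bound value quoted in this thread


def directory_size(path):
    """Sum the sizes (in bytes) of all files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file disappeared or cannot be read; skip it
    return total


def usage_report(path, limit=LIMIT_BYTES, warn_fraction=0.8):
    """Return (bytes used, fraction of limit used, True if above the warning level)."""
    used = directory_size(path)
    return used, used / limit, used >= warn_fraction * limit
```

You would call `usage_report()` with the path of the task's slot directory (the one that holds vm_image.vdi and stderr.txt) and check the last value of the result from time to time.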

Questions I'd like to see answered (by Christian) are:
- If I make sure it stays below 5.59 GB, and the task finishes, will we be good?
- Upon completion, does it only upload part of the stderr.txt file, hopefully, or does it try to upload the whole thing?
- Should I be doing anything, like manipulating/clearing that file while BOINC is closed, in order to keep it small?
- Is there a fix for this problem?

Regards,
Jacob

Jacob Klein
Brain-Bug
Posts: 564
Registered: 26.07.2013 15:41

Re: All virtual box wu fail?

#3 Post by Jacob Klein » 14.11.2013 15:58

Here is some more information I received in private communications with Christian, regarding that "CFGM dump" and the stderr.txt size:

He said:
That's the problem that leads to the big stderr.txt. In order to keep it under the 6GB limit you have to prune the stderr.txt from time to time and/or increase the rsc_disk_bound in your client_state.xml to 10GB. The pruning can be done while BOINC is running; the client_state.xml should only be changed when BOINC is closed. This should make sure that the task can run until finished.

I said:
If I don't prune it, and it is able to complete successfully under the 5.58GB limit, will it try to upload the whole stderr.txt file?

He said:
No, it just sends the last 16 KB or so. That's not a problem.

-------------------------------------------------

Long story short, make sure your slot directory stays below 5.58 GB, and then when the task completes it should upload successfully.
If you are approaching the size limit, to be on the safe side, you should prune the stderr.txt file (maybe replace it with a 0-byte file).
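
For anyone who would rather script the pruning, here is a rough Python sketch. It shrinks a file in place to its last few bytes (16 KB by default, since that is roughly what gets uploaded anyway), or empties it completely when `keep_bytes=0`. The function name is made up, and whether this works while the task still has the file open is an assumption: on Windows the open may fail if the file is locked, so treat it as untested against a live task.

```python
import os


def prune_log(path, keep_bytes=16 * 1024):
    """Shrink the file at path to its last keep_bytes bytes; return the new size."""
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size <= keep_bytes:
            return size  # already small enough, leave it alone
        f.seek(size - keep_bytes)
        tail = f.read()  # empty when keep_bytes is 0
        f.seek(0)
        f.write(tail)
        f.truncate()  # cut the file off right after the kept tail
        return len(tail)
```

Calling `prune_log(path, keep_bytes=0)` gives the 0-byte file mentioned above, with `path` being the stderr.txt inside the task's slot directory.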

Regards,
Jacob

ChristianB
Admin
Posts: 1920
Registered: 23.02.2010 22:12

Re: All virtual box wu fail?

#4 Post by ChristianB » 14.11.2013 17:57

The EXIT_DISK_LIMIT_EXCEEDED error for those older tasks is caused in part by the growing stderr.txt, which in turn is caused by the old vboxwrapper version and some buggy input files. With a newer vboxwrapper and repaired input files this kind of error shouldn't happen anymore. Because of some BOINC restrictions I can't cancel and abort the task from the server. I can cancel unstarted tasks but not already running ones. So it's OK to abort all older cmsearch VM tasks so I can repair the input files and reissue new tasks with the new cmsearch VM application version. This version is much more stable thanks to users like you, who contribute to this project. If you want some credits for those aborted jobs, you may tell me so via PM and I will see that we add some to your account.

Michael H.W. Weber
Vereinsvorstand (association board)
Posts: 22996
Registered: 07.01.2002 01:00
Location: Marpurk

Re: All virtual box wu fail?

#5 Post by Michael H.W. Weber » 16.11.2013 13:16

Are you sure you are running the latest RNA World VM apps? Please reset the system to make sure. This whole story reminds me of a problem that I encountered weeks ago, before we introduced a number of fixes.

Michael.
Promote, cooperate and construct instead of demand, compete and consume.

candido

Re: All virtual box wu fail?

#6 Post by candido » 17.11.2013 16:42

Thanks.
I have reset the project.
In fact, it seems I was running an old version: cmsearch VM (VirtualBox) 1.0.2 v1.03 (vbox64)
I checked the applications page and the new one seems to be v1.06.
Now I just need a WU (and a bit of luck).
candido

candido

Re: All virtual box wu fail?

#7 Post by candido » 17.11.2013 20:10

Jacob Klein wrote:
Candido:

The errors "Maximum disk usage exceeded" and "EXIT_DISK_LIMIT_EXCEEDED" are related to hitting a disk space limit. So either
a) you've configured BOINC to limit by disk space and it ran out
or
b) the task's pre-configured <rsc_disk_bound> configuration, which limits it by saying "If the task takes up more than this much disk space, consider the task a failure.", was exceeded.

I looked at my RNA World VM task's configuration for this (by carefully opening client_state.xml, and carefully searching for the rsc_disk_bound entry that matched the RNA World VM task)
Mine is set to:
<rsc_disk_bound>6000000000.000000</rsc_disk_bound>
... which is 6,000,000,000 bytes, or 5.588 GB.

Jacob

Hi Jacob,
Thanks for the info you provided.
I have (and previously had) 25 GB free for BOINC to use. It is currently using approx. 10 GB.
I looked into client_state.xml and found several lines like that for other projects,
but there is no such line for RNA.
Does that mean there is no limit to the size of the file?
Or does it mean my client_state.xml is broken?
By the way, is it OK to open client_state.xml while BOINC Manager is running (with extreme care not to change anything)?
candido

Jacob Klein
Brain-Bug
Posts: 564
Registered: 26.07.2013 15:41

Re: All virtual box wu fail?

#8 Post by Jacob Klein » 17.11.2013 23:48

The line should be in there somewhere - were you searching for "rsc_disk_bound"?
Anyway, I believe client_state.xml is safe to open, even while BOINC is running, so long as you do NOT make any changes.
Regarding your tasks, I'm betting the slot directory just grew in size (due to the stderr.txt bug that's supposedly fixed in v1.06), and once they hit that 5.59 GB limit, they errored out.
Can you please edit your RNA World project preferences to show your computers? Currently they are hidden :(

Thanks,
Jacob

candido

Re: All virtual box wu fail?

#9 Post by candido » 18.11.2013 00:17

Hi, you can see my computers now.
I searched for that exact expression and found it, several times, but not for RNA...

Jacob Klein
Brain-Bug
Posts: 564
Registered: 26.07.2013 15:41

Re: All virtual box wu fail?

#10 Post by Jacob Klein » 18.11.2013 01:01

It would be within a block for one of your RNA tasks (workunits), for app_name cmsearch3.

For instance, mine was:

<workunit>
<name>cmsvm_GA-p[e20-30MB_Lin64f]_1_Drosophila-simulans_CM000364.lin.EMBL_RF00177_SSU_rRNA_5_1349111823_24869</name>
<app_name>cmsearch3</app_name>
<version_num>103</version_num>
<rsc_fpops_est>13901257439556300.000000</rsc_fpops_est>
<rsc_fpops_bound>200000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>4000000000.000000</rsc_memory_bound>
<rsc_disk_bound>6000000000.000000</rsc_disk_bound>
<command_line>
--trickle 14400
</command_line>
<file_ref>
<file_name>cmsvm_GA-p[e20-30MB_Lin64f]_1_Drosophila-simulans_CM000364.lin.EMBL_RF00177_SSU_rRNA_5_1349111823_24869-in.zip</file_name>
<open_name>in.zip</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>cmsvm_GA-p[e20-30MB_Lin64f]_1_Drosophila-simulans_CM000364.lin.EMBL_RF00177_SSU_rRNA_5_1349111823_24869-para.zip</file_name>
<open_name>para.zip</open_name>
<copy_file/>
</file_ref>
</workunit>
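
If searching by hand is awkward, a read-only lookup along these lines might help. It is a minimal Python sketch that assumes the file's contents parse as well-formed XML with a single root element; `disk_bounds` is a made-up helper name, and the tag names are the ones visible in the block above. It only reads the text and never writes the file.

```python
import xml.etree.ElementTree as ET


def disk_bounds(xml_text, app_name="cmsearch3"):
    """Map workunit name -> rsc_disk_bound (bytes) for workunits of app_name."""
    root = ET.fromstring(xml_text)
    result = {}
    for wu in root.iter("workunit"):
        if wu.findtext("app_name", default="").strip() != app_name:
            continue  # workunit belongs to another application
        bound = wu.findtext("rsc_disk_bound")
        if bound is not None:
            result[wu.findtext("name", default="").strip()] = float(bound)
    return result
```

To try it, read your client_state.xml into a string (for example with `open(path).read()`, where `path` is wherever your BOINC data directory keeps the file) and pass that string to `disk_bounds()`. An empty result would mean there is currently no cmsearch3 workunit in the file.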

candido

Re: All virtual box wu fail?

#11 Post by candido » 18.11.2013 02:10

I only have this

</code_sign_key>
</project>
<project>
<master_url>http://www.rnaworld.de/rnaworld/</master_url>
<project_name>RNA World</project_name>
<symstore></symstore>
<user_name>candido</user_name>
<team_name>Portugal@Home</team_name>
<host_venue></host_venue>
<email_hash>...</email_hash>
<cross_project_id>...</cross_project_id>
<cpid_time>1381200180.000000</cpid_time>
<user_total_credit>15472.128290</user_total_credit>
<user_expavg_credit>219.144989</user_expavg_credit>
<user_create_time>1381200180.000000</user_create_time>
<rpc_seqno>343</rpc_seqno>
<userid>10455</userid>
<teamid>702</teamid>
<hostid>27211</hostid>
<host_total_credit>0.000000</host_total_credit>
<host_expavg_credit>0.000000</host_expavg_credit>
<host_create_time>1381201716.000000</host_create_time>
<nrpc_failures>0</nrpc_failures>
<master_fetch_failures>0</master_fetch_failures>
<min_rpc_time>1384663376.487053</min_rpc_time>
<next_rpc_time>1384681315.487053</next_rpc_time>
<rec>497.866455</rec>
<rec_time>1384736445.178345</rec_time>
<resource_share>100.000000</resource_share>
<desired_disk_usage>0.000000</desired_disk_usage>
<duration_correction_factor>1.000000</duration_correction_factor>
<sched_rpc_pending>0</sched_rpc_pending>
<send_time_stats_log>0</send_time_stats_log>
<send_job_log>0</send_job_log>
<dont_use_dcf/>
<verify_files_on_app_start/>
<dont_request_more_work/>
<attached_via_acct_mgr/>
<rsc_backoff_time>
<name>CPU</name>
<value>0.000000</value>
</rsc_backoff_time>
<rsc_backoff_interval>
<name>CPU</name>
<value>0.000000</value>
</rsc_backoff_interval>
<rsc_backoff_time>
<name>NVIDIA</name>
<value>0.000000</value>
</rsc_backoff_time>
<rsc_backoff_interval>
<name>NVIDIA</name>
<value>0.000000</value>
</rsc_backoff_interval>
<no_rsc_apps>NVIDIA</no_rsc_apps>
<scheduler_url>http://www.rnaworld.de/rnaworld_cgi/cgi</scheduler_url>
<code_sign_key>


There are some strange numbers, and then this:

</code_sign_key>
</project>
<file>
<name>stat_icon_01.png</name>
<nbytes>0.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<md5_cksum>fd317e48057a4f587b2a8b02a3a9aea4</md5_cksum>
<status>1</status>
<download_url>http://www.rnaworld.de/rnaworld/downloa ... wnload_url>
</file>
<file>
<name>rkn.png</name>
<nbytes>0.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<md5_cksum>1940d60abe4405e2c32cabf71c45dcdb</md5_cksum>
<status>1</status>
<download_url>http://www.rnaworld.de/rnaworld/downloa ... wnload_url>
</file>
<file>
<name>MW-6SRNAFragment_mini_txt.png</name>
<nbytes>0.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<md5_cksum>54eeed67199dd327c34d44a390df2190</md5_cksum>
<status>1</status>
<download_url>http://www.rnaworld.de/rnaworld/downloa ... wnload_url>
</file>
<file>
<name>PDB-1EGK_PMID-10864501_FourWayJunction_mini_txt.png</name>
<nbytes>0.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<md5_cksum>f2c81be77ca723b7f9eabdaa10b19651</md5_cksum>
<status>1</status>
<download_url>http://www.rnaworld.de/rnaworld/downloa ... wnload_url>
</file>
<file>
<name>PDB-1Q8N_PMID-10617571_MalachiteGreenAptamer_mini_txt.png</name>
<nbytes>0.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<md5_cksum>7f10da7a04944d69e15583dcdeb2ee1a</md5_cksum>
<status>1</status>
<download_url>http://www.rnaworld.de/rnaworld/downloa ... wnload_url>
</file>
<file>
<name>PDB-1SJ3_PMID-15141216_HDVRibozyme_mini_txt.png</name>
<nbytes>0.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<md5_cksum>6988a9452855404c01e09f0642e1f800</md5_cksum>
<status>1</status>
<download_url>http://www.rnaworld.de/rnaworld/downloa ... wnload_url>
</file>
<file>
<name>PDB-1U9S_PMID-15459389_S-DomainA-TypeRNaseP_mini_txt.png</name>
<nbytes>0.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<md5_cksum>a3ca447bfda28be98cb50a6aba14074f</md5_cksum>
<status>1</status>
<download_url>http://www.rnaworld.de/rnaworld/downloa ... wnload_url>
</file>
<file>
<name>PDB-2A2E_PMID-16113684_A-typeRNaseP_mini_txt.png</name>
<nbytes>0.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<md5_cksum>6db33aa67e5bebb0b401d235cfb27ea6</md5_cksum>
<status>1</status>
<download_url>http://www.rnaworld.de/rnaworld/downloa ... wnload_url>
</file>
<file>
<name>PDB-2NZ4_PMID-17196404_GlmS-GlcN6P_mini_txt.png</name>
<nbytes>0.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<md5_cksum>6e086f4176a36365f1232852f00d1d14</md5_cksum>
<status>1</status>
<download_url>http://www.rnaworld.de/rnaworld/downloa ... wnload_url>
</file>
<project_files>
<file_ref>
<file_name>stat_icon_01.png</file_name>
<open_name>stat_icon</open_name>
</file_ref>
<file_ref>
<file_name>rkn.png</file_name>
<open_name>slideshow_01</open_name>
</file_ref>
<file_ref>
<file_name>MW-6SRNAFragment_mini_txt.png</file_name>
<open_name>slideshow_02</open_name>
</file_ref>
<file_ref>
<file_name>PDB-1EGK_PMID-10864501_FourWayJunction_mini_txt.png</file_name>
<open_name>slideshow_03</open_name>
</file_ref>
<file_ref>
<file_name>PDB-1Q8N_PMID-10617571_MalachiteGreenAptamer_mini_txt.png</file_name>
<open_name>slideshow_04</open_name>
</file_ref>
<file_ref>
<file_name>PDB-1SJ3_PMID-15141216_HDVRibozyme_mini_txt.png</file_name>
<open_name>slideshow_05</open_name>
</file_ref>
<file_ref>
<file_name>PDB-1U9S_PMID-15459389_S-DomainA-TypeRNaseP_mini_txt.png</file_name>
<open_name>slideshow_06</open_name>
</file_ref>
<file_ref>
<file_name>PDB-2A2E_PMID-16113684_A-typeRNaseP_mini_txt.png</file_name>
<open_name>slideshow_07</open_name>
</file_ref>
<file_ref>
<file_name>PDB-2NZ4_PMID-17196404_GlmS-GlcN6P_mini_txt.png</file_name>
<open_name>slideshow_08</open_name>
</file_ref>
</project_files>

Jacob Klein
Brain-Bug
Posts: 564
Registered: 26.07.2013 15:41

Re: All virtual box wu fail?

#12 Post by Jacob Klein » 18.11.2013 02:26

You need to be looking for a <workunit> block, that starts with <workunit> and ends with </workunit>.
Within that <workunit> block, there should be a line that says:
<app_name>cmsearch3</app_name>
... and also a line that defines the <rsc_disk_bound>

It really doesn't matter, though.

Just make sure that your slot directory stays below 5.58 GB.
Presumably if you are running a v1.06 task, then the problem (erroneously writing VM info to stderr.txt) will have been fixed anyway.
