New Unit. Strange (maybe?) behavior.

Everything about the project RNA World
ChristianB
Admin
Posts: 1920
Joined: 23.02.2010 22:12

Re: New Unit. Strange (maybe?) behavior.

#13 Post by ChristianB » 03.02.2016 18:19

Your stderr.txt output does not look "normal". The forecast could not be calculated because the data it needs was printed to the log instead of being written to a file where the control script could use it. That should not affect the science. What can affect the science are the write errors later on. Can you please post the resultid of this task? I will take a look at the input files. Because of the write issue, I suggest that you abort the task and let someone else try.

Gibson Praise
Mikrocruncher
Posts: 16
Joined: 31.01.2016 01:53
Location: Hiding from the Syndicate

Re: New Unit. Strange (maybe?) behavior.

#14 Post by Gibson Praise » 04.02.2016 18:58

Michael H.W. Weber wrote:
Jacob Klein wrote: Might I recommend updating to the latest version of BOINC? I think I see that you're running 7.6.9, but 7.6.22 is the current recommended version. It might help the progress problem, but it might not. Still worth upgrading.
Maybe Christian should first comment on whether upgrading BOINC might negatively affect one of these long-running tasks? I, too, have several extremely long-running tasks in progress and use the same "outdated" BOINC version, because so far I have not dared to switch to a newer one...

Michael.
You are exactly right Michael, that's the fear I have. Even upgrading the client (which I logically know has very little chance of tossing a spanner in the works), much less upgrading vbox from 4 to 5 ( :roll2: ) is just not worth it.

Now if somebody says they have done this successfully .. great! :smoking: I might try it.

But given that this workunit has had 55 unsuccessful attempts so far ( My first wingman died today ) over 2+ years, I'd really like to finish it.

Gibson Praise
Mikrocruncher
Posts: 16
Joined: 31.01.2016 01:53
Location: Hiding from the Syndicate

Re: New Unit. Strange (maybe?) behavior.

#15 Post by Gibson Praise » 05.02.2016 06:57

Gibson Praise wrote: But given that this workunit has had 55 unsuccessful attempts so far ( My first wingman died today ) over 2+ years, I'd really like to finish it.
{ I think quoting yourself on a forum is a sign of serious mental maladjustment .. } :uhoh:

So maladjusted or not I have a question ..

"error while computing" is the most common error listed as terminating a workunit (certainly this workunit). Are there any tips at all on avoiding this dreaded message? :o

Michael H.W. Weber
Vereinsvorstand
Posts: 22419
Joined: 07.01.2002 01:00
Location: Marpurk

Re: New Unit. Strange (maybe?) behavior.

#16 Post by Michael H.W. Weber » 05.02.2016 09:43

First of all, make sure you have enough memory: each RNA World WU reserves 3 GB, to which you need to add the OS requirements (approx. 1 GB for Win 7) plus whatever your other activities on that machine require. Everything else has worked quite stably so far.
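
The memory rule of thumb above can be turned into a quick check. This is only a sketch: the 3 GB per WU and roughly 1 GB for the OS come from this post, while the function name and the `other_gb` allowance for your own applications are made up for illustration.

```python
def max_concurrent_wus(total_ram_gb, os_gb=1.0, other_gb=0.0, per_wu_gb=3.0):
    """Estimate how many RNA World WUs fit in RAM at the same time.

    per_wu_gb and os_gb follow the figures quoted in the post;
    other_gb is a hypothetical allowance for everything else you run.
    """
    available = total_ram_gb - os_gb - other_gb
    if available < per_wu_gb:
        return 0
    return int(available // per_wu_gb)


if __name__ == "__main__":
    # An 8 GB machine with 1 GB for the OS and 1 GB for other programs.
    print(max_concurrent_wus(8, os_gb=1.0, other_gb=1.0))
```

If the result is 0, the machine is likely to run short of memory even with a single WU.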

Michael.
Support, cooperate, and construct instead of demanding, competing, and consuming.


Xe120
Fingerzähler
Posts: 2
Joined: 07.02.2016 13:14

Re: New Unit. Strange (maybe?) behavior.

#17 Post by Xe120 » 07.02.2016 13:24

I have the same thing happening with WU 6330807.
It is stuck at 0.1%, but the progress file does advance.
Today, though, the progress file shows 0.0000.
Looking at the stderr, I see the read-only error.

Code:

2016-02-07 11:23:06 (5464): Creating new snapshot for VM.
2016-02-07 11:23:24 (5464): Deleting stale snapshot.
2016-02-07 11:23:24 (5464): Deleting stale snapshot.
2016-02-07 11:23:24 (5464): Deleting stale snapshot.
2016-02-07 11:23:24 (5464): Deleting stale snapshot.
2016-02-07 11:23:24 (5464): Deleting stale snapshot.
2016-02-07 11:23:24 (5464): Checkpoint completed.
2016-02-07 11:24:34 (5464): Guest Log: expr: write error: Read-only file system
2016-02-07 11:24:34 (5464): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-02-07 11:25:04 (5464): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
I have Windows 10, a Core i7-4790K, 8 GB of RAM, and BOINC 7.6.22 with VirtualBox 5.0.10.

What should I do: abort it, or will it be OK if I let it run?

Gibson Praise
Mikrocruncher
Posts: 16
Joined: 31.01.2016 01:53
Location: Hiding from the Syndicate

Re: New Unit. Strange (maybe?) behavior.

#18 Post by Gibson Praise » 09.02.2016 08:15

Xe120 wrote: I have the same thing happening with WU 6330807.
It is stuck at 0.1%, but the progress file does advance.
Today, though, the progress file shows 0.0000.
Looking at the stderr, I see the read-only error.

Code:

2016-02-07 11:23:24 (5464): Checkpoint completed.
2016-02-07 11:24:34 (5464): Guest Log: expr: write error: Read-only file system
2016-02-07 11:24:34 (5464): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-02-07 11:25:04 (5464): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
I have Windows 10, a Core i7-4790K, 8 GB of RAM, and BOINC 7.6.22 with VirtualBox 5.0.10.

What should I do: abort it, or will it be OK if I let it run?
:(
Suspend it at minimum. I suspect that Christian will advise you to abort it, though he may wish to look at the associated logs first.

Wiki
Fingerzähler
Posts: 1
Joined: 09.02.2016 22:26

Re: New Unit. Strange (maybe?) behavior.

#19 Post by Wiki » 09.02.2016 22:56

I used to have the same issue, see for instance this extract from my WU 6330801 logfile:

Code:

2016-01-28 19:59:38 (5768): Status Report: Elapsed Time: '90060.420721'
2016-01-28 19:59:38 (5768): Status Report: CPU Time: '89637.971799'
2016-01-28 20:03:18 (5768): Creating new snapshot for VM.
2016-01-28 20:03:31 (5768): Deleting stale snapshot.
2016-01-28 20:03:31 (5768): Deleting stale snapshot.
2016-01-28 20:03:31 (5768): Deleting stale snapshot.
2016-01-28 20:03:31 (5768): Deleting stale snapshot.
2016-01-28 20:03:31 (5768): Deleting stale snapshot.
2016-01-28 20:03:31 (5768): Checkpoint completed.
2016-01-28 20:04:32 (5768): Guest Log: expr: write error: Read-only file system
2016-01-28 20:04:32 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-01-28 20:05:02 (5768): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
2016-01-28 20:05:02 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-01-28 20:05:32 (5768): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
2016-01-28 20:05:32 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-01-28 20:06:02 (5768): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
2016-01-28 20:06:02 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-01-28 20:06:32 (5768): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
2016-01-28 20:06:32 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system


It seems the WU was running properly and then suddenly failed (after roughly one day of computation). Trying to suspend and restart the task through BOINC is useless, since the VM does not actually start (this can be checked in VirtualBox). Starting the VM directly through VirtualBox also fails, with a filesystem error on the snapshot.

I checked the VirtualBox bug tracker, and it seems that the snapshot feature (which the project uses to manage the 'checkpoints') is broken in VirtualBox 5.0.x. I can't remember the bug ID, but it was supposed to be fixed in release 5.0.12 ... except that I was using that version and it wasn't :o

Downgrading VirtualBox to the latest 4.3 release (4.3.36) seems to have solved the issue so far; I have now been crunching 2 WUs for 8 days without problems :D

As mentioned in a previous post, the task progress is not reported to BOINC (which is stuck at 0.1%), but the 'progress.txt' file is properly updated.
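
For anyone who wants to check their own log for the same symptom, here is a minimal sketch that scans log lines for the "Read-only file system" guest messages quoted in this thread. The pattern is based only on the lines shown above; the function names are made up, and the `stderr.txt` path in the usage note is an assumption about where your copy of the log lives.

```python
import re

# Matches guest log lines such as:
#   "... Guest Log: expr: write error: Read-only file system"
READ_ONLY = re.compile(r"Guest Log: .*Read-only file system")


def first_read_only_error(lines):
    """Return the first line reporting a read-only guest filesystem, or None."""
    for line in lines:
        if READ_ONLY.search(line):
            return line.rstrip("\n")
    return None


def count_read_only_errors(lines):
    """Count how many lines report a read-only guest filesystem."""
    return sum(1 for line in lines if READ_ONLY.search(line))


if __name__ == "__main__":
    sample = [
        "2016-01-28 20:03:31 (5768): Checkpoint completed.",
        "2016-01-28 20:04:32 (5768): Guest Log: expr: write error: Read-only file system",
        "2016-01-28 20:04:32 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system",
    ]
    print(first_read_only_error(sample))
    print(count_read_only_errors(sample))
```

To run it against a real log, read the file into a list of lines first, for example with `open("stderr.txt").readlines()`, adjusting the path to your own setup.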

Jacob Klein
Brain-Bug
Posts: 564
Joined: 26.07.2013 15:41

Re: New Unit. Strange (maybe?) behavior.

#20 Post by Jacob Klein » 14.02.2016 14:32

For those tracking this thread...

Christian mentioned that he updated the application to v1.17, which includes an updated VBoxWrapper, and a new vbox.job file that should allow progress to report correctly in the UI, instead of constantly reporting 0.1% progress. Prior versions are still valid for getting work done, though, so DON'T abort your 1.14/1.15/1.16 tasks!

Note: If progress appears to get stuck at 98.765%, it is actually still progressing, so DON'T abort it. The estimate was just way too low, but your work is still valid. You can confirm that it is progressing because the modified date of the progress.txt file should keep getting updated, even if the value inside stays at 0.98765. If you want to know an even crazier way to verify progress, you can PM me and I'll explain a riskier technical way.
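
The modified-date check described above can be scripted. This is a minimal sketch: the path to progress.txt depends on your BOINC setup, so it is passed in as a parameter, and the one-hour threshold is purely an assumption, since the thread does not say how often the file is rewritten.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Seconds elapsed since the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_alive(path, max_age_seconds=3600, now=None):
    """True if the file was modified within max_age_seconds.

    The 3600-second default is an arbitrary assumption, not a project value.
    """
    return seconds_since_update(path, now) <= max_age_seconds
```

Call `looks_alive()` with the full path to the task's progress.txt. A `False` result only means the file has not changed recently, so compare it against how often you have seen the file update on your own machine before drawing conclusions.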

Thanks,
Jacob

Gibson Praise
Mikrocruncher
Posts: 16
Joined: 31.01.2016 01:53
Location: Hiding from the Syndicate

Re: New Unit. Strange (maybe?) behavior.

#21 Post by Gibson Praise » 04.04.2016 04:13

I was really hoping I would not have to post again on this thread.

However, this unit persists in being odd. The concern now is that the value in "progress.txt" has reached "0.999771". From what I have read, it should have stopped and held steady at 0.98765. I suspended the unit. It was racking up CPU time and regularly updating progress.txt in reasonable increments.

Being a paranoid individual, my concern is what happens when this reaches .999999 or 1.0 or ... well I am not sure what is going to happen :worry: and thought I would wait for comments from Jacob or Christian before plunging madly ahead. :evil2:

Jacob Klein
Brain-Bug
Posts: 564
Joined: 26.07.2013 15:41

Re: New Unit. Strange (maybe?) behavior.

#22 Post by Jacob Klein » 04.04.2016 04:30

Gibson:

Let it crunch, my friend! The way the code logic is written, I believe "progress.txt" will show a "6-decimal" readout up to "0.999999", then reset to a constant "0.98765" until finished. It's not "odd"; it's just a not-so-robust way to show progress, I guess :/ I urged Christian to implement something better months ago, but it didn't happen.

So ... as long as it keeps utilizing your CPU, and as long as the progress.txt modified date keeps getting updated, and you aren't getting any errors in the BOINC UI for the task ... then just let it run! It could take a LONG time! Some of my units have been at 0.98765 for 50 days already :)

Here's one of my 1.15 units, which shows the transition, and is still crunching:
Progress:
0d -- 8/13/2015
159d -- 2/7/2016 -- UI 62.7%, progress.txt 0.850912
166d -- 2/14/2016 -- UI 64.253%, progress.txt 0.887791
172d -- 2/20/2016 -- UI 65.530%, progress.txt 0.919677
177d -- 2/26/2016 -- UI 66.648%, progress.txt 0.948510
181.7d -- 3/1/2016 -- UI 67.443%, progress.txt 0.969637
185.9d -- 3/5/2016 -- UI 68.275%, progress.txt 0.991864
192.2d -- 3/12/2016 -- UI 69.492%, progress.txt 0.98765
198.3d -- 3/18/2016 -- UI 70.624%, progress.txt 0.98765
206.4d -- 3/27/2016 -- UI 72.055%, progress.txt 0.98765
212.2d -- 4/2/2016 -- UI 73.042%, progress.txt 0.98765
Here's where I'm tracking my progress, while giving insights into anything neat I find about calculating estimates:
viewtopic.php?f=75&t=16160
... which I think you might like to read.
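
To get a feel for the pace in the readings above, one can compute the average change per day between the first and last reading of each series. This is a rough sketch only: the numbers are copied from the table, the helper name is made up, and treating progress as linear is an assumption that the readings themselves only loosely support.

```python
# Readings copied from the table above: (elapsed days, value).
# Only progress.txt readings taken before the reset to 0.98765 are used.
PROGRESS_TXT = [
    (159.0, 0.850912),
    (166.0, 0.887791),
    (172.0, 0.919677),
    (177.0, 0.948510),
    (181.7, 0.969637),
    (185.9, 0.991864),
]

# First and last BOINC UI readings from the same table, in percent.
UI_PERCENT = [
    (159.0, 62.7),
    (212.2, 73.042),
]


def average_rate(readings):
    """Average change in value per day between the first and last reading."""
    d0, v0 = readings[0]
    d1, v1 = readings[-1]
    if d1 <= d0:
        raise ValueError("readings must span a positive number of days")
    return (v1 - v0) / (d1 - d0)


if __name__ == "__main__":
    print(f"progress.txt: {average_rate(PROGRESS_TXT):.6f} per day")
    print(f"BOINC UI:     {average_rate(UI_PERCENT):.4f} percentage points per day")
```

Neither rate should be read as a completion forecast, since the progress.txt value resets to 0.98765 and the UI estimate is known to be unreliable.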

Since your task
http://www.rnaworld.de/rnaworld/workuni ... id=6330836
... has no wingman that has completed it yet, you're blazing new ground, so keep up the good work -- keep that system stable!

Gibson Praise
Mikrocruncher
Posts: 16
Joined: 31.01.2016 01:53
Location: Hiding from the Syndicate

Re: New Unit. Strange (maybe?) behavior.

#23 Post by Gibson Praise » 04.04.2016 20:54

Jacob Klein wrote: Here's where I'm tracking my progress, while giving insights into anything neat I find about calculating estimates:
here
... which I think you might like to read.
I have read it with interest :)
Jacob Klein wrote: Since your task
6330836
... has no wingman that has completed it yet, you're blazing new ground, so keep up the good work -- keep that system stable!
That gave me a bit of a laugh. The time remaining in the BOINC UI changed a while back from the steady 1306:56:2? hours remaining that it started with to a very firm 87600 hours to go. Just a Leap Day or three under ten years to go. That is a lot of stability! :shocked!:
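
For the curious, the arithmetic behind that remark: 87600 hours is exactly ten 365-day years, which is why it comes out a few leap days short of ten calendar years. A quick check (the function name is made up for illustration):

```python
def hours_to_days_and_years(hours, days_per_year=365):
    """Convert a number of hours to (days, years), using fixed-length years."""
    days = hours / 24
    return days, days / days_per_year


if __name__ == "__main__":
    days, years = hours_to_days_and_years(87600)
    print(days, years)
```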

Edit: It has indeed just kicked over to .98765 .. I'm off to explore unknown quadrants! :good:

Jacob Klein
Brain-Bug
Posts: 564
Joined: 26.07.2013 15:41

Re: New Unit. Strange (maybe?) behavior.

#24 Post by Jacob Klein » 04.04.2016 21:04

I estimate that your task will be completed within 4 years of CPU time. That's the best estimate I can give. :)
Realistically, it'll probably be done in 0.5 to 2.0 years.
