New Unit. Strange (maybe?) behavior.
-
- Admin
- Posts: 1920
- Registered: 23.02.2010 22:12
Re: New Unit. Strange (maybe?) behavior.
Your stderr.txt output does not look "normal". The forecast could not be calculated because its output was printed to the log instead of being written to a file that the control script can use. That should not impact the science. What can impact the science is the write errors later on. Can you please post the resultid of this task? I will take a look at the input files. Because of the write issue, I would suggest that you abort the task and let someone else try.
- Gibson Praise
- Mikrocruncher
- Posts: 16
- Registered: 31.01.2016 01:53
- Location: Hiding from the Syndicate
- Contact details:
Re: New Unit. Strange (maybe?) behavior.
Michael H.W. Weber wrote:
Jacob Klein wrote: Might I recommend updating to the latest version of BOINC? I think I see that you're running 7.6.9, but 7.6.22 is the current recommended version. It might help the progress problem, but it might not. Still worth upgrading.
Maybe Christian should first comment on the question whether an upgrade of BOINC might negatively affect one of these long running tasks? Even I have several extremely long running tasks in progress and use the same "outdated" BOINC version because so far I did not dare to change to a newer one...
Michael.
You are exactly right Michael, that's the fear I have. Even upgrading the client (which I logically know has very little chance of tossing a spanner in the works), much less upgrading vbox from 4 to 5, is just not worth it.
Now if somebody says they have done this successfully .. great! I might try it.
But given that this workunit has had 55 unsuccessful attempts so far ( My first wingman died today ) over 2+ years, I'd really like to finish it.
- Gibson Praise
- Mikrocruncher
- Posts: 16
- Registered: 31.01.2016 01:53
- Location: Hiding from the Syndicate
- Contact details:
Re: New Unit. Strange (maybe?) behavior.
Gibson Praise wrote: But given that this workunit has had 55 unsuccessful attempts so far ( My first wingman died today ) over 2+ years, I'd really like to finish it.
{ I think quoting yourself on a forum is a sign of serious mental maladjustment .. }
So maladjusted or not I have a question ..
"error while computing" is the most common error listed as terminating a workunit (certainly this workunit). Are there any tips at all on avoiding this dreaded message?
- Michael H.W. Weber
- Association board member
- Posts: 22419
- Registered: 07.01.2002 01:00
- Location: Marpurk
- Contact details:
Re: New Unit. Strange (maybe?) behavior.
Primarily, make sure you have enough memory: each RNA World WU reserves 3 GB, and on top of that you need to add the OS requirements (approx. 1 GB for Win 7) plus whatever your other activities on that machine require. Everything else has been quite stable so far.
Michael.
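The memory rule above can be turned into a quick budget check. This is only a minimal sketch in Python, assuming the figures from this post (3 GB per RNA World WU, roughly 1 GB for the OS); the function name and the `other_gb` parameter are made up for illustration and are not part of BOINC or the project.

```python
def max_concurrent_wus(total_ram_gb, os_gb=1.0, other_gb=0.0, per_wu_gb=3.0):
    """Return how many work units fit into RAM under the rule of thumb above.

    os_gb and per_wu_gb default to the figures quoted in the post;
    other_gb is a hypothetical allowance for everything else running.
    """
    free_gb = total_ram_gb - os_gb - other_gb
    if free_gb <= 0:
        return 0
    return int(free_gb // per_wu_gb)


if __name__ == "__main__":
    for ram in (4, 8, 16):
        print(ram, "GB ->", max_concurrent_wus(ram), "WU(s)")
```

With 8 GB and nothing else running, that leaves room for two WUs; if other programs need about 2 GB, only one fits.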
Promote, cooperate, and construct instead of demanding, competing, and consuming.
Re: New Unit. Strange (maybe?) behavior.
I have the same thing happening with WU 6330807.
It is stuck at 0.1%, but the value in the progress file does advance.
And today I have 0.0000 in the progress file.
Looking at the stderr, I have the read-only error.
I have Win10, a Core i7 4790K, 8 GB and BOINC 7.6.22 with VirtualBox 5.0.10.
Code:
2016-02-07 11:23:06 (5464): Creating new snapshot for VM.
2016-02-07 11:23:24 (5464): Deleting stale snapshot.
2016-02-07 11:23:24 (5464): Deleting stale snapshot.
2016-02-07 11:23:24 (5464): Deleting stale snapshot.
2016-02-07 11:23:24 (5464): Deleting stale snapshot.
2016-02-07 11:23:24 (5464): Deleting stale snapshot.
2016-02-07 11:23:24 (5464): Checkpoint completed.
2016-02-07 11:24:34 (5464): Guest Log: expr: write error: Read-only file system
2016-02-07 11:24:34 (5464): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-02-07 11:25:04 (5464): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
What should I do: abort it, or will it be OK if I let it run?
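For anyone wanting to check their own stderr.txt for the same symptom, here is a small Python sketch. The function name is hypothetical, and the marker string is simply taken from the log lines above, so other failure modes would go unnoticed.

```python
def find_read_only_errors(stderr_text):
    """Return the log lines that report a read-only guest file system."""
    marker = "Read-only file system"  # copied from the log excerpt above
    return [line.strip() for line in stderr_text.splitlines() if marker in line]


if __name__ == "__main__":
    # Sample lines copied from the excerpt above.
    sample = """\
2016-02-07 11:23:24 (5464): Checkpoint completed.
2016-02-07 11:24:34 (5464): Guest Log: expr: write error: Read-only file system
2016-02-07 11:24:34 (5464): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
"""
    for hit in find_read_only_errors(sample):
        print(hit)
```

An empty result only means this particular message was not found, not that the task is healthy.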
- Gibson Praise
- Mikrocruncher
- Posts: 16
- Registered: 31.01.2016 01:53
- Location: Hiding from the Syndicate
- Contact details:
Re: New Unit. Strange (maybe?) behavior.
Xe120 wrote: I have the same thing happening with WU 6330807.
It is stuck at 0.1%, but the value in the progress file does advance.
And today I have 0.0000 in the progress file.
Looking at the stderr, I have the read-only error.
I have Win10, a Core i7 4790K, 8 GB and BOINC 7.6.22 with VirtualBox 5.0.10.
Code:
2016-02-07 11:23:24 (5464): Checkpoint completed.
2016-02-07 11:24:34 (5464): Guest Log: expr: write error: Read-only file system
2016-02-07 11:24:34 (5464): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-02-07 11:25:04 (5464): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
What should I do: abort it, or will it be OK if I let it run?
Suspend it at minimum. I suspect that Christian will advise you to abort it, though he may wish to look at the associated logs first.
Re: New Unit. Strange (maybe?) behavior.
I used to have the same issue, see for instance this extract from my WU 6330801 logfile:
Code:
2016-01-28 19:59:38 (5768): Status Report: Elapsed Time: '90060.420721'
2016-01-28 19:59:38 (5768): Status Report: CPU Time: '89637.971799'
2016-01-28 20:03:18 (5768): Creating new snapshot for VM.
2016-01-28 20:03:31 (5768): Deleting stale snapshot.
2016-01-28 20:03:31 (5768): Deleting stale snapshot.
2016-01-28 20:03:31 (5768): Deleting stale snapshot.
2016-01-28 20:03:31 (5768): Deleting stale snapshot.
2016-01-28 20:03:31 (5768): Deleting stale snapshot.
2016-01-28 20:03:31 (5768): Checkpoint completed.
2016-01-28 20:04:32 (5768): Guest Log: expr: write error: Read-only file system
2016-01-28 20:04:32 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-01-28 20:05:02 (5768): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
2016-01-28 20:05:02 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-01-28 20:05:32 (5768): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
2016-01-28 20:05:32 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-01-28 20:06:02 (5768): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
2016-01-28 20:06:02 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
2016-01-28 20:06:32 (5768): Guest Log: ./boinc_app: line 80: cpu_time_cmd.txt: Read-only file system
2016-01-28 20:06:32 (5768): Guest Log: ./boinc_app: line 93: free_mem.txt: Read-only file system
It seems the WU was running properly and then suddenly failed (after roughly one day of computation). Trying to suspend and restart the task through BOINC is useless, since the VM does not actually start (this can be checked in VirtualBox). Starting the VM directly through VirtualBox also fails with a filesystem error on the snapshot.
I have checked the VirtualBox bug tracker, and it seems that the snapshot feature (which the project uses to manage the 'checkpoints') is broken in VirtualBox 5.0.x. I can't remember the bug ID, but it was supposed to be fixed in release 5.0.12 ... except that I was using this version and it wasn't.
Downgrading VirtualBox to the latest 4.3 release (4.3.36) seems to have solved the issue so far; I have been crunching 2 WUs for 8 days without problems.
As mentioned in a previous post, the task progress is not reported to BOINC (which is stuck at 0.1%), but the 'progress.txt' file is properly updated.
-
- Brain-Bug
- Posts: 564
- Registered: 26.07.2013 15:41
Re: New Unit. Strange (maybe?) behavior.
For those tracking this thread...
Christian mentioned that he updated the application to v1.17, which includes an updated VBoxWrapper, and a new vbox.job file that should allow progress to report correctly in the UI, instead of constantly reporting 0.1% progress. Prior versions are still valid for getting work done, though, so DON'T abort your 1.14/1.15/1.16 tasks!
Note: If progress appears to get stuck at 98.765%, it actually is still progressing, DON'T abort it - the estimate was just way too low, but your work is still valid, so don't abort it! You can make sure it is progressing because the progress.txt file's modified date should be getting updated, even if the value inside stays at 0.98765. If you want to know an even crazier way to verify progress, you can PM me and I'll explain a riskier technical way.
Thanks,
Jacob
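The modified-date check described above can be scripted. A minimal sketch in Python, using only the standard library: the one-hour threshold is an assumption (this thread does not say how often progress.txt is rewritten), the function names are made up, and the path has to be wherever progress.txt lives on your machine.

```python
import os
import time


def seconds_since_modified(path):
    """Return how many seconds ago the file was last modified."""
    return time.time() - os.path.getmtime(path)


def looks_alive(path, max_age_seconds=3600):
    """True if the file was modified within the last max_age_seconds.

    The default of one hour is an assumed threshold, not a project value.
    """
    return seconds_since_modified(path) <= max_age_seconds


if __name__ == "__main__":
    import sys
    # Pass the full path to progress.txt as the first argument.
    if len(sys.argv) > 1:
        print("still updating" if looks_alive(sys.argv[1]) else "no recent update")
```

This only reads the file's timestamp, so it does not touch the running task.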
- Gibson Praise
- Mikrocruncher
- Posts: 16
- Registered: 31.01.2016 01:53
- Location: Hiding from the Syndicate
- Contact details:
Re: New Unit. Strange (maybe?) behavior.
I was really hoping I would not have to post again on this thread.
However, this unit persists in being odd. The concern now is that the value in "progress.txt" has reached "0.999771". From what I have read, it should have stopped and held steady at 0.98765. I suspended the unit. It was racking up CPU time and regularly updating progress.txt in reasonable increments.
Being a paranoid individual, my concern is what happens when this reaches .999999 or 1.0 or ... well I am not sure what is going to happen and thought I would wait for comments from Jacob or Christian before plunging madly ahead.
-
- Brain-Bug
- Posts: 564
- Registered: 26.07.2013 15:41
Re: New Unit. Strange (maybe?) behavior.
Gibson:
Let it crunch, my friend! The way the code logic is written, I believe, is that "progress.txt" shows a "6-decimal" readout up to "0.999999", then resets to a constant "0.98765" until finished. It's not "odd"; it's just a not-so-robust way to show progress, I guess :/ I urged Christian months ago to implement something better, but it didn't happen.
So ... as long as it keeps utilizing your CPU, and as long as the progress.txt modified date keeps getting updated, and you aren't getting any errors in the BOINC UI for the task ... then just let it run! Could take a LONG time! Some of my units have been at 0.98765 for 50 days already.
Here's one of my 1.15 units, which shows the transition, and is still crunching:
Here's where I'm tracking my progress, while giving insights into anything neat I find about calculating estimates:
Progress:
0d -- 8/13/2015
159d -- 2/7/2016 -- UI 62.7%, progress.txt 0.850912
166d -- 2/14/2016 -- UI 64.253%, progress.txt 0.887791
172d -- 2/20/2016 -- UI 65.530%, progress.txt 0.919677
177d -- 2/26/2016 -- UI 66.648%, progress.txt 0.948510
181.7d -- 3/1/2016 -- UI 67.443%, progress.txt 0.969637
185.9d -- 3/5/2016 -- UI 68.275%, progress.txt 0.991864
192.2d -- 3/12/2016 -- UI 69.492%, progress.txt 0.98765
198.3d -- 3/18/2016 -- UI 70.624%, progress.txt 0.98765
206.4d -- 3/27/2016 -- UI 72.055%, progress.txt 0.98765
212.2d -- 4/2/2016 -- UI 73.042%, progress.txt 0.98765
viewtopic.php?f=75&t=16160
... which I think you might like to read.
Since your task
http://www.rnaworld.de/rnaworld/workuni ... id=6330836
... has no wingman that has completed it yet, you're blazing new ground, so keep up the good work -- keep that system stable!
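As a rough illustration of how the progress.txt readings in the table above relate to elapsed time, here is a short Python sketch. It assumes a simple linear rate between the first and last reading, which is not necessarily how the project computes its estimates; the function name is hypothetical.

```python
def progress_rate_per_day(readings):
    """Average change in the progress.txt value per day, first to last reading.

    readings: list of (elapsed_days, progress_value) tuples, oldest first.
    """
    (d0, p0), (d1, p1) = readings[0], readings[-1]
    if d1 <= d0:
        raise ValueError("readings must span a positive number of days")
    return (p1 - p0) / (d1 - d0)


if __name__ == "__main__":
    # Readings copied from the table above, before the value switched to 0.98765.
    readings = [
        (159.0, 0.850912),
        (166.0, 0.887791),
        (172.0, 0.919677),
        (177.0, 0.948510),
        (181.7, 0.969637),
        (185.9, 0.991864),
    ]
    rate = progress_rate_per_day(readings)
    days_to_one = (1.0 - readings[-1][1]) / rate
    print(f"rate: {rate:.6f} per day, about {days_to_one:.1f} days until 1.0")
```

On these figures the value gains roughly 0.005 per day and would pass 0.999999 within a couple of days of the 3/5/2016 reading, which fits the 0.98765 readout seen from 3/12/2016 onward. It says nothing about when the task itself finishes.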
- Gibson Praise
- Mikrocruncher
- Posts: 16
- Registered: 31.01.2016 01:53
- Location: Hiding from the Syndicate
- Contact details:
Re: New Unit. Strange (maybe?) behavior.
Jacob Klein wrote: Here's where I'm tracking my progress, while giving insights into anything neat I find about calculating estimates:
here
... which I think you might like to read.
I have read it with interest.
Jacob Klein wrote: Since your task
6330836
... has no wingman that has completed it yet, you're blazing new ground, so keep up the good work -- keep that system stable!
That gave me a bit of a laugh. The time remaining in the BOINC UI changed a while back from the steady 1306:56:2? hours remaining that it started with to a very firm 87600 hours to go. Just a Leap Day or three under ten years to go. That is a lot of stability!
Edit: It has indeed just kicked over to .98765 .. I'm off to explore unknown quadrants!
-
- Brain-Bug
- Posts: 564
- Registered: 26.07.2013 15:41
Re: New Unit. Strange (maybe?) behavior.
I estimate that your task will be completed within 4 years of CPU time. That's the best estimate I can give.
Realistically, it'll probably be done in 0.5 to 2.0 years.