Assistance needed - 2 Long-running VMs just failed - Clones
-
ChristianB
- Admin

- Posts: 1920
- Registered: 23.02.2010 22:12
Re: Assistance needed - 2 Long-running VMs just failed - Clo
You can still monitor the modification timestamp of progress.txt. This should still change although the value stays the same.
I'll see what I can implement with the next update of the control script.
But what your experiment shows is that this particular error seems to be caused not by the VM itself but by some other still unknown reason.
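The timestamp check suggested above could be sketched roughly as follows. This is only an illustration in Python: the two-hour threshold is an assumption (the thread does not say how often the control script rewrites progress.txt), and the function names are hypothetical.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds ago `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def looks_stalled(path, max_age_seconds=2 * 3600, now=None):
    """True if the file was not rewritten within max_age_seconds.

    The default threshold is an assumed value, not one taken from
    the real control script.
    """
    return seconds_since_update(path, now) > max_age_seconds
```

Calling `looks_stalled("progress.txt")` from the host would then flag a VM whose control script has stopped touching the file, even while the value inside stays at 0.98765.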
-
Jacob Klein
- Brain-Bug

- Posts: 564
- Registered: 26.07.2013 15:41
Re: Assistance needed - 2 Long-running VMs just failed - Clo
Yes, I know I can monitor timestamps. But, if Snapshot #7 saves that file, it'll have a newer timestamp than if Snapshot #42 (newer) did. I'm hoping that your next control script can do something similar to what I recommended.
And I do understand that we really have no idea how much longer these are going to take. I'll continue to babysit them -- so far so good!
I'm actually running them on VirtualBox 4.3.14 RC1. I had to set No New Tasks for all the BOINC VM projects (since 4.3.14 RC1 doesn't play nice with them), but I'm still running other non-VM BOINC tasks alongside these 2 RNA World VMs.
-
ChristianB
- Admin

- Posts: 1920
- Registered: 23.02.2010 22:12
Re: Assistance needed - 2 Long-running VMs just failed - Clo
You can just delete the files and see if it comes back.
-
Jacob Klein
- Brain-Bug

- Posts: 564
- Registered: 26.07.2013 15:41
Re: Assistance needed - 2 Long-running VMs just failed - Clo
I'm not explaining myself very well.
If I delete the progress.txt file, and start from snapshot #7, the file that gets generated will have a timestamp of x with a progress of .98765.
If I delete the progress.txt file, and start from snapshot #42 (which could have weeks of additional computation done), the file that gets generated will have a timestamp of x with a progress of .98765.
There is no way to know that one snapshot has significantly progressed farther than the other.
I recommend that you keep the existing algorithm that goes up until 0.98765, but that you additionally also create a "new progress algorithm" to supply progress after the current 0.98765.... Maybe increment 0.00001 every hour, with a cap at 0.99999 or something like that.
-
ChristianB
- Admin

- Posts: 1920
- Registered: 23.02.2010 22:12
Re: Assistance needed - 2 Long-running VMs just failed - Clo
Now I understand, but your solution just pushes the cap to 0.99999, and then we have the same problem. The current implementation counts up to 100% and then resets to 98.765%, which corresponds to your 99.999%.
The real solution would be to use an exponential function that slows the counter down the closer it gets to 100%. This way it never reaches 100% but still moves slightly from time to time. Unfortunately, this kind of function is not available in bash (or I haven't found it).
The assumption with the timestamp is this: as long as the timestamp updates, the control script is running. As long as the control script is running, there is a cmsearch process active within the VM. I assume that as long as the process is active, it is doing something useful. Because cmsearch is a black box to us, we can't say more. If the timestamp does not update but the VM is still running, something went wrong inside the VM (which should have been handled by the control script but was not).
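For illustration, the exponential slowdown described in this post might look like the sketch below. It is written in Python rather than bash; from a shell script, `awk` (which has `exp()`) or `bc -l` (which has `e()`) could probably compute the same thing. The starting value 0.98765 comes from the thread; the 500-hour time constant, the display cap, and the function name are assumptions.

```python
import math


def asymptotic_progress(hours_over, start=0.98765,
                        time_constant_hours=500.0, cap=0.999999):
    """Approach 1.0 from `start`; the remaining gap decays exponentially.

    hours_over is the time spent beyond the point where the normal
    estimate ran out. time_constant_hours is an assumed tuning value.
    """
    if hours_over <= 0:
        return start
    value = 1.0 - (1.0 - start) * math.exp(-hours_over / time_constant_hours)
    # Floating point eventually rounds the result to exactly 1.0,
    # so clamp just below it.
    return min(value, cap)
```

With these assumed values the counter keeps rising on every update, moves more slowly the longer the task runs, and never reports 100%.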
-
Jacob Klein
- Brain-Bug

- Posts: 564
- Registered: 26.07.2013 15:41
Re: Assistance needed - 2 Long-running VMs just failed - Clo
I still propose the following as the control script:
- Use your maximum estimate on how long the task will take (so, maximum of: "Estimated time on reference system" [if available]... and the application's "forecast" [if available]). Use the maximum value of any of your estimated guesses.
- Allow progress from 0 to 95%, using that estimate, such that the time at 95% is that maximum estimated time. Do not allow it to go to 100%. It's stupid to allow it to go to 100%, then go back down. We should only progress forward, never backward.
- If more time is needed beyond 95%, increment 0.001% (0.00001) every hour or so. The user should be able to compare the % now, as compared to the % a few hours ago, to see that progress is indeed being made, even if it seems slow. Some users (like me) actually jot down progress %'s to compare later. We need to prevent users thinking "things are stuck" at any given percentage, like 98.765%.
- Put a hard cap at 99.999%. Never allow the progress % to be higher. I realize you are saying that this just "pushes the cap", but think about it. If you were incrementing by 0.00001 every hour, starting at the maximum estimated time at 95%, then the only way it could get to that 99.999% cap is: [maximum estimated time] + [roughly 5000 hours]. That, to me, seems very reasonable. We would be supplying an actual indication of progress, all the way up to roughly 5000 hours beyond maximum estimate.
Please seriously consider this approach. We absolutely need to prevent users thinking things are stuck.
PS: My 2 tasks are still progressing, as best as I can tell. But soon, when both reach a 98.765% cap, I will have no way of knowing that progress is being made.
You can fix this.
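A minimal sketch of the scheme proposed in this post, in Python for readability (the real control script is bash, and nothing here is taken from it). The function names are hypothetical; the 95% ramp, the 0.00001 hourly step and the 0.99999 cap are the numbers from the post.

```python
def proposed_progress(elapsed_hours, max_estimate_hours,
                      ramp_end=0.95, step_per_hour=0.00001, cap=0.99999):
    """Progress fraction that only moves forward.

    Phase 1: linear ramp from 0 to ramp_end over the maximum estimate.
    Phase 2: add step_per_hour for each hour past the estimate,
             never exceeding cap.
    """
    if max_estimate_hours <= 0:
        raise ValueError("max_estimate_hours must be positive")
    if elapsed_hours <= 0:
        return 0.0
    if elapsed_hours <= max_estimate_hours:
        return ramp_end * elapsed_hours / max_estimate_hours
    overtime = elapsed_hours - max_estimate_hours
    return min(cap, ramp_end + step_per_hour * overtime)


def hours_until_cap(ramp_end=0.95, step_per_hour=0.00001, cap=0.99999):
    """Hours past the maximum estimate before the cap is reached."""
    return (cap - ramp_end) / step_per_hour
```

`hours_until_cap()` works out to about 4999 hours, which is the "5000 hours beyond maximum estimate" quoted in the post.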
-
ChristianB
- Admin

- Posts: 1920
- Registered: 23.02.2010 22:12
Re: Assistance needed - 2 Long-running VMs just failed - Clo
I didn't calculate the numbers before. Your proposal is acceptable and I will implement it in the next version.
-
Jacob Klein
- Brain-Bug

- Posts: 564
- Registered: 26.07.2013 15:41
Re: Assistance needed - 2 Long-running VMs just failed - Clo
Hurray - you've made my day! The original post mentioned only doing the 0.00001 progression hourly from 98.765%, but I changed the post to indicate it should start at 95% instead, thus allowing more "indication of progress" during the "we are beyond maximum estimate" time period.
So, basically, I hope you implement the 95% version that my post currently shows, which allows hourly progress indication all the way to 5000 hours beyond maximum estimate. Thank you!
I'm still uber excited about these 2 monster tasks - I'll be so happy when they complete! I couldn't care less about the credits - I just want these beasts to be tackled.
-
Jacob Klein
- Brain-Bug

- Posts: 564
- Registered: 26.07.2013 15:41
Re: Assistance needed - 2 Long-running VMs just failed - Clo
These 2 tasks are still crunching, outside of BOINC, and I'm still taking snapshots and clones every so often.
Progress.txt of first one: 0.98765 (still)
Progress.txt of second one: 0.966948 (approaching the point where I get to see if it goes beyond 0.98765, or caps)
I will continue to run them. I hope they end up with useful results.
-
Jacob Klein
- Brain-Bug

- Posts: 564
- Registered: 26.07.2013 15:41
Re: Assistance needed - 2 Long-running VMs just failed - Clo
Both tasks now show Progress.txt of 0.98765. I watched the 2nd task go up to something like 0.998, and then it went to 0.98765, which I guess is expected with this control script. So, at this point, they are both still crunching, and I'll continue working on them, trying to keep my clones and snapshots in order so as not to accidentally lose any progress (which can't be "seen" because of the 0.98765 cap).
I'm looking forward to being done with these, and I'm also looking forward to seeing tasks with the new control script.
Thanks,
Jacob
-
Werinbert
- Taschenrechner

- Posts: 14
- Registered: 26.05.2013 01:51
Re: Assistance needed - 2 Long-running VMs just failed - Clo
I don't have a solution to the OP, but I have come across a similar error on one of my long-running VM tasks.
The error was triggered by a series of events happening in conjunction. At the time of the incident I was running one RNA task, two Climateathome tasks (also VM), a number of PrimeGrid sieves and DistributedDatamining tasks. The dDM tasks run for only around 15 minutes, but their estimated times were around 144 hours each. When BOINC requested more work it downloaded a bunch of these tasks, causing a massive increase in the estimated future running time. Consequently the BOINC scheduler pushed the short-deadline PG tasks into priority mode. I suspect that this resulted in the three VM tasks getting pushed into waiting mode (a suspicion based on seeing this happen when I was trying to figure out what had happened in the first place). In turn the three VM tasks somehow became screwed up. The RNA task and one Climateathome task quit prematurely with errors, and the other Climateathome task stopped processing with "Scheduler wait: VM job unmanageable, restarting later."
from http://lhcathome2.cern.ch/vLHCathome/fo ... hp?id=1424
So I think the failure in the RNA task is related to the forced suspension caused by BOINC putting another task into priority mode.
Normally something like this happens if a requested suspend or resume fails.
Vboxwrapper attempts to shut everything down and tells BOINC to reschedule the job in 24 hours. In theory this gives the VirtualBox system time to reset itself so that Vboxwrapper can manage the VM again in the next attempt.
More details can be found in the stderr.txt file. If you are feeling adventurous you can look at vbox_trace.txt to see what commands were executed and what return values vboxmanage gave us.
----- Rom