More P2 problems

Fehler und Wünsche zum Projekt yoyo@home
Bugs and wishes for the project yoyo@home
Nachricht
Autor
Dunckx
PDA-Benutzer
PDA-Benutzer
Beiträge: 45
Registriert: 12.11.2014 09:26

More P2 problems

#1 Ungelesener Beitrag von Dunckx » 21.06.2017 15:02

I have just had to abort eight P2 WU after BOINC took over 40GB of memory on my 32GB system. My settings allow BOINC to have 85% of the RAM when the computer is not in use, and 70% when it is. Paging file use is supposedly limited to 50% and the total paging file size is 22.7GB for a total system RAM of 54.7GB of physical RAM + virtual RAM. Task Manager claims that the commit size was 46GB out of 51GB (second screenshot). Please see attached screen captures.

This should not be possible!

So I had to try to get into Task Manager and kill the ecm processes, which took several minutes as the pc would not respond, then when I finally got into BOINC Manager and tried to kill the WU they kept respawning and it took maybe half a dozen attempts to abort all eight. I am amazed that the whole system did not crash!

Please, no more P2 WU on this machine unless there's a fix (and I am willing to be a beta tester.)
All comments and criticisms welcome.

Dunckx
Dateianhänge
yoyo memory problem 210617 14-00.jpg
yoyo memory problem 210617 14-00.jpg (185.13 KiB) 9720 mal betrachtet
yoyo memory problem 210617 14-10.jpg
yoyo memory problem 210617 14-10.jpg (247.74 KiB) 9720 mal betrachtet

Benutzeravatar
yoyo
Vereinsvorstand
Vereinsvorstand
Beiträge: 8045
Registriert: 17.12.2002 14:09
Wohnort: Berlin
Kontaktdaten:

Re: More P2 problems

#2 Ungelesener Beitrag von yoyo » 21.06.2017 20:48

Hello,

which Boinc version you are using? Is it the most recent one?

Can you set the mem_usage_debug logging flag and see if the numbers look right?

yoyo
HILF mit im Rechenkraft-WiKi, dies gibts zu tun.
Wiki - FAQ - Verein - Chat

Bild Bild

Dunckx
PDA-Benutzer
PDA-Benutzer
Beiträge: 45
Registriert: 12.11.2014 09:26

Re: More P2 problems

#3 Ungelesener Beitrag von Dunckx » 22.06.2017 09:06

Hi Yoyo,

I have set the flag as requested. My BOINC version is 7.6.33 (x64). I have had another couple of P2 WU which I will allow to run to see what happens.

My claim above that WU were "respawning" after deletion is wrong. My PC had downloaded a cartload of P2 WU which I hadn't realised were there.

The event log for the last day has many lines like these, the first of these three was the first one which happened and is not a P2. There are around 35 of them, all about the time the system seized up.

21/06/2017 13:41:13 | yoyo@home | Task ecm_uc_1497728397_np_161_2900e6_1_6443_P1_0 exited with zero status but no 'finished' file
21/06/2017 13:41:13 | yoyo@home | If this happens repeatedly you may need to reset the project.
21/06/2017 13:41:13 | yoyo@home | Task ecm_uc_1498005016_np_161_2900e6_1_2_P2_0 exited with zero status but no 'finished' file

I have not reset the project yet, as it is behaving itself at the moment.
There is also this interesting part:

21/06/2017 19:51:58 | yoyo@home | [error] Can't rename output file slots/0/out to projects/www.rechenkraft.net_yoyo/ecm_uc_1497728397_np_161_2900e6_1_6918_P1_0_0: Error 32
21/06/2017 19:52:03 | yoyo@home | [error] read_stderr_file(): malloc() failed

This was long after the near-crash, which happened around 14:00 and I did not notice anything unusual around 19:52.
Thanks for your help.
Dunckx

Dunckx
PDA-Benutzer
PDA-Benutzer
Beiträge: 45
Registriert: 12.11.2014 09:26

Re: More P2 problems

#4 Ungelesener Beitrag von Dunckx » 24.06.2017 11:09

At the moment it is behaving itself, at the risk of inviting another near seize-up I need another stack of P2 WU in order to catch it misbehaving. A problem I now have is that after activating the memory debug flag, the event log is only keeping events for thirty minutes, so there isn't much chance of catching BOINC in the act if it is doing something wrong. Without the memory debug flag, the event log kept stuff 24 hours. I can't sit here watching it 24 hours a day!

I have noticed an oddity, at the beginning of the day when it all went wrong, the event log has:

20/06/2017 08:57:52 | | cc_config.xml not found - using defaults
20/06/2017 08:57:52 | | Starting BOINC client version 7.6.33 for windows_x86_64
20/06/2017 08:57:52 | | log flags: file_xfer, sched_ops, task
20/06/2017 08:57:52 | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
20/06/2017 08:57:52 | | Data directory: E:\ProgramData\BOINC

and I can't help wondering if the problem occurred because the cc_config.xml file was not found where expected, i.e. my machine has no C:\ProgramData\BOINC directory because it's on the E: drive. But the last line above seems to have found it OK, and it then gives after a few more lines:

20/06/2017 08:57:54 | | Reading preferences override file
20/06/2017 08:57:54 | | Preferences:
20/06/2017 08:57:54 | | max memory usage when active: 22871.92MB
20/06/2017 08:57:54 | | max memory usage when idle: 27773.05MB

so it looks like it does find the xml file eventually. I just don't see any indication of a problem unitl it actually happened at 13:41 with the lines:

21/06/2017 13:41:13 | yoyo@home | Task ecm_uc_1497728397_np_161_2900e6_1_6443_P1_0 exited with zero status but no 'finished' file
21/06/2017 13:41:13 | yoyo@home | If this happens repeatedly you may need to reset the project.

followed by a whole load of similar lines. That's when the near-crash occurred. So at the moment I think I need two things to track this down, first a work-around so I can get the event log recording memory activity for 24 hours instead of 30 minutes only, and then another pile of P2 WU to see if the error is reproducible and hopefully via the log identifiable and fixable.

Anyhow, all comments and criticisms welcome.

Dunckx

Benutzeravatar
yoyo
Vereinsvorstand
Vereinsvorstand
Beiträge: 8045
Registriert: 17.12.2002 14:09
Wohnort: Berlin
Kontaktdaten:

Re: More P2 problems

#5 Ungelesener Beitrag von yoyo » 24.06.2017 12:06

If the mem_usage_debug flag would be active, Boinc would report the mem usage per app in the log. Seems not to be the case.

As decribed here https://boinc.berkeley.edu/wiki/client_ ... ation_file you can set the mem_usage_debug flag also in the boinc manager.
HILF mit im Rechenkraft-WiKi, dies gibts zu tun.
Wiki - FAQ - Verein - Chat

Bild Bild

Dunckx
PDA-Benutzer
PDA-Benutzer
Beiträge: 45
Registriert: 12.11.2014 09:26

Re: More P2 problems

#6 Ungelesener Beitrag von Dunckx » 25.06.2017 12:05

I do have the memory debug flag set, but it was not set at the time the incident occurred, hence the above data without the memory info.

If I look at the event log file now, the first lines are:


25/06/2017 11:16:34 | yoyo@home | [mem_usage]
ecm_cw_1498331145_GW_4_463_2685_0: WS 1290.38MB, smoothed 1304.08MB, swap
1287.97MB, 0.00 page faults/sec, user CPU 1979.809, kernel CPU 2.246
25/06/2017 11:16:34 | yoyo@home | [mem_usage]
ecm_cw_1498331145_GW_4_463_2690_0: WS 1290.30MB, smoothed 1350.81MB, swap
1287.97MB, 0.00 page faults/sec, user CPU 1962.196, kernel CPU 1.872
25/06/2017 11:16:34 | yoyo@home | [mem_usage]
ecm_cw_1498331145_GW_4_463_2695_0: WS 1394.63MB, smoothed 1376.70MB, swap
1392.28MB, 0.00 page faults/sec, user CPU 1950.247, kernel CPU 1.966
25/06/2017 11:16:34 | yoyo@home | [mem_usage]
ecm_cw_1498331145_GW_4_463_2700_0: WS 1370.59MB, smoothed 1332.30MB, swap
1392.28MB, 0.00 page faults/sec, user CPU 1944.209, kernel CPU 2.075
25/06/2017 11:16:34 | | [mem_usage] BOINC totals: WS 7911.57MB, smoothed
7681.94MB, swap 7916.90MB, 0.00 page faults/sec
25/06/2017 11:16:34 | | [mem_usage] All others: WS 1370.63MB, swap
1873.64MB, user 14043.538s, kernel 2490.369s
25/06/2017 11:16:34 | | [mem_usage] non-BOINC CPU usage: 0.15%

and the last lines are:

25/06/2017 11:46:53 | yoyo@home | [mem_usage]
ecm_cw_1498331145_GW_4_463_2695_0: WS 3.92MB, smoothed 3.92MB, swap 1.61MB,
0.00 page faults/sec, user CPU 1637.184, kernel CPU 1.529
25/06/2017 11:46:53 | yoyo@home | [mem_usage]
ecm_cw_1498331145_GW_4_463_2700_0: WS 3.92MB, smoothed 3.92MB, swap 1.61MB,
0.00 page faults/sec, user CPU 1616.670, kernel CPU 1.466
25/06/2017 11:46:53 | | [mem_usage] BOINC totals: WS 31.29MB, smoothed
31.29MB, swap 12.84MB, 0.00 page faults/sec
25/06/2017 11:46:53 | | [mem_usage] All others: WS 1419.34MB, swap
1915.05MB, user 14088.560s, kernel 2501.570s
25/06/2017 11:46:53 | | [mem_usage] non-BOINC CPU usage: 0.25%


So presently the event log file is only recording half an hour of activity, in this case from 25/06/2017 11:16:34 to 25/06/2017 11:46:53. I need to be able to increase this so that it will capture several hours, it probably doesn't need to be 24, 8 ought to be enough so long as there's eight or more P2 WU to run concurrently. That should reproduce the problem and give a fair chance of capturing relevant memory issues.

The downside will be that the resulting log file will be big, and a real pain to trawl through to find possible issues, but I don't see any alternative if this thing is to be tracked down and dealt with.

What do you reckon? (Hmm, Rechen? ;-)

Dunckx

Benutzeravatar
yoyo
Vereinsvorstand
Vereinsvorstand
Beiträge: 8045
Registriert: 17.12.2002 14:09
Wohnort: Berlin
Kontaktdaten:

Re: More P2 problems

#7 Ungelesener Beitrag von yoyo » 25.06.2017 13:47

On my Boinc the mem_usage is logged every 10s. So if you have many P2 it takes Boinc at least 10s to recognize that too much memory is consumed.
There will be more ru P2 soon. The P1 of them are already started.
HILF mit im Rechenkraft-WiKi, dies gibts zu tun.
Wiki - FAQ - Verein - Chat

Bild Bild

Dunckx
PDA-Benutzer
PDA-Benutzer
Beiträge: 45
Registriert: 12.11.2014 09:26

Re: More P2 problems

#8 Ungelesener Beitrag von Dunckx » 28.06.2017 20:50

Unfortunately, the last batch of P2s were not quite enough to trigger the problem (yes, I know I'm asking for it...) but what I did get was a snapshot of the BOINC manager dealing with it as you would hope.

The screenshots show that memory use does indeed shoot up past the intended limit to close to 32GB. However, the duration of these two separate peaks is exactly ten seconds, i.e. the time it should take for the logging to pick up the error condition, and this is exactly what you see in the logfiles. I will post just an excerpt.


27/06/2017 08:28:24 | yoyo@home | [mem_usage] ecm_ru_1498382854_10_1243M.c1044_4015_0: WS 3.89MB, smoothed 3.89MB, swap 1.58MB, 0.00 page faults/sec, user CPU 840.752, kernel CPU 0.172
27/06/2017 08:28:24 | yoyo@home | [mem_usage] ecm_uc_1498524915_np_161_2900e6_1_1_P2_0: WS 7031.63MB, smoothed 7042.97MB, swap 7622.87MB, 0.00 page faults/sec, user CPU 1750.908, kernel CPU 2.824
27/06/2017 08:28:24 | yoyo@home | [mem_usage] ecm_uc_1498507216_np_161_2900e6_1_0_P2_0: WS 7397.88MB, smoothed 7408.46MB, swap 7622.87MB, 0.00 page faults/sec, user CPU 1753.623, kernel CPU 3.338
27/06/2017 08:28:24 | yoyo@home | [mem_usage] ecm_uc_1498511711_np_161_2900e6_1_0_P2_0: WS 7361.88MB, smoothed 7384.43MB, swap 7622.84MB, 0.00 page faults/sec, user CPU 1742.141, kernel CPU 3.385
27/06/2017 08:28:24 | yoyo@home | [mem_usage] ecm_uc_1498496712_np_161_2900e6_1_1_P2_0: WS 6260.30MB, smoothed 5778.97MB, swap 6899.37MB, 0.00 page faults/sec, user CPU 293.157, kernel CPU 0.780
27/06/2017 08:28:24 | | [mem_usage] BOINC totals: WS 28059.64MB, smoothed 27622.79MB, swap 29771.37MB, 0.00 page faults/sec
27/06/2017 08:28:24 | | [mem_usage] All others: WS 1104.25MB, swap 1569.44MB, user 16023.253s, kernel 3435.782s
27/06/2017 08:28:24 | | [mem_usage] non-BOINC CPU usage: 0.10%
27/06/2017 08:28:31 | | [mem_usage] enforce: available RAM 27773.05MB swap 20944.19MB
27/06/2017 08:28:31 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498476611_np_161_2900e6_1_11_P2_1 can't run, too big 6743.07MB > 154.33MB
27/06/2017 08:28:31 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498059915_np_161_2900e6_1_6_P2_1 can't run, too big 6020.06MB > 154.33MB
27/06/2017 08:28:31 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498524915_np_161_2900e6_1_2_P2_0 can't run, too big 6016.60MB > 154.33MB
27/06/2017 08:28:34 | yoyo@home | [mem_usage] (not running)ecm_ru_1498382139_10_507M.c200_2697_P1_0: WS 4.07MB, smoothed 4.07MB, swap 1.85MB, 0.00 page faults/sec, user CPU 19306.153, kernel CPU 17.878
27/06/2017 08:28:34 | yoyo@home | [mem_usage] ecm_ru_1498382854_10_1243M.c1044_4015_0: WS 3.89MB, smoothed 3.89MB, swap 1.58MB, 0.00 page faults/sec, user CPU 850.954, kernel CPU 0.172
27/06/2017 08:28:34 | yoyo@home | [mem_usage] ecm_uc_1498524915_np_161_2900e6_1_1_P2_0: WS 7031.63MB, smoothed 7037.30MB, swap 7622.87MB, 0.00 page faults/sec, user CPU 1761.111, kernel CPU 2.824
27/06/2017 08:28:34 | yoyo@home | [mem_usage] ecm_uc_1498507216_np_161_2900e6_1_0_P2_0: WS 7397.88MB, smoothed 7403.17MB, swap 7622.87MB, 0.00 page faults/sec, user CPU 1763.825, kernel CPU 3.338
27/06/2017 08:28:34 | yoyo@home | [mem_usage] ecm_uc_1498511711_np_161_2900e6_1_0_P2_0: WS 7361.88MB, smoothed 7373.16MB, swap 7622.84MB, 0.00 page faults/sec, user CPU 1752.344, kernel CPU 3.385
27/06/2017 08:28:34 | yoyo@home | [mem_usage] ecm_uc_1498496712_np_161_2900e6_1_1_P2_0: WS 6260.90MB, smoothed 6019.94MB, swap 6900.27MB, 0.00 page faults/sec, user CPU 303.360, kernel CPU 0.780
27/06/2017 08:28:34 | | [mem_usage] BOINC totals: WS 28060.25MB, smoothed 27841.52MB, swap 29772.27MB, 0.00 page faults/sec
27/06/2017 08:28:34 | | [mem_usage] All others: WS 1104.20MB, swap 1569.40MB, user 16023.300s, kernel 3435.813s
27/06/2017 08:28:34 | | [mem_usage] non-BOINC CPU usage: 0.10%
27/06/2017 08:28:34 | | [mem_usage] enforce: available RAM 27773.05MB swap 20944.19MB
27/06/2017 08:28:34 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498496712_np_161_2900e6_1_1_P2_0 can't run, too big 6019.94MB > 5955.53MB
27/06/2017 08:28:34 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498476611_np_161_2900e6_1_11_P2_1 can't run, too big 6743.07MB > 5955.53MB
27/06/2017 08:28:34 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498059915_np_161_2900e6_1_6_P2_1 can't run, too big 6020.06MB > 5955.53MB
27/06/2017 08:28:34 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498524915_np_161_2900e6_1_2_P2_0 can't run, too big 6016.60MB > 5955.53MB
27/06/2017 08:28:34 | yoyo@home | [mem_usage] job using too much memory, will preempt by quit
27/06/2017 08:28:36 | | [mem_usage] enforce: available RAM 27773.05MB swap 20944.19MB
27/06/2017 08:28:36 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498476611_np_161_2900e6_1_11_P2_1 can't run, too big 6743.07MB > 5955.53MB
27/06/2017 08:28:36 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498059915_np_161_2900e6_1_6_P2_1 can't run, too big 6020.06MB > 5955.53MB
27/06/2017 08:28:36 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498524915_np_161_2900e6_1_2_P2_0 can't run, too big 6016.60MB > 5955.53MB
27/06/2017 08:28:36 | yoyo@home | [cpu_sched_debug] enforce: result ecm_uc_1498496712_np_161_2900e6_1_1_P2_0 can't run, too big 6019.94MB > 5955.53MB
27/06/2017 08:28:45 | yoyo@home | [mem_usage] (not running)


That's enough to show it is working as it should with five P2 running.

Now I need to find out what is going wrong and for that it will take (cringe) more P2 WU. I am determined to track this down now. My best guess so far is that when you have eight simultaneous P2 running with more in the pipeline, something goes wrong during those "ten seconds of grace" in which the BOINC Manager has time to spot the fault and take immediate action. For some reason it doesn't apply corrective action in time and the thing locks up. It seems to share some of the symptoms of excessive delay in closed loop mechanical servo systems, which, thinking about it, is pretty well what this is, but in software. It may stop one or more running P2 only to start one or more new P2 but instead of the memory requirements going down, they go up, and with yet more P2 awaiting, you've got a positive feedback cycle in which a new P2 replaces yet another stalled P2 and you've got a system lock-up on your hands.

OK, that's all guesswork, but I will be surprised if I am that far off the mark.
All comments and criticisms welcome.

Dunckx
Dateianhänge
yoyo memory problem 270617 0858.jpg
yoyo memory problem 270617 0858.jpg (72.25 KiB) 9596 mal betrachtet
yoyo memory problem 270617 0919.jpg
yoyo memory problem 270617 0919.jpg (168.91 KiB) 9596 mal betrachtet

Benutzeravatar
yoyo
Vereinsvorstand
Vereinsvorstand
Beiträge: 8045
Registriert: 17.12.2002 14:09
Wohnort: Berlin
Kontaktdaten:

Re: More P2 problems

#9 Ungelesener Beitrag von yoyo » 28.06.2017 21:01

I assume, that if you e.g. start a fresh Boinc installation you get in your case 8 P2 at the same time and they start at the exact same time, because there is enough ram available. But then all 8 allocate at the same time 10 GB memory each and the system gets very slow and swaps memory to disk and vice versa. It takes 10 s for Boinc to recognize it and stoping those WUs is also very slow because the system is already very intensive swapping.
HILF mit im Rechenkraft-WiKi, dies gibts zu tun.
Wiki - FAQ - Verein - Chat

Bild Bild

Dunckx
PDA-Benutzer
PDA-Benutzer
Beiträge: 45
Registriert: 12.11.2014 09:26

Re: More P2 problems

#10 Ungelesener Beitrag von Dunckx » 30.06.2017 09:02

The story so far...

The last batch of P2 was being dealt with fine by the BOINC Manager, so I tried to think of what I was doing when the original problem occurred. I was using VLC Player to watch a video stored on my data hard drive, which is a 500GB rotating platter, not ssd. Sure enough, I got symptoms of the earlier freeze, but had to wait a while. The signs that something was wrong were jerky replay of the video and pixels changing to areas of blocky colours during replay, so it was affecting video ram, or it was somehow interrupting GPU action, so my next plan of attack will be to try disabling GPU computing and enabling it and seeing if this makes a difference.

At the moment, GPU computing is disabled, since Yoyo doesn't use it, but it was enabled (my error) when the original lockup occurred.

The pc recovered faily quickly once I quit VLC player, and as this happened twice, I got two screenshots, neither of which show the enormous memory problem of the original. I can only post one as the other is 500k or so and I can't trim it any smaller. However, on the one shown here you can see an area to the top left of the Manager window which is "discoloured", so something has corrupted the video memory. There's a similar video flaw in the original screenshots where a black area appears, in the file "yoyo memory problem 210617 1400.jpg", around the Windows Task Manager screen at the bottom left, kernel memory paged/unpaged.

I haven't done analysing the log files, but will post anything of interest later.

Overnight I left it running a pile of P2 with GPU computing disabled and nothing else running and the system was still crunching this morning.

More as and when I find details of interest.
Dunckx
Dateianhänge
yoyo memory problem 290617 2150.jpg
yoyo memory problem 290617 2150.jpg (135.09 KiB) 9563 mal betrachtet

Benutzeravatar
yoyo
Vereinsvorstand
Vereinsvorstand
Beiträge: 8045
Registriert: 17.12.2002 14:09
Wohnort: Berlin
Kontaktdaten:

Re: More P2 problems

#11 Ungelesener Beitrag von yoyo » 02.07.2017 08:18

There is a fix but not yet in any recommended version.

The client estimates an unstarted task's memory usage to be the maximum working set size of already running tasks using the same app version. Previously, if there were no running tasks the estimate was zero bytes. So in this case many of such P2 would be started.

Now if there are no running tasks the estimate is <rsc_memory_bound> which the wu generator sets to 10 GB.
HILF mit im Rechenkraft-WiKi, dies gibts zu tun.
Wiki - FAQ - Verein - Chat

Bild Bild

Dunckx
PDA-Benutzer
PDA-Benutzer
Beiträge: 45
Registriert: 12.11.2014 09:26

Re: More P2 problems

#12 Ungelesener Beitrag von Dunckx » 02.07.2017 20:40

Thanks for the update, Yoyo. I still haven't spent enough time on the log files, but a quick look hasn't shown anything obviously wrong. I'm going to be away for the next couple of days, but will get back to it when I return.

Dunckx

Antworten

Zurück zu „Fehler, Wünsche / Bugs, Wishes“