The Linux OOM mechanism

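When the Linux kernel detects that the system is out of memory, it picks a process and kills it. The code for this lives in linux/mm/oom_kill.c in the kernel source. When memory runs out, out_of_memory() is triggered and calls select_bad_process() to choose a "bad" process to kill. How does it decide which process is "bad"? It can't just pick one at random. The choice is made by oom_badness(), and the idea behind it is simple and down to earth: the "baddest" process is the one using the most memory, with its score then shifted up or down by the process's oom_score_adj value.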
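To make the scoring concrete, here is a minimal Python sketch of roughly what oom_badness() computes in kernels of the 3.18 era (the version that shows up in the log further down). It is a simplification, not the kernel code: the names badness and displayed_score, their parameters, and any total_pages value you pass in are made up for illustration, and other kernel versions differ in the details.

```python
OOM_SCORE_ADJ_MIN = -1000  # a task with this oom_score_adj is never selected


def badness(rss_pages, pte_pages, swap_pages, oom_score_adj,
            total_pages, is_root=False):
    """Approximate the raw badness points of one task (all sizes in pages)."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return 0  # 0 means "not eligible for killing"

    # Baseline: how much memory the task uses (RSS + page tables + swap).
    points = rss_pages + pte_pages + swap_pages

    # Assumption: privileged (root) tasks get a 3% discount, as in 3.18.
    if is_root:
        points -= (points * 3) // 100

    # oom_score_adj is expressed in thousandths of the total memory.
    points += oom_score_adj * (total_pages // 1000)

    # An eligible task never scores 0, so clamp to at least 1.
    return points if points > 0 else 1


def displayed_score(points, total_pages):
    """Normalize raw points to the 0..1000-ish scale printed in the log."""
    return points * 1000 // total_pages
```

With a hypothetical total_pages of 1,000,000, a small root process with oom_score_adj 1000 ends up far ahead of a much larger process with oom_score_adj -998, whose points are clamped to 1. That is why oom_score_adj can outweigh raw memory usage.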

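Out-of-memory problems are usually caused by applications requesting a large amount of memory at some moment, leaving the system short of memory. This normally triggers the Out of Memory (OOM) killer in the Linux kernel, which kills a process to free up memory for the system so that the system doesn't crash right away. Running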

dmesg -T

produces output similar to the following:

[Tue Jan  9 12:04:19 2018] [54445]  1001 54445    72525     2440     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54446]  1001 54446    72531     2419     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54447]  1001 54447    72531     2419     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54448]  1001 54448    73061     2521     126        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54449]  1001 54449    72531     2419     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54450]  1001 54450    72531     2419     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54453]  1001 54453    72531     2419     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54454]  1001 54454    72531     2515     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54459]  1001 54459    72531     2401     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54460]  1001 54460    72531     2401     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54461]  1001 54461    72531     2404     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54462]  1001 54462    72531     2403     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54463]  1001 54463    72531     2400     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54464]  1001 54464    72531     2399     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54465]  1001 54465    72531     2401     124        0          -998 httpd
[Tue Jan  9 12:04:19 2018] [54481]     0 54481     9141      125      14        0          1000 combadvisor
[Tue Jan  9 12:04:19 2018] [54488]     0 54488     1029       20       8        0          1000 comb.sh
[Tue Jan  9 12:04:19 2018] [54490]     0 54490     4436       80      13        0          1000 bash
[Tue Jan  9 12:04:19 2018] [54491]     0 54491     1050       23       7        0         -1000 sh
[Tue Jan  9 12:04:19 2018] Out of memory: Kill process 53269 (combdeploy) score 1003 or sacrifice child
[Tue Jan  9 12:04:19 2018] Killed process 54488 (comb.sh) total-vm:4116kB, anon-rss:80kB, file-rss:0kB
[Tue Jan  9 12:04:19 2018] sh invoked oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=-1000
[Tue Jan  9 12:04:19 2018] sh cpuset=system mems_allowed=0
[Tue Jan  9 12:04:19 2018] CPU: 1 PID: 54491 Comm: sh Not tainted 3.18.20-nce-amd64 #35
[Tue Jan  9 12:04:19 2018] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[Tue Jan  9 12:04:19 2018]  0000000000000000 0000000000000000 ffffffff8142f7ed ffff88002d8d0a50
[Tue Jan  9 12:04:19 2018]  ffffffff8142d8be ffffffff81089448 0000000000000000 0000000000000001
[Tue Jan  9 12:04:19 2018]  ffff88008ffdcb00 ffff88008ffdcb00 ffffffff8109e22a ffffffff818b2030
[Tue Jan  9 12:04:19 2018] Call Trace:
[Tue Jan  9 12:04:19 2018]  [<ffffffff8142f7ed>] ? dump_stack+0x41/0x51
[Tue Jan  9 12:04:19 2018]  [<ffffffff8142d8be>] ? dump_header+0x6f/0x1e2
[Tue Jan  9 12:04:19 2018]  [<ffffffff81089448>] ? rcu_batches_completed+0x8/0x8
[Tue Jan  9 12:04:19 2018]  [<ffffffff8109e22a>] ? smp_call_function_single+0x6d/0x82
[Tue Jan  9 12:04:19 2018]  [<ffffffff8143361e>] ? _raw_spin_unlock_irqrestore+0xc/0xd
[Tue Jan  9 12:04:19 2018]  [<ffffffff810e4241>] ? oom_kill_process+0x72/0x2f0
[Tue Jan  9 12:04:19 2018]  [<ffffffff810e3ffd>] ? find_lock_task_mm+0x1e/0x6b
[Tue Jan  9 12:04:19 2018]  [<ffffffff810e4a4c>] ? out_of_memory+0x42f/0x462
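The per-process rows at the top of this output can be pulled apart with a short Python sketch. The column header is not part of the excerpt, so the field names below are an assumption based on the header that kernels of this era print (pid, uid, tgid, total_vm, rss, nr_ptes, swapents, oom_score_adj, name). The names OomRow and parse_row are hypothetical, and the 4 KiB page size is also an assumption, although it agrees with the log: comb.sh has total_vm 1029 pages, and 1029 × 4 = 4116 kB is exactly the total-vm in the "Killed process" line.

```python
import re
from typing import NamedTuple, Optional

PAGE_KIB = 4  # assumed page size in KiB

# "[pid] uid tgid total_vm rss nr_ptes swapents oom_score_adj name"
ROW = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(-?\d+)\s+(\S+)\s*$"
)


class OomRow(NamedTuple):
    pid: int
    uid: int
    tgid: int
    total_vm: int       # pages
    rss: int            # pages
    nr_ptes: int        # page-table pages
    swapents: int       # swap entries
    oom_score_adj: int
    name: str


def parse_row(line: str) -> Optional[OomRow]:
    """Return the parsed row, or None if the line is not a process row."""
    m = ROW.search(line)
    if m is None:
        return None
    *numbers, name = m.groups()
    return OomRow(*(int(n) for n in numbers), name)


line = ("[Tue Jan  9 12:04:19 2018] [54488]     0 54488     1029"
        "       20       8        0          1000 comb.sh")
row = parse_row(line)
if row is not None:
    print(row.name, row.rss * PAGE_KIB, "KiB resident, adj", row.oom_score_adj)
```

Feeding every line of the dmesg output through parse_row and sorting by oom_score_adj or rss gives a quick view of which processes the kernel was most likely to pick.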

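In this output, the line

Out of memory: Kill process 53269 (combdeploy)

shows that process 53269 (combdeploy), with a score of 1003, was the first one selected as the victim. Note that the full line ends with "or sacrifice child": the next line, "Killed process 54488 (comb.sh)", shows that the process actually killed was one of its children, not 53269 itself.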
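Since the log prints the selected process and the process that was really killed on two separate lines, it can help to pair them up. Below is a small Python sketch that does this for the two message formats seen in the log above. The function name kill_events and the dictionary keys are hypothetical, and the regular expressions only cover the message wording of this kernel version.

```python
import re

SELECTED = re.compile(r"Out of memory: Kill process (\d+) \((.+?)\) score (\d+)")
KILLED = re.compile(r"Killed process (\d+) \((.+?)\)")


def kill_events(lines):
    """Pair each 'Kill process' selection with the 'Killed process' after it."""
    events = []
    pending = None  # selection still waiting for its 'Killed process' line
    for line in lines:
        m = SELECTED.search(line)
        if m:
            pending = {
                "selected_pid": int(m.group(1)),
                "selected_name": m.group(2),
                "score": int(m.group(3)),
                "killed_pid": None,
                "killed_name": None,
            }
            events.append(pending)
            continue
        m = KILLED.search(line)
        if m and pending is not None:
            pending["killed_pid"] = int(m.group(1))
            pending["killed_name"] = m.group(2)
            pending = None
    return events


log = [
    "[Tue Jan  9 12:04:19 2018] Out of memory: Kill process 53269 (combdeploy) score 1003 or sacrifice child",
    "[Tue Jan  9 12:04:19 2018] Killed process 54488 (comb.sh) total-vm:4116kB, anon-rss:80kB, file-rss:0kB",
]
for event in kill_events(log):
    print(event)
```

When selected_pid and killed_pid differ, as they do here, the kernel sacrificed a child of the selected process rather than the process itself.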