分析IO 高案例

案例分析

1.首先,执行 top 发现负载很高,其次,发送CPU 利用率比较低,但是iowait 很高,高达63.4,再次发现进程中占CPU和内存最高的是一个python 应用。进程ID 5593

1
2
3
4
5
6
7
8
9
10
11
12
13
top - 15:21:09 up 22:38,  2 users,  load average: 2.54, 2.27, 1.35
Tasks: 86 total, 3 running, 50 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.2 us, 16.4 sy, 0.0 ni, 16.9 id, 63.4 wa, 0.0 hi, 0.0 si, 0.2 st
KiB Mem : 8156288 total, 5366112 free, 1109200 used, 1680976 buff/cache
KiB Swap: 8388604 total, 8388080 free, 524 used. 6763836 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5593 root 20 0 963312 934276 5320 R 37.5 11.5 4:49.05 python
5648 root 20 0 0 0 0 D 1.3 0.0 0:03.47 kworker/u128:2
1043 root 20 0 1044268 64328 27400 S 0.7 0.8 2:02.97 dockerd
8 root 20 0 0 0 0 R 0.3 0.0 0:00.96 rcu_sched
3937 root 20 0 103864 7148 6148 S 0.3 0.1 0:00.60 sshd
1 root 20 0 225036 8624 6508 S 0.0 0.1 0:02.31 systemd

2.使用 iostat -d -x 1 查看所有磁盘的io情况,发现 vda 的util 高达98% 接近饱和,其次发现IO高的原因是因为往磁盘不停的写数据,平均每秒写入数据 105856KB

3.我们通过pidstat -d 1 查看是哪个进程占用IO高,发现果然是这个python 进程。

4.然后通过 strace 追踪Python 进程 5593 发现其一直在调用write函数向文件描述符变化为 3的文件写入文件 平均每秒写入数据 300 MB(314572844/1024/1024) 观察stat,他打开的是/tmp/logtest.txt 文件

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
root@linux:~# strace -p 5593
strace: Process 5593 attached
stat("/tmp/logtest.txt", {st_mode=S_IFREG|0644, st_size=943718535, ...}) = 0
rename("/tmp/logtest.txt", "/tmp/logtest.txt.1") = 0
open("/tmp/logtest.txt", O_WRONLY|O_CREAT|O_APPEND|O_CLOEXEC, 0666) = 3
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lseek(3, 0, SEEK_END) = 0
ioctl(3, TIOCGWINSZ, 0x7ffcc341db60) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
ioctl(3, TIOCGWINSZ, 0x7ffcc341da80) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f016accf000
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f01580ce000
write(3, "2019-03-29 07:26:28,703 - __main"..., 314572844) = 314572844
munmap(0x7f01580ce000, 314576896) = 0
write(3, "\n", 1) = 1
munmap(0x7f016accf000, 314576896) = 0
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
getpid() = 1
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f016accf000
mmap(NULL, 393220096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f01535ce000
mremap(0x7f01535ce000, 393220096, 314576896, MREMAP_MAYMOVE) = 0x7f01535ce000
munmap(0x7f016accf000, 314576896) = 0
lseek(3, 0, SEEK_END) = 314572845
lseek(3, 0, SEEK_CUR) = 314572845
munmap(0x7f01535ce000, 314576896) = 0
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f016accf000
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f01580ce000
write(3, "2019-03-29 07:26:37,480 - __main"..., 314572844) = 314572844
munmap(0x7f01580ce000, 314576896) = 0

5.我们进一步确认,使用lsof 查看该进程调用情况,发现其确实打开了文件/tmp/logtest.txt,这个输出界面 3w 表示3号文件描述符的权限为写(w)

1
2
3
4
5
6
7
8
9
10
root@linux:~# lsof -p 5593
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 5593 root cwd DIR 0,51 4096 131090 /
python 5593 root rtd DIR 0,51 4096 131090 /
python 5593 root txt REG 0,51 28016 266366 /usr/local/bin/python3.7
python 5593 root mem REG 252,1 266366 /usr/local/bin/python3.7 (stat: No such file or directory)
*****
python 5593 root 1u CHR 136,0 0t0 3 /dev/pts/0
python 5593 root 2u CHR 136,0 0t0 3 /dev/pts/0
python 5593 root 3w REG 252,1 916045824 2914 /tmp/logtest.txt