Linux Crash Troubleshooting: A Summary of Methods
2023-02-07
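Crashes are usually caused by CPU, memory, or disk I/O problems, application bugs, kernel bugs, or hardware faults. The troubleshooting workflow is as follows.
Basic information gathering
Using CentOS as an example: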
#!/bin/bash
# Collect basic system information on CentOS.
# Run as root: getent shadow and fdisk -l need elevated privileges.
echo "OS release: $(cat /etc/redhat-release)"
echo "Kernel: $(uname -a)"
echo "SELinux mode: $(getenforce)"
echo -e "Users:\n$(getent passwd)"
echo -e "Password entries:\n$(getent shadow)"
echo -e "Network:\n$(ip addr show)"
echo "CPU model: $(grep name /proc/cpuinfo | cut -f2 -d: | uniq -c)"
echo "Physical CPUs: $(grep 'physical id' /proc/cpuinfo | sort | uniq | wc -l)"
echo "Logical CPUs: $(grep -c processor /proc/cpuinfo)"
echo "Cores per CPU: $(grep cores /proc/cpuinfo | uniq)"
echo -e "CPU summary:\n$(lscpu)"
echo -e "Disk UUIDs:\n$(blkid)"
echo -e "Disks:\n$(fdisk -l | grep -E '/dev|Disk')"
echo -e "Partitions:\n$(lsblk)"
echo -e "Disk space:\n$(df -h)"
echo -e "Mounts:\n$(mount -l)"
echo -e "fstab entries:\n$(grep -Ev '#|^$' /etc/fstab)"
Print the number of sockets, cores, and threads on the current machine
#!/bin/bash
# Simple print cpu topology
# Author: kodango

function get_nr_processor()
{
    grep '^processor' /proc/cpuinfo | wc -l
}

function get_nr_socket()
{
    grep 'physical id' /proc/cpuinfo | awk -F: '{ print $2 | "sort -un" }' | wc -l
}

function get_nr_siblings()
{
    grep 'siblings' /proc/cpuinfo | awk -F: '{ print $2 | "sort -un" }'
}

function get_nr_cores_of_socket()
{
    grep 'cpu cores' /proc/cpuinfo | awk -F: '{ print $2 | "sort -un" }'
}

echo '===== CPU Topology Table ====='
echo

echo '+--------------+---------+-----------+'
echo '| Processor ID | Core ID | Socket ID |'
echo '+--------------+---------+-----------+'

# Each processor entry in /proc/cpuinfo ends with a blank line;
# print one table row when that blank line is reached.
while read line; do
    if [ -z "$line" ]; then
        printf '| %-12s | %-7s | %-9s |\n' "$p_id" "$c_id" "$s_id"
        echo '+--------------+---------+-----------+'
        continue
    fi

    if echo "$line" | grep -q "^processor"; then
        p_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`
    fi

    if echo "$line" | grep -q "^core id"; then
        c_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`
    fi

    if echo "$line" | grep -q "^physical id"; then
        s_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`
    fi
done < /proc/cpuinfo

echo

# List the logical processors that belong to each socket.
awk -F: '{
    if ($1 ~ /processor/) {
        gsub(/ /, "", $2);
        p_id = $2;
    } else if ($1 ~ /physical id/) {
        gsub(/ /, "", $2);
        s_id = $2;
        arr[s_id] = arr[s_id] " " p_id
    }
}
END {
    for (i in arr)
        printf "Socket %s:%s\n", i, arr[i];
}' /proc/cpuinfo

echo
echo '===== CPU Info Summary ====='
echo

nr_processor=`get_nr_processor`
echo "Logical processors: $nr_processor"

nr_socket=`get_nr_socket`
echo "Physical socket: $nr_socket"

nr_siblings=`get_nr_siblings`
echo "Siblings in one socket: $nr_siblings"

nr_cores=`get_nr_cores_of_socket`
echo "Cores in one socket: $nr_cores"

let nr_cores*=nr_socket
echo "Cores in total: $nr_cores"

if [ "$nr_cores" = "$nr_processor" ]; then
    echo "Hyper-Threading: off"
else
    echo "Hyper-Threading: on"
fi

echo
echo '===== END ====='
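As a shorter cross-check of the script above, the same three counts can be derived from lscpu's parseable output. This is only a sketch: the function name cpu_counts is made up for illustration, and it assumes the input comes from `lscpu -p=CPU,CORE,SOCKET` (comma-separated, with comment lines starting with #).

```shell
#!/bin/bash
# cpu_counts (hypothetical helper): count sockets, cores and threads.
# stdin: output of `lscpu -p=CPU,CORE,SOCKET`, one logical CPU per line.
cpu_counts() {
    awk -F, '
        /^#/ { next }                 # skip the comment header lscpu prints
        NF >= 3 {
            threads++                 # every data line is one logical CPU
            cores[$3 "," $2] = 1      # a core is unique per (socket, core) pair
            sockets[$3] = 1
        }
        END {
            for (k in cores) nc++
            for (k in sockets) ns++
            printf "sockets=%d cores=%d threads=%d\n", ns, nc, threads
        }'
}

# Usage on a real machine:
#   lscpu -p=CPU,CORE,SOCKET | cpu_counts
```

If threads is larger than cores, Hyper-Threading (SMT) is on, which matches the check at the end of the script above.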
Check crash time records, login history, and reboot times
$ last reboot
reboot   system boot  3.10.0-1160.71.1 Fri Oct 14 09:33 - 09:33 (116+23:59)

wtmp begins Fri Oct 14 09:33:25 2022

$ last -F | grep crash
Check the login history for abnormal users
$ last
root     pts/0        36.27.66.128     Wed Feb  8 09:33   still logged in
reboot   system boot  3.10.0-1160.71.1 Fri Oct 14 09:33 - 09:34 (117+00:00)

wtmp begins Fri Oct 14 09:33:25 2022
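One way to tell a crash from a normal restart is to check whether a shutdown record was written before each boot. The sketch below is a heuristic, not a definitive test: find_unclean_boots is a hypothetical name, and it assumes `last -x` style input (newest entry first) in which clean shutdowns appear as lines starting with "shutdown".

```shell
#!/bin/bash
# find_unclean_boots (hypothetical helper): flag boots with no shutdown
# record before them. stdin: `last -x` output, newest entry first.
find_unclean_boots() {
    awk '
        $1 == "reboot" || $1 == "shutdown" {
            # Two "reboot" records in a row mean nothing was logged between
            # the older boot and the newer one, so the newer boot probably
            # followed a crash or power loss.
            if (prev == "reboot" && $1 == "reboot")
                print "unclean shutdown before boot: " prev_line
            prev = $1
            prev_line = $0
        }'
}

# Usage on a real machine:
#   last -x | find_unclean_boots
```

The oldest boot in wtmp is never flagged, because the record before it has been rotated away.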
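Check system logs
Look at the logs under /var/log/, including messages and the kernel error log (dmesg). The sa files under /var/log/sa/ are performance records written by sysstat; they capture CPU, memory, and other metrics at regular intervals, so they show the machine's state in the run-up to the crash.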
# Use the sa file to view CPU usage around the time of the crash
$ sar -u -f /var/log/sa/sa27 | more
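Paging through a full day of samples is slow, so it can help to pull out the single sample where a column peaks. The following is a minimal sketch; sar_peak is a hypothetical helper, and it assumes the header line names the column exactly as sar prints it (for example %iowait or %memused).

```shell
#!/bin/bash
# sar_peak (hypothetical helper): print the sample line with the highest
# value in the named column. stdin: sar text output; $1: column name.
sar_peak() {
    local column="$1"
    awk -v col="$column" '
        # Until the header is found, look for the requested column name.
        idx == 0 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            next
        }
        # Skip short lines (blank, LINUX RESTART), repeated headers and the
        # trailing Average row, whose fields are shifted by one.
        NF < idx || $idx == col || $1 ~ /^Average/ { next }
        !seen || ($idx + 0) > max { max = $idx + 0; best = $0; seen = 1 }
        END { if (seen) print best }'
}

# Usage on a real machine:
#   sar -u -f /var/log/sa/sa27 | sar_peak %iowait
#   sar -r -f /var/log/sa/sa27 | sar_peak %memused
```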
# Use the sa file to view memory usage around the time of the crash
$ sar -r -f /var/log/sa/sa27
Linux 3.10.0-1160.71.1.el7.x86_64 (chengdu-4-4-8)  01/27/2023  _x86_64_  (4 CPU)

12:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
12:10:01 AM    421372   3361192     88.86    147592   1458608   8076460    213.52   2380496    558808       600
12:20:01 AM    421500   3361064     88.86    147592   1459308   8075748    213.50   2381376    558792       840

Logs are often very large, so it also helps to filter them by keyword, for example:
tail -n 200 /var/log/messages | grep "Error"
grep "Error" /var/log/dmesg
# Look for kernel crash messages
tail -n 200 /var/log/messages | grep "crash"
# Check for OOM; the kernel usually logs that it killed a process
grep -i "kill" /var/log/messages
# You can also restrict the search to the time of the crash, e.g. 15:00-15:59 on Feb 11
grep "Feb 11 15:" /var/log/messages
dmesg -T                   # kernel log with human-readable timestamps
dmesg -T | grep memory     # memory-related entries
dmesg -T | grep crash      # crash-related entries
dmesg -T | grep reboot     # reboot-related entries
cat /var/log/dmesg         # kernel log
cat /var/log/syslog        # system log
cat /var/log/kernel.log    # kernel log (kern.log on Ubuntu)
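The keyword and time-window filters above can be combined into one step. This is a sketch under stated assumptions: errors_in_hour is a hypothetical helper, the file is in classic syslog format with the timestamp at the start of each line, and the keyword list is only a starting point.

```shell
#!/bin/bash
# errors_in_hour (hypothetical helper): show error-like lines whose
# timestamp starts with a given prefix.
# $1: log file, $2: timestamp prefix such as "Feb 11 15:"
errors_in_hour() {
    local file="$1" prefix="$2"
    # index(...) == 1 keeps only lines that BEGIN with the prefix, so a
    # date mentioned later in a message body does not match.
    awk -v p="$prefix" 'index($0, p) == 1' "$file" \
        | grep -Ei 'error|panic|oom-killer|out of memory|killed process|call trace'
}

# Usage on a real machine:
#   errors_in_hour /var/log/messages 'Feb 11 15:'
```

Syslog pads single-digit days with an extra space, so the prefix for February 8 is 'Feb  8 09:' with two spaces.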
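Check memory and CPU usage
# Show memory in MiB
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3693        1220         346           8        2127        2174
Swap:             0           0           0

# Show detailed low/high memory statistics
$ free -l
              total        used        free      shared  buff/cache   available
Mem:        2046508      119600      802596         524     1124312     1732368
Low:        2046508     1243912      802596
High:             0           0           0
Swap:             0           0           0

# Top ten processes by memory usage
$ ps aux | head -1; ps aux | grep -v PID | sort -rn -k4 | head
# Top ten processes by CPU usage
$ ps aux | head -1; ps aux | grep -v PID | sort -rn -k3 | head

Check swap usage, remaining memory, and cache. If swap is in use and available is also low, look at the vm.swappiness parameter with cat /proc/sys/vm/swappiness. A value of 0 tells the kernel to avoid swapping as far as it can, so swap usage under that setting means memory is genuinely insufficient.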
$ cat /proc/sys/vm/swappiness
30

$ vmstat -d
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
vda    45186    413 1458046  231569  15231  32912 1406417  147885      0    117
sr0      111      0     808      29      0      0       0       0      0      0

$ vmstat -a
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 801700 435464 633528    0    0   295   285  231  291  3  2 91  4  0

$ vmstat -s
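To turn the available figure into a number that a script can alert on, the percentage can be read from /proc/meminfo directly. A minimal sketch, assuming the kernel exposes the MemAvailable field (present on the CentOS 7 kernel shown above); mem_available_pct is a hypothetical name.

```shell
#!/bin/bash
# mem_available_pct (hypothetical helper): print MemAvailable as an integer
# percentage of MemTotal. $1: meminfo-format file (default /proc/meminfo).
mem_available_pct() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            # Fail instead of printing a misleading 0 when a field is missing.
            if (total > 0 && avail != "") printf "%d\n", avail * 100 / total
            else exit 1
        }' "$meminfo"
}

# Usage on a real machine (the 10% threshold is an arbitrary example):
#   pct=$(mem_available_pct) && [ "$pct" -lt 10 ] && echo "low memory: ${pct}% available"
```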

Check I/O and filesystem usage
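Watch %idle and %iowait. Disk writes go through an in-memory cache, which the author puts at typically about 40% of system memory (the actual limit is set by vm.dirty_ratio and varies by distribution). When that cache is close to full under heavy I/O, writing processes block while data is flushed to disk; a task that stays blocked for 120 seconds triggers the kernel's hung-task warning, and the system can appear frozen. Also check the I/O read and write rates: if they are very low, the disk has become the bottleneck.
$ iostat
Linux 3.10.0-1160.71.1.el7.x86_64 (chengdu-4-4-8)  02/08/2023  _x86_64_  (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.30    0.00    0.14    0.02    0.00   99.54

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda               4.63         2.62        38.68   26442261  391074212
scd0              0.00         0.00         0.00       8536          0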
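The iostat figures above are averages since boot, which can hide a recent spike. A rough way to measure iowait over a short window is to diff two samples of the aggregate cpu line in /proc/stat. This is a sketch; iowait_pct is a hypothetical helper, and it relies on the documented field order of that line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/bash
# iowait_pct (hypothetical helper): iowait share between two samples of
# the aggregate "cpu" line from /proc/stat.
# $1: earlier sample, $2: later sample
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9 are user..steal; guest fields are already counted
            # inside user/nice, so they are left out.
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            t[NR] = total
            w[NR] = $6               # field 6 is iowait
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) exit 1
            printf "%.1f\n", (w[2] - w[1]) * 100 / dt
        }'
}

# Usage on a real machine (5-second window):
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```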
Filesystem usage
$ df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  1.8G     0  1.8G   0% /dev
tmpfs          tmpfs     1.9G   24K  1.9G   1% /dev/shm
tmpfs          tmpfs     1.9G  1.3M  1.9G   1% /run
tmpfs          tmpfs     1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/vda1      ext4       79G   27G   49G  36% /
tmpfs          tmpfs     370M   76K  370M   1% /run/user/0
overlay        overlay    79G   27G   49G  36% /var/lib/sealer/tmp/.DTmp-426739144
overlay        overlay    79G   27G   49G  36% /var/lib/sealer/tmp/.DTmp-994870646
overlay        overlay    79G   27G   49G  36% /var/lib/sealer/tmp/.DTmp-638239670
overlay        overlay    79G   27G   49G  36% /var/lib/sealer/tmp/.DTmp-986101506
overlay        overlay    79G   27G   49G  36% /var/lib/sealer/tmp/.DTmp-567118555
overlay        overlay    79G   27G   49G  36% /var/lib/sealer/tmp/.DTmp-363896476
overlay        overlay    79G   27G   49G  36% /var/lib/sealer/tmp/.DTmp-535554540
overlay        overlay    79G   27G   49G  36% /var/lib/sealer/tmp/.DTmp-112860269
overlay        overlay    79G   27G   49G  36% /var/lib/sealer/tmp/.DTmp-435463087
overlay        overlay    79G   27G   49G  36% /var/lib/sealer/tmp/.DTmp-221902271
overlay        overlay    79G   27G   49G  36% /var/lib/docker/overlay2/6d75ed0a9c680400d4859b8ccc177274e0573ca0e809bb399914b83bac4f6b55/merged
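With many overlay mounts in the listing, a full filesystem is easy to miss by eye. The sketch below filters df output by a usage threshold; fs_over_threshold is a hypothetical helper, and it assumes POSIX-format input from `df -P`, where the fifth column is the usage percentage and the sixth is the mount point.

```shell
#!/bin/bash
# fs_over_threshold (hypothetical helper): list mount points whose usage
# is at or above a threshold. stdin: `df -P` output; $1: percent (default 90).
fs_over_threshold() {
    local limit="${1:-90}"
    awk -v limit="$limit" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%$/, "", use)        # "95%" -> "95"
            if (use + 0 >= limit) print $6, $5
        }'
}

# Usage on a real machine:
#   df -P | fs_over_threshold 90
```

Mount points that contain spaces would be cut at the first space, which is acceptable for a quick check but not for a general tool.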
Check security logs
Review the shell history and the security logs to see whether someone logged in to the host and did something malicious, such as shutting it down.
$ ll /var/log/ | grep secure
-rw------- 1 root root 40025429 Feb  8 09:44 secure
-rw------- 1 root root  3167268 Nov  9 03:38 secure-202211091667936341.gz
-rw------- 1 root root  3487709 Nov 16 03:36 secure-202211161668541021.gz
-rw------- 1 root root  3348977 Nov 25 03:42 secure-202211251669318981.gz
-rw------- 1 root root  3184127 Dec  4 03:44 secure-202212041670096701.gz
-rw------- 1 root root  3473413 Dec 13 03:25 secure-202212131670873161.gz
-rw------- 1 root root  3428012 Dec 21 03:43 secure-202212211671565381.gz
-rw------- 1 root root  3279661 Dec 28 03:39 secure-202212281672169942.gz
-rw------- 1 root root  3414612 Jan  6 03:16 secure-202301061672946221.gz
-rw------- 1 root root  3245695 Jan 14 03:13 secure-202301141673637181.gz
-rw------- 1 root root  3363629 Jan 22 03:47 secure-202301221674330541.gz
-rw------- 1 root root  3485193 Feb  1 03:34 secure-202302011675193642.gz
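A 40 MB secure log usually points to brute-force login attempts, so a summary per source address is a reasonable first look. This is a sketch: failed_logins_by_ip is a hypothetical helper, and it assumes sshd's usual "Failed password for ... from <IPv4> port ..." message format.

```shell
#!/bin/bash
# failed_logins_by_ip (hypothetical helper): count failed SSH password
# attempts per source IPv4 address, busiest first.
# $1: log file (default /var/log/secure)
failed_logins_by_ip() {
    local log="${1:-/var/log/secure}"
    grep 'Failed password' "$log" \
        | grep -oE 'from [0-9]+(\.[0-9]+){3}' \
        | awk '{ print $2 }' \
        | sort | uniq -c | sort -rn
}

# Usage on a real machine:
#   failed_logins_by_ip /var/log/secure | head
# Successful logins and shutdown commands are worth checking too:
#   grep 'Accepted' /var/log/secure
#   grep -E 'shutdown|reboot|poweroff|init 0' ~/.bash_history
```

The rotated .gz files can be searched without unpacking by using zgrep in place of grep.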
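Analyze the kernel with kdump and crash
# Install the tools
$ yum install -y kexec-tools crash
When a kernel crash occurs, kdump saves the crashed kernel's memory image and dump information to the configured directory. Open the saved image with the crash tool and run the bt command to view the stack trace, from which the cause of the crash can be worked out.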
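The steps above can be sketched as follows. Everything here is an assumption to verify on the target machine: kdump's dump directory is taken to be /var/crash (the path setting in /etc/kdump.conf), the dump file is assumed to be named vmcore, and latest_vmcore is a hypothetical helper. It uses GNU find's -printf.

```shell
#!/bin/bash
# latest_vmcore (hypothetical helper): print the most recently modified
# vmcore file under a dump directory. $1: directory (default /var/crash).
latest_vmcore() {
    local dir="${1:-/var/crash}"
    # Prefix each path with its modification time, sort newest first,
    # keep the first hit and strip the timestamp again.
    find "$dir" -type f -name vmcore -printf '%T@ %p\n' 2>/dev/null \
        | sort -rn | head -n 1 | cut -d' ' -f2-
}

# Usage on a real machine:
#   systemctl enable --now kdump      # needs crashkernel= memory reserved at boot
#   # The vmlinux below comes from the kernel-debuginfo package and must
#   # match the kernel that crashed:
#   crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux "$(latest_vmcore)"
#   crash> bt      # stack trace of the task that panicked
#   crash> log     # kernel message buffer at the time of the crash
```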
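Check monitoring software
If the monitoring data shows per-process resource usage at the time of the crash, look at the logs of whichever service was behaving abnormally. On a cloud provider, open the console's historical monitoring graphs and compare the peaks with the time of the crash.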
Tool summary
