EF860服务器宕机案例

EF860服务器宕机案例

1、麒麟系统Audit服务导致宕机

问题现象

该台服务器于10月7日凌晨4点38左右宕机,从message日志可以见到大量audit服务的日志:

Oct  7 03:08:05 NFDW4-ARMTSTACK-JSQ-COM102 systemd[1]: session-67676.scope: Succeeded.
Oct  7 03:08:05 NFDW4-ARMTSTACK-JSQ-COM102 audit[3418956]: CRED_DISP pid=3418956 uid=0 auid=0 ses=67675 msg='op=PAM:setcred grantors=pam_kysec,pam_env,pam_tally2,pam_faillock,pam_unix acct="root" exe="/usr/sbin/crond" hostname=? addr=? terminal=cron res=success'
Oct  7 03:08:05 NFDW4-ARMTSTACK-JSQ-COM102 audit[3418956]: USER_END pid=3418956 uid=0 auid=0 ses=67675 msg='op=PAM:session_close grantors=pam_loginuid,pam_keyinit,pam_limits,pam_systemd acct="root" exe="/usr/sbin/crond" hostname=? addr=? terminal=cron res=success'
Oct  7 03:08:05 NFDW4-ARMTSTACK-JSQ-COM102 auditd[3959]: AUDIT:bfree=0,threshold_size=75,fs_space_warning=1
Oct  7 03:08:05 NFDW4-ARMTSTACK-JSQ-COM102 systemd[1]: session-67675.scope: Succeeded.

该问题由系统BUG导致,audit服务将不停吃系统内存,最终导致宕机

处理方法

方法一:升级audit服务

方法二:禁用auditd服务

sudo systemctl stop auditd
sudo systemctl disable auditd

2、内存ECC问题

2.1 lum.log文件

查看lum.log文件,如果ecc_corrected_bit_num有值,说明存在内存CE错误,可建议做一次内存ECC校验

[{"LMU":{"CPU0":{"LMU0":{"ECCCLR":{"description":{"ecc_clr_corr_err":"0","ecc_clr_corr_err_cnt":"0","ecc_clr_uncorr_err":"0","ecc_corr_err_cnt":"0","ecc_uncorr_err_cnt":"0"},"value":"00000000"},"ECCERRCNT":{"description":{"ecc_corr_err_cnt":"0","ecc_uncorr_err_cnt":"0"},"value":"00000000"},"ECCSTAT":{"description":{"ecc_corrected_bit_num":"83","ecc_corrected_err":"38","ecc_uncorrected_err":"0"},"value":"00002653"}},"LMU1":{"ECCCLR":{"description":{"ecc_clr_corr_err":"0","ecc_clr_corr_err_cnt":"0","ecc_clr_uncorr_err":"0","ecc_corr_err_cnt":"0","ecc_uncorr_err_cnt":"0"},"value":"00000000"},"ECCERRCNT":{"description":{"ecc_corr_err_cnt":"0","ecc_uncorr_err_cnt":"0"},"value":"00000000"},"ECCSTAT":{"description":{"ecc_corrected_bit_num":"0","ecc_corrected_err":"0","ecc_uncorrected_err":"0"},"value":"00000000"}},"LMU2":{"ECCCLR":{"description":{"ecc_clr_corr_err":"0","ecc_clr_corr_err_cnt":"0",

2.2


EF860服务器宕机案例
http://gsproj.github.io/2023/11/20/02_测试/05_EF860服务器宕机案例/
作者
GongSheng
发布于
2023年11月20日
许可协议