Using Python with MegaCLI to check RAID arrays on a schedule and send SMS/phone alerts on failure
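One of our servers, running a RAID10 array of 10 disks, went down: three disks had failed, which exceeded what the RAID10 array could tolerate, so the data could no longer be recovered from the array. Three disks failing at the same moment is very unlikely, so it is quite possible that one or two had failed earlier without anyone noticing, and the problem only surfaced when the third failure took the machine down.
We have many machines, so a manual check every day is time-consuming, while a weekly check may leave too long a gap. I therefore wrote a small Python script that can run on a schedule, for example once an hour, and notifies the technical staff as soon as a fault appears, so they can investigate and replace the disk before disaster strikes.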
1. Install MegaCLI
# wget https://raw.githubusercontent.com/crazy-zhangcong/tools/master/MegaCli8.07.10.tar.gz && tar -zxf MegaCli8.07.10.tar.gz && cd MegaCli8.07.10/Linux/ && rpm -ivh Lib_Utils-1.00-09.noarch.rpm MegaCli-8.02.21-1.noarch.rpm && ln -s /opt/MegaRAID/MegaCli/MegaCli64 /usr/local/bin/MegaCli && MegaCli -v
If the following output appears, the installation completed normally:
MegaCLI SAS RAID Management Tool Ver 8.02.21 Oct 21, 2011
(c)Copyright 2011, LSI Corporation, All Rights Reserved.
Exit Code: 0x00
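The same check can be done from Python before relying on the monitoring script. The following is a minimal sketch, assuming the binary sits at the path used in the install command above; the helper names `exit_code_from_output` and `megacli_works` are hypothetical and not part of MegaCLI.

```python
import re
import subprocess

# Path taken from the install step above; adjust if MegaCli lives elsewhere.
MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"


def exit_code_from_output(text):
    """Return the integer after 'Exit Code:' in MegaCli output, or None if absent."""
    match = re.search(r"Exit Code:\s*(0x[0-9a-fA-F]+)", text)
    return int(match.group(1), 16) if match else None


def megacli_works(binary=MEGACLI):
    """Run `<binary> -v` and report whether it printed 'Exit Code: 0x00'."""
    try:
        out = subprocess.check_output([binary, "-v"], stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError):
        # Binary missing, not executable, or it returned a non-zero status.
        return False
    return exit_code_from_output(out.decode("utf-8", "replace")) == 0
```

Calling `megacli_works()` at the top of the monitoring script would let it bail out early, instead of silently reporting "no errors" because the command produced no output.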
2. Python script
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import requests

node = 'Broadband VPS host server 1'  # name of this node, used in the alert text
error = 0


def get_status(value):
    # Return the text after ": " in a "key: value" line of MegaCli output
    status = value.split(": ")
    return status[1]


def send_warning():
    global node
    # Voice-call alert, using the voice notification API of https://www.mysubmail.com
    voice_url = 'https://api.mysubmail.com/voice/send.json'
    voice_params = {'appid': '',
                    'to': '13200000000',
                    'content': 'Emergency: disk status on ' + node + ' is abnormal, check immediately',
                    'signature': ''  # fill in the app key
                    }
    voice_res = requests.post(voice_url, data=voice_params)
    # print(voice_res.text)
    # SMS alert, using the SMS API of https://www.mysubmail.com
    message_url = 'https://api.mysubmail.com/message/send.json'
    message_params = {'appid': '',
                      'to': '13200000000',
                      'content': '【xx Tech】Emergency: disk status on ' + node + ' is abnormal, check immediately',
                      'signature': ''  # fill in the app key
                      }
    message_res = requests.post(message_url, data=message_params)
    # print(message_res.text)
# Check the RAID (logical drive) state
# raidinfos = open('raid.log','r')
# for raidinfo in raidinfos.readlines():
raidinfos = os.popen('/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL -NoLOG').readlines()
for raidinfo in raidinfos:
    raidinfo = raidinfo.strip('\n')
    if "State" in raidinfo:
        status = get_status(raidinfo)
        if status != 'Optimal':
            error = 1
            print(raidinfo+'\n')
# Check the state of every physical disk
# pdlist = open('raid_pdlist.log','r')
# for line in pdlist.readlines():
pdlist = os.popen('/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL -NoLOG').readlines()
for line in pdlist:
    line = line.strip('\n')
    if "Media Error Count" in line:
        status = get_status(line)
        status = int(status)
        if status != 0:
            error = 1
            print(line)
    if "Other Error Count" in line:
        status = get_status(line)
        status = int(status)
        # 1158 is the value hard-coded in the original script, presumably the
        # known baseline on the author's machine; adjust it for your own disks
        if status != 1158:
            error = 1
            print(line)
    if "Predictive Failure Count" in line:
        status = get_status(line)
        status = int(status)
        if status != 0:
            error = 1
            print(line)
    if "Firmware state" in line:
        status = get_status(line)
        if status != 'Online, Spun Up':
            error = 1
            print(line+'\n')
# Send the alert
if error == 1:
    send_warning()
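After that, just run the script on a schedule with crontab. There is plenty of room for improvement, such as recording the ID of the failed disk and which parameters returned unexpected values.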
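One way to approach the improvement mentioned above is to group the `-PDList` output by disk, so that each problem can be reported together with its slot. The following is only a sketch under stated assumptions: it assumes every disk's block in the output starts with an `Enclosure Device ID` line followed by a `Slot Number` line, and the names `parse_pdlist`, `disk_problems`, `report` and `EXPECTED` are hypothetical. Unlike the script above, it treats 0 as the healthy value for every counter; change `EXPECTED` if a disk has a known non-zero baseline.

```python
# Healthy value for each counter (assumption: 0 unless you know a baseline).
EXPECTED = {
    "Media Error Count": 0,
    "Other Error Count": 0,
    "Predictive Failure Count": 0,
}
HEALTHY_STATE = "Online, Spun Up"


def parse_pdlist(lines):
    """Group `-PDList` output lines into one {field: value} dict per disk."""
    disks, current = [], None
    for raw in lines:
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # blank line or a line without "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":  # assumed first field of each disk
            current = {}
            disks.append(current)
        if current is not None:
            current[key] = value
    return disks


def disk_problems(disk):
    """Return a list of 'field = value' strings for every unhealthy field."""
    problems = []
    for name, healthy in EXPECTED.items():
        if name in disk and int(disk[name]) != healthy:
            problems.append("%s = %s" % (name, disk[name]))
    state = disk.get("Firmware state")
    if state is not None and state != HEALTHY_STATE:
        problems.append("Firmware state = %s" % state)
    return problems


def report(lines):
    """Map 'enclosure X, slot Y' to its problems, for disks that have any."""
    result = {}
    for disk in parse_pdlist(lines):
        problems = disk_problems(disk)
        if problems:
            label = "enclosure %s, slot %s" % (
                disk.get("Enclosure Device ID", "?"),
                disk.get("Slot Number", "?"),
            )
            result[label] = problems
    return result
```

Passing the `pdlist` lines from the script to `report()` gives a dictionary that could be appended to the alert text, so the person receiving the message already knows which slot to look at. Because the functions only take a list of lines, they can also be tried against a saved log file, in the same spirit as the commented-out `open('raid_pdlist.log','r')` line in the script.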