A small set of scripts for monitoring EMC VNX storage (usable with Zabbix)

Project: https://github.com/zhangrj/EMC-VNX-Storage-Zabbix-Monitor

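Background

The EMC VNX5500 is the company's most critical storage device; if it fails, the whole platform goes down. Before I joined, inspection of the EMC storage relied entirely on manual checks, done remotely and by on-site maintenance contractors. In May of this year I started working on this problem.

The first monitoring method that came to mind was, of course, SNMP/SNMPTRAP. Unfortunately, after most of a day of searching I could find neither a place to configure SNMP or SNMPTRAP nor a MIB reference for the device. While reading related material I came across Navisphere, a command-line management tool for configuring the storage device. It can also report the storage state, so a little scripting around it, combined with Zabbix, is enough to implement monitoring.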

Installing the Navisphere command-line tool

Package: https://github.com/zhangrj/EMC-VNX-Storage-Zabbix-Monitor/blob/master/安装软件/NaviCLI-Linux-64-x86-en_US-7.33.9.2.36-1.x86_64.rpm

Install it with rpm as usual:

[root@localhost ~]# rpm -ivh NaviCLI-Linux-64-x86-en_US-7.33.9.2.36-1.x86_64.rpm 
Preparing...                ########################################### [100%]
   1:NaviCLI-Linux-64-x86-en########################################### [100%]
Run the script /opt/Navisphere/bin/setlevel_cli.sh  to set the security level before you proceed.

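Set the security level as prompted. The prompt lists low|medium|l|m as choices; in the session below, 2 was entered and the script applied its default, medium, which is sufficient.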

[root@localhost ~]# /opt/Navisphere/bin/setlevel_cli.sh
Please enter the verifying level(low|medium|l|m) to set? 
2
Setting (default) medium verifying level.....

Verification level medium has been set SUCCESSFULLY!!!

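Create a security file so that the username and password do not have to be entered on every call. The security file is encrypted and bound to the local machine, and it is written to the current user's home directory (/root here). -user is the EMC management username, -password is its password, and -scope takes one of 0 (global), 1 (local), or 2 (LDAP):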

[root@localhost ~]# cd /opt/Navisphere/bin/

[root@localhost bin]# ls
admsnap  naviseccli  setlevel_cli.sh  setlevel.log

[root@localhost bin]# ./naviseccli -AddUserSecurity -user emc_username -password emc_passwd -scope 0

[root@localhost bin]# cd /root

[root@localhost ~]# ls
SecuredCLISecurityFile.xml
SecuredCLIXMLEncrypted.key

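The first query asks whether to trust the array's certificate. Choose 2 to accept and store it; subsequent commands then print their output directly: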

[root@localhost ~]# cd /opt/Navisphere/bin/

[root@localhost bin]# ls
admsnap  naviseccli  setlevel_cli.sh  setlevel.log

[root@localhost bin]# ./naviseccli -h 192.168.130.75 getcrus
Unable to validate the identity of the server.  There are issues with the certificate presented.
Only import this certificate if you have reason to believe it was sent by a trusted source.
Certificate details:
Subject:        CN=192.168.130.75,CN=A-IMAGE,C=US,ST=Massachusetts,L=Southboro,O=EMC Corporation,OU=CLARiiON
Issuer: CN=192.168.130.75,CN=A-IMAGE,C=US,ST=Massachusetts,L=Southboro,O=EMC Corporation,OU=CLARiiON
Serial#:        fe91a4ec
Valid From:     20121126045806Z
Valid To:       20271123045806Z
Would you like to [1]Accept the certificate for this session, [2] Accept and store, [3] Reject the certificate?
Please input your selection(The default selection is [1]):
2
DPE7 Bus 0 Enclosure 0       
SP A State:                 Present
SP B State:                 Present
......

[root@localhost bin]# ./naviseccli -h 192.168.130.75 getcrus
DPE7 Bus 0 Enclosure 0       
SP A State:                 Present
SP B State:                 Present
......
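The `getcrus` output above is essentially a list of `<component> State: <value>` lines, which is easy to process from a script. The following is only a minimal sketch of that idea, not the code used in the project: it assumes Python 3.7+, assumes the install path shown above, and the function names `run_getcrus` and `parse_states` are made up for illustration. It only handles lines of the form shown in the sample output.

```python
import subprocess

# Install path taken from the rpm output above; adjust if yours differs.
NAVISECCLI = "/opt/Navisphere/bin/naviseccli"


def run_getcrus(host):
    """Run `naviseccli -h <host> getcrus` and return its stdout as text.

    Relies on the security file and stored certificate set up earlier,
    so no credentials are passed here.
    """
    result = subprocess.run(
        [NAVISECCLI, "-h", host, "getcrus"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_states(output):
    """Collect every '<name> State: <value>' line into a {name: value} dict.

    Lines without ' State:' (such as the enclosure header) are skipped.
    """
    states = {}
    for line in output.splitlines():
        name, sep, value = line.partition(" State:")
        if sep:
            states[name.strip()] = value.strip()
    return states


# Sample taken from the session above (truncated in the original).
sample = """DPE7 Bus 0 Enclosure 0
SP A State:                 Present
SP B State:                 Present
"""
print(parse_states(sample))
```

On a machine with the CLI installed, `parse_states(run_getcrus("192.168.130.75"))` would feed the live output through the same parser.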

List the stored certificates:

[root@localhost ~]# /opt/Navisphere/bin/naviseccli security -certificate -list
--------------------------------------------
Subject:        CN=192.168.130.75,CN=A-IMAGE,C=US,ST=Massachusetts,L=Southboro,O=EMC Corporation,OU=CLARiiON
Issuer: CN=192.168.130.75,CN=A-IMAGE,C=US,ST=Massachusetts,L=Southboro,O=EMC Corporation,OU=CLARiiON
Serial#:        fe91a4ec
Valid From:     20121126045806Z
Valid To:       20271123045806Z
--------------------------------------------

Common NaviSecCLI commands

Show the state of each component in the system:

     naviseccli -h <ip> getcrus

Show which SP is the default owner and which is the current owner of a LUN:

     naviseccli -h <ip> getlun -default -owner

Show a given number of SP log entries (for example, 200):

     naviseccli -h <ip> getlog -200

Or save the output to a local file:

     naviseccli -h <ip> getlog -200 > getlog_spa.txt

Check the SP Agent status:

     naviseccli -h <ip> getagent

Show host LUN and array LUN information:

     naviseccli -h <ip> storagegroup -list

Show basic RAID Group information:

     naviseccli -h <ip> getrg 0

Show disk information:

     naviseccli -h <ip> getdisk

     naviseccli -h <ip> getdisk 0_0_5

Find the LUNs that have dirty cache:

     naviseccli -h <ip> getlun -luncache

Show rebuild progress:

     naviseccli -h <ip> getlun [lun] -prb

Collect SPCollects logs:

     naviseccli -h <ip> spcollect

     naviseccli -h <ip> managefiles -retrieve

List the HBAs that are logged in to the system:

     naviseccli -h <ip> port -list

List the part numbers of the components:

     naviseccli -h <ip> getresume

Show whether the cache is enabled and how it is configured:

     naviseccli -h <ip> getcache

List the enabled system feature packages:

     naviseccli -h <ip> ndu -list

Trespass a LUN:

     naviseccli -h <ip> trespass <lun> 

Start a background sniffer check:

     naviseccli -h <ip> setsniffer <lun> -bv -bvtime high -cr

Get the sniffer report:

     naviseccli -h <ip> getsniffer <lun>

The monitoring scripts and how to use them

emc_discovery.py builds the JSON data used for low-level discovery in Zabbix. It can discover CPU, DIMM, Disk, I/O, LCC, Power, SP, SPS, and SPS Cable components.
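For readers unfamiliar with the discovery payload, the sketch below shows the classic Zabbix low-level discovery shape: a `data` array with one object per discovered entity, keyed by an LLD macro. It is an illustration only and is not taken from emc_discovery.py; the helper name `build_lld` and the macro name `{#COMPONENT}` are assumptions, and the real macro names must match whatever the imported template expects.

```python
import json


def build_lld(macro, names):
    """Return a Zabbix low-level discovery JSON string.

    `macro` is an LLD macro such as "{#COMPONENT}" (hypothetical name);
    `names` is the list of discovered component names.
    """
    return json.dumps({"data": [{macro: name} for name in names]})


# Component names as they appear in the getcrus output.
print(build_lld("{#COMPONENT}", ["SP A", "SP B"]))
```

The resulting string is what gets sent as the value of the discovery rule's item key.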

emc_state.py collects the data for the monitored items.

Please note:

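  • All data is passed to zabbix_server through zabbix_sender;
  • The EMC storage address and the zabbix_server address in the scripts need to be changed to your own;
  • The two scripts may not work as-is on EMC storage with a different configuration, but the basic approach and data handling are the same, so you can adapt them to your own storage configuration;
  • I have been too busy with other work to polish the scripts (for example, making discovery more general or breaking the processing into functions), so they are rough for now.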
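As a rough illustration of the zabbix_sender hand-off mentioned in the notes, the sketch below builds one zabbix_sender call per value using its standard options: `-z` (Zabbix server), `-s` (host name as registered in the frontend), `-k` (item key), and `-o` (value). This is not the project's code. The server address, host name, and item key are placeholders, and it assumes `zabbix_sender` is on the PATH.

```python
import subprocess

ZABBIX_SERVER = "192.0.2.10"   # placeholder: your zabbix_server address
ZABBIX_HOST = "EMC-VNX5500"    # placeholder: host name created in Zabbix


def sender_command(server, host, key, value):
    """Build the argv list for a single-value zabbix_sender call."""
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", value]


def send(key, value):
    """Send one value; raises CalledProcessError if zabbix_sender fails."""
    subprocess.run(
        sender_command(ZABBIX_SERVER, ZABBIX_HOST, key, value), check=True
    )


# The item key below is hypothetical; use the keys defined in the template.
print(sender_command(ZABBIX_SERVER, ZABBIX_HOST, "emc.state[SP_A]", "Present"))
```

Calling zabbix_sender once per value keeps the example simple; for many items, its `-i` input-file mode sends them in one batch.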

Two crontab entries are enough, for example:

0 23 * * 6 /usr/bin/python /root/EMC/emc_discovery.py > /tmp/emc_discovery.log
5 * * * * /usr/bin/python /root/EMC/emc_state.py > /tmp/emc_state.log

Discovery runs once a week, on Saturday at 23:00; item data is collected once an hour, at minute 5.

Zabbix web configuration

Import the template: https://github.com/zhangrj/EMC-VNX-Storage-Zabbix-Monitor/blob/master/Zabbix模板/zbx_EMC_VNX_templates.xml

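Create a new host. Its host name must match the value passed to zabbix_sender's -s (host) option in the scripts; the -z option is the zabbix_server address.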

Run the scripts once by hand and check that the monitoring data is refreshed.