
Objective

How to handle the NODE_ALARM_SERVICE_NFS_DOWN alarm when it is raised in an HPE Ezmeral Data Fabric cluster.

Environment


MapR Core

Steps

When the NODE_ALARM_SERVICE_NFS_DOWN alarm is raised in an HPE Ezmeral Data Fabric cluster, it indicates that the NFS service is not running on one or more nodes. The alarm is raised for each node on which the NFS service is configured but not running, and each alarm occurrence should be handled individually.
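
To identify the nodes for which the alarm is currently raised, the alarm and service listings can be queried from any cluster node. A minimal sketch (output formatting may vary by release):

    # List current occurrences of the NFS-down alarm across the cluster
    maprcli alarm list -alarm NODE_ALARM_SERVICE_NFS_DOWN
    # Compare configured services (csvc) against running services (svc) per node
    maprcli node list -columns hostname,svc,csvc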

When the NFS service encounters a FATAL error or shuts down abruptly, the Warden service attempts to restart it automatically. Warden makes up to three restart attempts, waits for a configured interval (thirty minutes by default), and then tries up to three more times. The wait interval is controlled by the services.retryinterval.time.sec parameter in the /opt/mapr/conf/warden.conf file. If the service cannot be started after these attempts, the NODE_ALARM_SERVICE_NFS_DOWN alarm is raised; it is visible in the MCS and in the output of the maprcli alarm list command. If Warden is able to restart the service successfully, the alarm is cleared.
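
To check the current retry interval and the alarm state on an affected node, commands along the following lines can be used; a minimal sketch assuming the default warden.conf location (node1.mapr.prv is a placeholder hostname). Note that a change to warden.conf takes effect only after Warden is restarted on that node.

    # Show the configured retry interval (default 1800 seconds = 30 minutes)
    grep services.retryinterval.time.sec /opt/mapr/conf/warden.conf
    # List the alarms currently raised against this node
    maprcli alarm list -entity node1.mapr.prv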

To troubleshoot the cause of the alarm, use the following steps:
  1. Review the Warden log ($MAPR_HOME/logs/warden.log) to find when the alarm was raised (a command sketch covering this and the later steps follows the list). For example:
    2014-11-12 10:34:51,621 ERROR com.mapr.warden.service.baseservice.Service$ServiceMonitorRun run [nfs_monitor]: Monitor command: [/etc/init.d/mapr-nfs, status]cannot determine if service: nfs is running. Number of retrials exceeded. Closing Zookeeper
    2014-11-12 10:34:51,625 INFO com.mapr.warden.service.baseservice.Service [nfs_monitor]: 49 about to close zk for service: nfs
    2014-11-12 10:34:51,785 INFO com.mapr.warden.service.baseservice.Service [nfs_monitor]: Alarm raising command: [/opt/mapr/bin/maprcli, alarm, raise, -alarm, NODE_ALARM_SERVICE_NFS_DOWN, -entity, node1.mapr.prv, -description, Can not determine if service: nfs is running
  2. Review the NFS server log ($MAPR_HOME/logs/nfsserver.log) on the affected node in the same time frame identified in step 1 to determine whether the NFS service encountered a FATAL error or shut down abruptly. For example:
    2014-11-12 10:34:45,9602 INFO nfsserver[1821] fs/nfsd/main.cc:535 ***** NFS server starting: pid=1821, mapr-version: 4.0.1.27334.GA *****
    2014-11-12 10:34:45,9603 INFO nfsserver[1821] fs/nfsd/main.cc:549 ******* NFS server MAPR_HOME=/opt/mapr, NFS_PORT=2049, NFS_MGMT_PORT=9998, NFSMON_PORT=9997
    2014-11-12 10:34:45,9735 INFO nfsserver[1821] fs/nfsd/mount.cc:2147 Export info: /mapr (rw)
    2014-11-12 10:34:45,9798 INFO nfsserver[1821] fs/nfsd/mount.cc:1781 CLDB info: node1:7222 node22:7222
    2014-11-12 10:34:46,3558 INFO nfsserver[1821] fs/nfsd/nfsha.cc:476 hostname: node1.mapr.prv, hostid: 0xa81a38e0d7b6068
    2014-11-12 10:34:46,3623 INFO nfsserver[1821] fs/nfsd/requesthandle.cc:468 found NFS_HEAPSIZE env var: 236
    2014-11-12 10:34:46,1152 INFO nfsserver[1821] fs/nfsd/main.cc:643 NFS server started ... pid=1821, uid=2147483632
    2014-11-12 10:34:45,9720 INFO nfsserver[1821] fs/nfsd/nfsserver.cc:927 0.0.0.0[0] running the cmd /opt/mapr/server/maprexecute pmapset set 100003 3 6 2049, ret 0
    2014-11-12 10:34:45,9733 INFO nfsserver[1821] fs/nfsd/nfsserver.cc:971 0.0.0.0[0] Use32BitFileId is 1
    2014-11-12 10:34:45,9734 INFO nfsserver[1821] fs/nfsd/nfsserver.cc:984 0.0.0.0[0] AutoRefreshExportsTimeInterval is 0
    2014-11-12 10:34:45,9735 INFO nfsserver[1821] fs/nfsd/mount.cc:2177 0.0.0.0[0] Allocating export entry 2651400
    2014-11-12 10:34:45,9808 INFO nfsserver[1821] fs/nfsd/mount.cc:1858 0.0.0.0[0] Allocating export entry 26513b0
    2014-11-12 10:34:45,9950 INFO nfsserver[1821] fs/nfsd/nfsserver.cc:927 0.0.0.0[0] running the cmd /opt/mapr/server/maprexecute pmapset set 100005 3 6 2049, ret 0
    2014-11-12 10:34:46,0061 INFO nfsserver[1821] fs/nfsd/nfsserver.cc:927 0.0.0.0[0] running the cmd /opt/mapr/server/maprexecute pmapset set 100005 1 6 2049, ret 0
    2014-11-12 10:34:46,0189 INFO nfsserver[1821] fs/nfsd/nfsserver.cc:927 0.0.0.0[0] running the cmd /opt/mapr/server/maprexecute pmapset set 100005 3 17 2049, ret 0
    2014-11-12 10:34:46,0341 INFO nfsserver[1821] fs/nfsd/nfsserver.cc:927 0.0.0.0[0] running the cmd /opt/mapr/server/maprexecute pmapset set 100005 1 17 2049, ret 0
    2014-11-12 10:34:46,0353 INFO nfsserver[1821] fs/nfsd/mount.cc:1191 0.0.0.0[0] Setting slash-mapr-clusterid clustername my.cluster.com, id 1012313856
    2014-11-12 10:34:46,0519 INFO nfsserver[1821] fs/nfsd/requesthandle.cc:335 0.0.0.0[0] using /etc/mtab to check ramfs mount
    2014-11-12 10:34:46,1160 ERROR nfsserver[1821] fs/nfsd/nfsha.cc:847 0.0.0.0[0] Error registering with CLDB: Read-only file system, err=0, status=30 cldb=host1:7222
    2014-11-12 10:34:51,1172 INFO nfsserver[1821] fs/nfsd/nfsha.cc:476 hostname: node1.mapr.prv, hostid: 0xa81a38e0d7b6068
    2014-11-12 10:34:51,1183 INFO nfsserver[1821] fs/nfsd/nfsha.cc:957 exiting: No license to run NFS server in servermode
  3. Based on the symptoms identified in step 2, review the following Knowledge Article to see if the issue can be resolved: How to troubleshoot and resolve issues starting the MapR NFS server
  4. If the NFS service stopped abruptly (that is, the service crashed and had to be restarted), check for any core files prefixed with 'nfs' under the node's configured cores directory (the default cores directory is /opt/cores).
  5. Based on the symptoms identified in steps 2 and 3, review all known NFS issues on the Support Portal to determine if the root cause is a software defect and whether there is a fix available.
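
The log and core-file checks in steps 1, 2, and 4 can be run roughly as follows; a minimal sketch assuming the default log and cores locations (MAPR_HOME defaults to /opt/mapr; adjust the paths if your installation differs):

    MAPR_HOME=${MAPR_HOME:-/opt/mapr}
    # Step 1: find when Warden raised the alarm
    grep NODE_ALARM_SERVICE_NFS_DOWN $MAPR_HOME/logs/warden.log
    # Step 2: look for FATAL or ERROR messages in the NFS server log around that time
    grep -iE 'FATAL|ERROR' $MAPR_HOME/logs/nfsserver.log | tail -50
    # Step 4: check for NFS core files in the configured cores directory (default /opt/cores)
    ls -l /opt/cores/ | grep -i nfs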

If the root cause of the alarm cannot be identified from the above logs and diagnostics, or if the NFS service appears to be down due to a software defect that requires a fix, open a support case with the HPE Ezmeral Data Fabric Support team via the Support Portal and provide all logs and diagnostic data collected in the above steps.
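
When opening the case, node-level logs and diagnostics can typically be gathered with the support dump utility shipped with the product; a minimal sketch, assuming the script is present at its default location (options and the default output directory can vary by release):

    # Collect logs and diagnostic data from this node for attachment to the support case
    sudo /opt/mapr/support/tools/mapr-support-dump.sh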

https://docs.datafabric.hpe.com/62/ReferenceGuide/NodeAlarms-NFSGatewayAlarm.html