
Objective

How to handle the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm when it is raised in an HPE Ezmeral Data Fabric cluster.

Environment


MapR Core

Steps

When the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm is raised in an HPE Ezmeral Data Fabric cluster, it indicates that the Webserver service is not running on one or more nodes. The alarm is raised for each node on which the Webserver service is configured but not running. Handle each alarm occurrence individually.

When the Webserver service encounters a FATAL error or shuts down abruptly, the Warden service attempts to restart it automatically. Warden makes a maximum of three attempts, waits for a configured interval (thirty minutes by default), and then makes three more attempts. The interval can be changed by modifying the services.retryinterval.time.sec parameter in the /opt/mapr/conf/warden.conf file. If the service cannot be started after multiple attempts, the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm is raised and is visible in the MCS and in the output of the command: maprcli alarm list. If Warden successfully restarts the service, the alarm is cleared.
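To check which retry interval is in effect on a node, the parameter can be read from the Warden configuration. The following is a minimal sketch, assuming warden.conf uses plain key=value lines with # comments; the function name read_retry_interval and the sample text are illustrative and not part of any HPE tooling. The fallback of 1800 seconds corresponds to the thirty-minute default described above.

```python
RETRY_KEY = "services.retryinterval.time.sec"


def read_retry_interval(conf_text, default_sec=1800):
    """Return the Warden retry interval in seconds from warden.conf-style text.

    Assumes simple key=value lines; falls back to default_sec (thirty
    minutes) when the key is absent.
    """
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == RETRY_KEY:
            return int(value.strip())
    return default_sec


if __name__ == "__main__":
    # Hypothetical excerpt; on a node, read /opt/mapr/conf/warden.conf instead.
    sample = "# example only\nservices.retryinterval.time.sec=600\n"
    print(read_retry_interval(sample))
```

On a cluster node, the same function could be given the contents of /opt/mapr/conf/warden.conf.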

To troubleshoot the cause of the alarm, use the following steps:
  1. Review the Warden logs ($MAPR_HOME/logs/warden.log) to find when the alarm was raised. For example:
    2017-10-23 16:02:29,625 INFO com.mapr.warden.service.baseservice.Service [webserver_monitor]: 49 about to close zk for service: webserver
    2017-10-23 16:02:29,785 INFO com.mapr.warden.service.baseservice.Service [webserver_monitor]: Alarm raising command: [/opt/mapr/bin/maprcli, alarm, raise, -alarm, NODE_ALARM_SERVICE_WEBSERVER_DOWN, -entity, node1.mapr.prv, -description, Can not determine if service: webserver is running]
  2. Review the Webserver logs ($MAPR_HOME/apiserver/logs/apiserver.log) on the affected node for the same time frame as step 1 to identify whether the service encountered a FATAL error or shut down abruptly. For example:
    *** glibc detected *** java: free(): invalid pointer: 0x00007f3df4000088 ***
    ======= Backtrace: =========
    /lib64/libc.so.6[0x35d3e75366]
    /lib64/security/pam_unix.so(+0x4e14)[0x7f3e616a4e14]
    /lib64/security/pam_unix.so(pam_sm_authenticate+0x1f3)[0x7f3e616a3353]
    /lib64/libpam.so.0[0x35d5a02cee]
    /lib64/libpam.so.0(pam_authenticate+0x40)[0x35d5a02600]
    /opt/mapr/lib/libjpam.so(Java_net_sf_jpam_Pam_authenticate+0x390)[0x7f3e61ed2497]
    [0x7f3e71011f90]
    ======= Memory map: ========
    00400000-00401000 r-xp 00000000 fd:00 795080 /usr/java/jdk1.7.0_13/bin/java
    00600000-00601000 rw-p 00000000 fd:00 795080 /usr/java/jdk1.7.0_13/bin/java
    019dd000-019fe000 rw-p 00000000 00:00 0 [heap]
    cc000000-cd4c0000 rw-p 00000000 00:00 0
    cd4c0000-d1200000 rw-p 00000000 00:00 0
    d1200000-e6020000 rw-p 00000000 00:00 0
    e6020000-f0600000 rw-p 00000000 00:00 0
    f0600000-fad10000 rw-p 00000000 00:00 0
    fad10000-100000000 rw-p 00000000 00:00
  3. If the Webserver shut down abruptly (i.e. the service crashed and had to be restarted), check for any Java core files and hs_err log files in the node's configured cores directory (the default is /opt/cores).
  4. Based on the symptoms identified in steps 2 and 3, review all known Webserver/MCS issues on the Support Portal to determine whether the root cause is a software defect and whether a fix is available.
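Step 1 can be partly automated by scanning warden.log for the line that raises the alarm. The sketch below assumes the log lines follow the format of the example shown in step 1 (a leading timestamp and an "Alarm raising command" entry containing -entity); the function name find_alarm_raises is illustrative only, and other Warden versions may log differently.

```python
import re

ALARM = "NODE_ALARM_SERVICE_WEBSERVER_DOWN"
# Leading timestamp, e.g. "2017-10-23 16:02:29,785 " (milliseconds dropped).
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")
# Node name following "-entity," in the logged maprcli argument list.
ENTITY = re.compile(r"-entity,\s*([^,\]]+)")


def find_alarm_raises(lines):
    """Return (timestamp, entity) tuples for lines that raise the Webserver alarm."""
    events = []
    for line in lines:
        if "Alarm raising command" not in line or ALARM not in line:
            continue
        ts = TIMESTAMP.match(line)
        ent = ENTITY.search(line)
        events.append((ts.group(1) if ts else None,
                       ent.group(1).strip() if ent else None))
    return events


if __name__ == "__main__":
    # Sample line copied from the Warden log excerpt in step 1.
    sample = [
        "2017-10-23 16:02:29,785 INFO com.mapr.warden.service.baseservice.Service "
        "[webserver_monitor]: Alarm raising command: [/opt/mapr/bin/maprcli, alarm, "
        "raise, -alarm, NODE_ALARM_SERVICE_WEBSERVER_DOWN, -entity, node1.mapr.prv, "
        "-description, Can not determine if service: webserver is running]"
    ]
    for when, node in find_alarm_raises(sample):
        print(when, node)
```

The timestamps returned give the time window to look at in apiserver.log for step 2. To scan a real log, pass an open file object for $MAPR_HOME/logs/warden.log in place of the sample list.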

If the root cause of the alarm cannot be identified from the above logs and diagnostics, or if the Webserver service appears to be down due to a software defect and a fix is needed, open a support case with the HPE Ezmeral Data Fabric Support team via the Support Portal. Provide all logs and diagnostic data collected in the above steps.
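Before opening a case, it can help to list the crash artifacts mentioned in step 3 so they can be attached. This is a minimal sketch, assuming the default cores directory /opt/cores and assuming that JVM crash logs start with hs_err and core files start with core; the actual core file naming depends on the node's kernel core pattern, and list_crash_artifacts is a hypothetical helper name.

```python
from pathlib import Path


def list_crash_artifacts(cores_dir="/opt/cores"):
    """Return hs_err logs and core files in cores_dir, newest first.

    The name patterns are assumptions; adjust them to the node's
    configured core file naming.
    """
    root = Path(cores_dir)
    if not root.is_dir():
        return []
    found = []
    for pattern in ("hs_err*", "core*"):
        found.extend(p for p in root.glob(pattern) if p.is_file())
    # Newest files first, since they are most likely tied to the latest crash.
    return sorted(found, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in list_crash_artifacts():
        print(path)
```

If the cores directory has been changed from the default, pass the configured path as the argument.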

https://docs.datafabric.hpe.com/62/ReferenceGuide/NodeAlarms-WebServerServiceAlarm.html