When the NODE_ALARM_SERVICE_HIVEMETA_DOWN alarm is raised in an HPE Ezmeral Data Fabric cluster, it indicates that the Hive Metastore service is not running on one or more nodes. The alarm is raised for each node on which the Hive Metastore service is configured but not running. Handle each alarm occurrence individually.
When the Hive Metastore service encounters a FATAL error or shuts down abruptly, the Warden service attempts to restart it automatically. Warden makes up to three attempts, waits for a configured interval (thirty minutes by default), and then makes three more attempts. The interval can be changed by modifying the services.retryinterval.time.sec parameter in the /opt/mapr/conf/warden.conf file. If the service still cannot be started after these attempts, the NODE_ALARM_SERVICE_HIVEMETA_DOWN alarm is raised; it is visible in the MCS and in the output of the maprcli alarm list command. If Warden is able to restart the service successfully, the alarm is cleared.
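As a quick check of the retry interval in effect on a node, the parameter can be read from the Warden configuration. The following is a minimal sketch, assuming warden.conf uses plain key=value lines with '#' comments; the function name read_retry_interval and the sample excerpt are hypothetical and only illustrate the idea.

```python
DEFAULT_RETRY_INTERVAL_SEC = 30 * 60  # thirty-minute default described above


def read_retry_interval(conf_text: str) -> int:
    """Return services.retryinterval.time.sec from warden.conf-style text.

    Assumes key=value lines; falls back to the default if the key is absent.
    """
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "services.retryinterval.time.sec":
            return int(value.strip())
    return DEFAULT_RETRY_INTERVAL_SEC


# Hypothetical excerpt, not copied from a real cluster.
sample = """
# retry settings
services.retryinterval.time.sec=900
"""
print(read_retry_interval(sample))  # prints 900
```

On a real node, the text would come from reading /opt/mapr/conf/warden.conf instead of the inline sample.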
To troubleshoot the cause of the alarm, use the following steps:
- Review the Warden logs ($MAPR_HOME/logs/warden.log) to find when the alarm was raised. For example:
2018-02-10 11:16:29,909 INFO com.mapr.warden.service.baseservice.Service [main-EventThread]: ZK is closed for service: hivemeta
2018-02-10 11:16:29,915 INFO com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [hivemeta_monitor]: [e_SERV_FAIL, hostName, ma_host, ma_process]
2018-02-10 11:16:29,915 INFO com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [hivemeta_monitor]: []
2018-02-10 11:16:29,915 INFO com.mapr.warden.service.baseservice.Service [hivemeta_monitor]: Alarm raising command: [/opt/mapr/bin/maprcli, alarm, raise, -alarm, NODE_ALARM_SERVICE_HIVEMETA_DOWN, -entity, local.novalocal, -description, Can not determine if service: hivemeta is running. Check logs at: /opt/mapr/hive/hive-2.1/logs/mapr]
- Review the Hive Metastore logs ($MAPR_HOME/hive/hive-<version>/logs/<MAPR_USER>/<user>-metastore-<hostname>.log) on the affected node for the time frame identified in step 1 to determine whether the Hive Metastore service encountered a FATAL error or shut down abruptly. For example:
[main]: Metastore Thrift Server threw an exception...
javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:postgresql://host:7432/hive, username = hive. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
org.postgresql.util.PSQLException: FATAL: password authentication failed for user "hive"
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:291)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:108)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:66)
at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:125)
at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30)
at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:22)
at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(AbstractJdbc4Connection.java:30)
at org.postgresql.jdbc4.Jdbc4Connection.<init>(Jdbc4Connection.java:24)
at org.postgresql.Driver.makeConnection(Driver.java:393)
at org.postgresql.Driver.connect(Driver.java:267)
- If the Hive Metastore service stopped abruptly (i.e., the service crashed and had to be restarted), check for any Java core files and hs_err log files under the node's configured cores directory (the default cores directory is /opt/cores).
- Based on the symptoms identified in steps 2 and 3, review all known Hive Metastore issues on the Support Portal to determine if the root cause is a software defect and whether there is a fix available.
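Step 1 can be partly automated by scanning warden.log for the "Alarm raising command" entries. The sketch below assumes the log lines follow the format of the excerpt shown in step 1; the regular expression and the function name find_alarm_events are illustrative, not part of any HPE tooling.

```python
import re
from datetime import datetime

# Matches lines shaped like the warden.log excerpt in step 1 (assumed format).
ALARM_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} .*"
    r"Alarm raising command: \[.*?-alarm, (?P<alarm>[A-Z_]+), "
    r"-entity, (?P<entity>[^,\]]+)"
)


def find_alarm_events(lines, alarm_name="NODE_ALARM_SERVICE_HIVEMETA_DOWN"):
    """Return (timestamp, entity) pairs for each time the alarm was raised."""
    events = []
    for line in lines:
        match = ALARM_LINE.search(line)
        if match and match.group("alarm") == alarm_name:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, match.group("entity")))
    return events


# Sample line taken from the warden.log excerpt in step 1.
sample_lines = [
    "2018-02-10 11:16:29,915 INFO com.mapr.warden.service.baseservice.Service "
    "[hivemeta_monitor]: Alarm raising command: [/opt/mapr/bin/maprcli, alarm, "
    "raise, -alarm, NODE_ALARM_SERVICE_HIVEMETA_DOWN, -entity, local.novalocal, "
    "-description, Can not determine if service: hivemeta is running. "
    "Check logs at: /opt/mapr/hive/hive-2.1/logs/mapr]",
]
for ts, entity in find_alarm_events(sample_lines):
    print(ts.isoformat(sep=" "), entity)  # prints 2018-02-10 11:16:29 local.novalocal
```

The returned timestamps give the time frame to look at in the Hive Metastore logs for step 2.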
If the root cause of the alarm cannot be identified from the above logs and diagnostics, or if the Hive Metastore service appears to be down due to a software defect that requires a fix, open a support case with the HPE Ezmeral Data Fabric Support team via the Support Portal and provide all logs and diagnostic data collected in the above steps.
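Before opening a case, it can help to gather the crash artifacts from step 3 into a single archive. The following is a rough sketch, assuming that crash files sit directly in the cores directory and have names beginning with hs_err or core; these name patterns, the helper names, and the archive name are assumptions, and the demo runs against a temporary directory rather than /opt/cores. Core files can be very large, so check available disk space before archiving them on a real node.

```python
import tarfile
import tempfile
from pathlib import Path


def find_crash_files(cores_dir):
    """List files in cores_dir whose names start with 'hs_err' or 'core'."""
    root = Path(cores_dir)
    found = set()
    for pattern in ("hs_err*", "core*"):  # assumed naming patterns
        found.update(p for p in root.glob(pattern) if p.is_file())
    return sorted(found)


def bundle_files(paths, archive_path):
    """Write the given files into a gzip-compressed tar archive."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in paths:
            tar.add(str(path), arcname=Path(path).name)
    return archive_path


# Demo against a throwaway directory standing in for the cores directory.
with tempfile.TemporaryDirectory() as tmp:
    cores = Path(tmp) / "cores"
    cores.mkdir()
    for name in ("hs_err_pid1234.log", "core.1234", "notes.txt"):
        (cores / name).write_text("placeholder")
    crash_files = find_crash_files(cores)
    archive = bundle_files(crash_files, Path(tmp) / "hivemeta-diagnostics.tar.gz")
    with tarfile.open(str(archive)) as tar:
        archived_names = sorted(tar.getnames())
print(archived_names)  # prints ['core.1234', 'hs_err_pid1234.log']
```

The Warden and Hive Metastore logs reviewed in steps 1 and 2 could be added to the same archive by passing their paths to bundle_files.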
https://docs.datafabric.hpe.com/62/ReferenceGuide/NodeAlarms-HivemetaAlarm.html