
Objective

How to handle the VOLUME_ALARM_DATA_UNDER_REPLICATED alarm when it is raised in an HPE Ezmeral Data Fabric cluster.

Environment


MapR Core

Steps

When the VOLUME_ALARM_DATA_UNDER_REPLICATED alarm is raised in an HPE Ezmeral Data Fabric cluster, it indicates that a volume's current replication factor is lower than the desired replication factor set in the volume properties. This can be caused by failing disks or nodes, or by the cluster running out of storage space. The alarm is visible in the MapR Control System (MCS) and in the output of the command: maprcli alarm list (see the examples below).
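For reference, a minimal sketch of checking the alarm and an affected volume from the command line; the volume name (users, taken from the log example in step 4 below) is illustrative:

    # List the active volume alarms across the cluster
    maprcli alarm list -type volume

    # Compare the desired and actual replication settings of an affected volume
    maprcli volume info -name users -json | grep -i repl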

To troubleshoot the cause of the alarm, use the following steps:
  1. Investigate any failing nodes. You can see which nodes have failed by looking at the Node Health pane on the Overview page in MCS, or from the command line (see the examples after this list).
  2. Check for any failed disks or offline storage pools (SPs) on all cluster nodes.
  3. Determine whether the failed node can be brought back online; if not, add disks or nodes to the cluster to restore capacity.
  4. Review the CLDB logs ($MAPR_HOME/logs/cldb.log) on the primary (master) CLDB node to identify when the CLDB detected the under-replicated volumes and which containers are affected (see the example after this list for locating this node). For example:
    2018-08-02 03:02:34,515 WARN Alarms [RScan]: composeEmailMessage: Alarm raised: VOLUME_ALARM_DATA_UNDER_REPLICATED:1:VOLUME_ALARM; Cluster: my.cluster.com; Volume: mapr.cldb.internal; Message: Volume desired replication is 3, current replication is 2
    2018-08-02 03:02:34,515 INFO VolumeInfoInMemory [RScan]: Volume: mapr.cldb.internal, under-replicated containers: 1
    2018-08-02 03:02:34,516 WARN Alarms [RScan]: composeEmailMessage: Alarm raised: VOLUME_ALARM_DATA_UNDER_REPLICATED:199560036:VOLUME_ALARM; Cluster: my.cluster.com; Volume: users; Message: Volume desired replication is 3, current replication is 2
    2018-08-02 03:02:34,516 INFO VolumeInfoInMemory [RScan]: Volume: users, under-replicated containers: 2064 2391 2392 2393 2394 2395 2506 2507 2508 2509 2510
    2018-08-02 03:02:34,517 WARN Alarms [RScan]: composeEmailMessage: Alarm raised: VOLUME_ALARM_DATA_UNDER_REPLICATED:13930177:VOLUME_ALARM; Cluster: my.cluster.com; Volume: mapr.monitoring; Message: Volume desired replication is 3, current replication is 2
    2018-08-02 03:02:34,517 INFO VolumeInfoInMemory [RScan]: Volume: mapr.monitoring, under-replicated containers: 2066 2240 2241 2242 2243 2244 2370 2371 2372 2373 2374
    2018-08-02 03:02:34,517 WARN Alarms [RScan]: composeEmailMessage: Alarm raised: VOLUME_ALARM_DATA_UNDER_REPLICATED:228927984:VOLUME_ALARM; Cluster: my.cluster.com; Volume: mapr.monitoring.streams; Message: Volume desired replication is 3, current replication is 2
    2018-08-02 03:02:34,517 INFO VolumeInfoInMemory [RScan]: Volume: mapr.monitoring.streams, under-replicated containers: 2067 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313
    2018-08-02 03:02:34,518 WARN Alarms [RScan]: composeEmailMessage: Alarm raised: VOLUME_ALARM_DATA_UNDER_REPLICATED:252002128:VOLUME_ALARM; Cluster: my.cluster.com; Volume: mapr.vm75-180.support.mapr.com.local.logs; Message: Volume desired replication is 2, current replication is 1
    2018-08-02 03:02:34,518 INFO VolumeInfoInMemory [RScan]: Volume: mapr.vm75-180.support.mapr.com.local.logs, under-replicated containers: 2058
  5. Review the MFS logs ($MAPR_HOME/logs/mfs.log-3) on the nodes hosting replicas of the under-replicated containers to determine when and why the replicas were lost. Identify those nodes using the command: maprcli dump containerinfo -ids <container id> -json (see the example after this list). For example:
    2018-08-02 03:01:09,0319 INFO Replication replicate.cc:4236 Replica health update: replica 10.10.75.179:5660 CID:2118 Master Min VN:4231987, ReplicaVN on Disk:4231988 Pending RPCs:0
    2018-08-02 03:01:20,8271 ERROR Replication nodefailure.cc:412 Op failed with Connection reset by peer (104) on replica FSID 1128005550106435436 10.10.75.180:5660 for operation of type 39 and version 33121799 on container 2320
    2018-08-02 03:01:20,8286 ERROR Replication nodefailure.cc:412 Op failed with Connection reset by peer (104) on replica FSID 1128005550106435436 10.10.75.180:5660 for operation of type 15 and version 22015178 on container 1
    2018-08-02 03:01:20,8333 INFO Replication nodefailure.cc:568 Removing replica FSID 1128005550106435436 10.10.75.180:5660 for container (1).
    2018-08-02 03:01:20,8375 INFO Replication nodefailure.cc:568 Removing replica FSID 1128005550106435436 10.10.75.180:5660 for container (2320).
    2018-08-02 03:01:49,9258 INFO Replication nodefailure.cc:1279 Container 2055, CLDB asked to become master BM, ifClean=1
    2018-08-02 03:01:49,9258 INFO Replication nodefailure.cc:1282 FSBecomeMaster has voltype: 0, volume type from cldb: 0
    2018-08-02 03:01:49,9441 INFO Replication nodefailure.cc:1652 BM Become master completed successfully for container 2055 at txn:12583149-12583149, write:12583134-12583134, snap:0-0
    2018-08-02 03:01:50,9279 INFO Replication nodefailure.cc:1279 Container 2058, CLDB asked to become master BM, ifClean=1
  6. Based on the symptoms identified above, review the known DATA_UNDER_REPLICATED alarm issues on the Support Portal to determine whether the root cause is a software defect and whether a fix or workaround is available.
  7. This alarm is generally raised when the nodes that store the volumes or their replicas have not sent a heartbeat for five minutes. To avoid unnecessary re-replication during normal maintenance, the HPE Ezmeral Data Fabric software waits a configurable interval (one hour by default) before considering a node dead and re-replicating its data. You can control this interval by setting the cldb.fs.mark.rereplicate.sec parameter with the command maprcli config save (see the example after this list).
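For steps 1 and 2, node health and disk status can also be checked from the command line; the hostname below is a placeholder:

    # List all nodes with their running services and health status (0 = healthy)
    maprcli node list -columns hostname,svc,health

    # List the disks and storage pools on a given node and check for failures
    maprcli disk list -host <hostname>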
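For step 4, the node currently acting as the CLDB master (primary) can be identified with:

    # Report which node is currently hosting the CLDB master
    maprcli node cldbmaster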
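For step 5, a sketch of finding the nodes that host the replicas of an under-replicated container; container ID 2320 is taken from the MFS log example above:

    # Show the replica locations and the master replica for a container
    maprcli dump containerinfo -ids 2320 -json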
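For step 7, a minimal sketch of raising the re-replication delay to two hours; the value 7200 (seconds) is illustrative:

    # Wait 7200 seconds before declaring a node dead and re-replicating its data
    maprcli config save -values '{"cldb.fs.mark.rereplicate.sec":"7200"}'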

If the root cause of the alarm cannot be identified from the above logs and diagnostics, or the alarm appears to be caused by a software defect that requires a fix, open a support case with the HPE Ezmeral Data Fabric Support team via the Support Portal and attach all logs and diagnostic data collected in the steps above.

https://docs.datafabric.hpe.com/62/ReferenceGuide/VolumeAlarms-DataUnderReplicated.html