
HPE MSA 2050 SAN Storage - Troubleshooting

USB CLI port connection

These procedures are intended to be used only during initial configuration, for the purpose of verifying that hardware setup is successful. They are not intended to be used as troubleshooting procedures for configured systems using production data and I/O.

MSA 2050 controllers feature a CLI port employing a mini-USB Type B form factor. If you encounter problems communicating with the port after cabling your computer to the USB device, you may need to either download a device driver (Windows) or set appropriate parameters via an operating system command (Linux).


Fault isolation methodology

MSA 2050 controllers provide many ways to isolate faults. This section presents the basic methodology used to locate faults within a storage system and to identify the affected Field Replaceable Units (FRUs).

Use the SMU to configure and provision the system upon completing the hardware installation. As part of this process, configure and enable event notification so the system will notify you when a problem occurs that is at or above the configured severity. With event notification configured and enabled, you can follow the recommended actions in the notification message to resolve the problem, as further discussed in the options presented below.

Basic steps

The basic fault isolation steps are listed below:

  • Gather fault information, including using system LEDs.

  • Determine where in the system the fault is occurring.

  • Review event logs.

  • If required, isolate the fault to a data path component or configuration.

Cabling systems to enable use of the licensed Remote Snap feature to replicate volumes is another important fault isolation consideration pertaining to initial system installation.

Options available for performing basic steps

When performing fault isolation and troubleshooting steps, select the option or options that best suit your site environment. The four options described below are not mutually exclusive; you can use them in any combination. You can use the SMU to check the health icons/values for the system and its components to ensure that everything is okay, or to drill down to a problem component. If you discover a problem, both the SMU and the CLI provide recommended-action text online. Options for performing basic steps are listed according to frequency of use:

  • Use the SMU.

  • Use the CLI.

  • Monitor event notification.

  • View the enclosure LEDs.

Use the SMU

The SMU uses health icons to show OK, Degraded, Fault, or Unknown status for the system and its components, enabling you to monitor the health of the system at a glance. If any component has a problem, the system health will be Degraded, Fault, or Unknown. Use the SMU to drill down to each component that has a problem, and follow the actions in that component's Recommendation field to resolve the problem.

Use the CLI

As an alternative to using the SMU, you can run the show system command in the CLI to view the health of the system and its components. If any component has a problem, the system health will be Degraded, Fault, or Unknown, and those components will be listed as Unhealthy Components. Follow the recommended actions in the component Health Recommendation field to resolve the problem.

Monitor event notification

With event notification configured and enabled, you can view event logs to monitor the health of the system and its components. If a message tells you to check whether an event has been logged, or to view information about an event in the log, you can do so using either the SMU or the CLI. Using the SMU, view the event log and then click on the event message to see detail about that event. Using the CLI, run the show events detail command (with additional parameters to filter the output) to see the detail for an event.

View the enclosure LEDs

LEDs can be viewed on the hardware (while referring to the LED descriptions for your enclosure model) to identify component status. If a problem prevents access to either the SMU or the CLI, this is the only option available. However, monitoring and management are often done at a management console using storage management interfaces, rather than relying on line-of-sight to the LEDs of racked hardware components.

Performing basic steps

Any of the available options can be used to perform the basic steps of the fault isolation methodology.

Gather fault information

When a fault occurs, it is important to gather as much information as possible. Doing so will help you determine the correct action needed to remedy the fault.

Begin by reviewing the reported fault:

  • Is the fault related to an internal data path or an external data path?
  • Is the fault related to a hardware component such as a disk drive module, controller module, or power supply?

By isolating the fault to one of the components within the storage system, you will be able to determine the necessary action more quickly.

Determine where the fault is occurring

Once you have an understanding of the reported fault, review the enclosure LEDs. The enclosure LEDs are designed to alert users of any system faults, and might be what alerted you to the fault in the first place.

When a fault occurs, the Fault ID status LED on the enclosure right ear illuminates. Check the LEDs on the back of the enclosure to narrow the fault to a FRU, connection, or both. The LEDs also help you identify the location of a FRU reporting a fault.

Use the SMU to verify any faults found while viewing the LEDs. The SMU is also a good tool for determining where the fault is occurring if the LEDs cannot be viewed due to the location of the system. The SMU provides you with a visual representation of the system and where the fault is occurring. It can also provide more detailed information about FRUs, data, and faults.

Review the event logs

The event logs record all system events. Each event has a numeric code that identifies the type of event that occurred, and has one of the following severities:

  • Critical. A failure occurred that may cause a controller to shut down. Correct the problem immediately.

  • Error. A failure occurred that may affect data integrity or system stability. Correct the problem as soon as possible.

  • Warning. A problem occurred that may affect system stability, but not data integrity. Evaluate the problem and correct it if necessary.

  • Informational. A configuration or state change occurred, or a problem occurred that the system corrected. No immediate action is required.

It is very important to review the event logs, not only to identify the fault, but also to search for events that might have caused the fault to occur. For example, a host could lose connectivity to a disk group if a user changes channel settings without taking the storage resources assigned to it into consideration. In addition, the type of fault can help you isolate the problem to either hardware or software.
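The four severities above map directly to a recommended urgency of response. As an illustrative sketch (not part of the MSA firmware or CLI), the mapping can be expressed as a small lookup:

```python
# Illustrative mapping of the four MSA event severities described above
# to the recommended response from the documentation. The triage()
# helper is a hypothetical convenience function, not an HPE API.
SEVERITY_ACTION = {
    "Critical": "Correct the problem immediately.",
    "Error": "Correct the problem as soon as possible.",
    "Warning": "Evaluate the problem and correct it if necessary.",
    "Informational": "No immediate action is required.",
}

def triage(severity: str) -> str:
    """Return the recommended response for a logged event severity."""
    return SEVERITY_ACTION.get(severity, "Unknown severity; review the event log.")

print(triage("Critical"))  # Correct the problem immediately.
```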

Isolate the fault

Occasionally it might become necessary to isolate a fault. This is particularly true with data paths, due to the number of components comprising the data path. For example, if a host-side data error occurs, it could be caused by any of the components in the data path: controller module, cable, connectors, switch, or data host.

If the enclosure does not initialize

It may take up to two minutes for the enclosures to initialize. If the enclosure does not initialize:

  • Perform a rescan.

  • Power cycle the system.

  • Make sure the power cord is properly connected, and check the power source that it is connected to.

  • Check the event log for errors.

Correcting enclosure IDs

When installing a system with drive enclosures attached, the enclosure IDs might not agree with the physical cabling order. This is because the controller might have been previously attached to some of the same enclosures during factory testing, and it attempts to preserve the previous enclosure IDs if possible. To correct this condition, make sure that both controllers are up, and perform a rescan using the SMU or the CLI. This will reorder the enclosures, but can take up to two minutes for the enclosure IDs to be corrected.

To perform a rescan using the CLI, type the following command:

# rescan

To rescan using the SMU:

  1. Verify that both controllers are operating normally.

  2. Do one of the following:

    • Point to the System tab and select Rescan Disk Channels.
    • In the System topic, select Action > Rescan Disk Channels.
  3. Click Rescan.


Stopping I/O

When troubleshooting disk drive and connectivity faults, stop I/O to the affected disk groups from all hosts and remote systems as a data protection precaution. As an additional data protection precaution, it is recommended to conduct regularly scheduled backups of data.

NOTE: Stopping I/O to a disk group is a host-side task, and falls outside the scope of this document.

When on-site, you can verify there is no I/O activity by briefly monitoring the system LEDs. When accessing the storage system remotely, this is not possible; instead, you can use the show disk-group-statistics CLI command to determine whether input and output have stopped. Perform these steps:

  1. Using the CLI, run the show disk-group-statistics command. The Reads and Writes outputs show the number of these operations that have occurred since the statistic was last reset, or since the controller was restarted. Record the numbers displayed.

  2. Run the show disk-group-statistics command a second time. This provides a specific window of time (the interval between requesting the statistics) to determine if data is being written to or read from the disk group. Record the numbers displayed.

  3. To determine if any reads or writes occurred during this interval, subtract the set of numbers recorded in step 1 from the numbers recorded in step 2.

    • If the resulting difference is zero, then I/O has stopped.

    • If the resulting difference is not zero, a host is still reading from or writing to this disk group. Continue to stop I/O from hosts, and repeat step 1 and step 2 until the difference in step 3 is zero.
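The three steps above amount to sampling the counters twice and comparing. A minimal sketch, assuming a get_counters callable that returns the (Reads, Writes) totals parsed from the show disk-group-statistics output (transport and parsing are site-specific and not shown):

```python
import time

# Quiesce check from the steps above: sample the cumulative read/write
# counters twice over a known interval; a zero difference in both means
# I/O to the disk group has stopped. get_counters is a placeholder for
# whatever mechanism retrieves the CLI statistics at your site.
def io_has_stopped(get_counters, interval_s: float = 5.0) -> bool:
    """Return True if no reads or writes occurred between two samples."""
    reads1, writes1 = get_counters()   # step 1: first sample
    time.sleep(interval_s)             # the measurement window
    reads2, writes2 = get_counters()   # step 2: second sample
    # step 3: subtract the first sample from the second
    return (reads2 - reads1) == 0 and (writes2 - writes1) == 0
```

If the function returns False, hosts are still issuing I/O; continue stopping I/O and repeat the check until it returns True.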

    See the CLI Reference Guide, available from the Hewlett Packard Enterprise Information Library, for additional information.

For more information, see the article "Diagnostic Procedures".


Controller failure

Cache memory is flushed to CompactFlash in the case of a controller failure or power loss. During the write to CompactFlash process, only the components needed to write the cache to the CompactFlash are powered by the supercapacitor. This process typically takes 60 seconds per 1 GB of cache. After the cache is copied to CompactFlash, the remaining power left in the supercapacitor is used to refresh the cache memory. While the cache is being maintained by the supercapacitor, the Cache Status LED flashes at a rate of 1/10 second on and 9/10 second off.
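Using the figure quoted above (roughly 60 seconds per 1 GB of cache), the worst-case flush time can be estimated. This is a rough planning estimate only; actual times vary with hardware and cache contents:

```python
# Rough estimate of CompactFlash flush time, assuming the ~60 seconds
# per 1 GB figure stated in the text. Illustrative only, not firmware
# behavior.
def estimated_flush_seconds(cache_gb: float, seconds_per_gb: float = 60.0) -> float:
    """Estimate how long flushing cache_gb of cache to CompactFlash takes."""
    return cache_gb * seconds_per_gb

print(estimated_flush_seconds(4))  # 240.0 (i.e., about 4 minutes for 4 GB)
```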

NOTE: Transportable cache only applies to single-controller configurations. In dual controller configurations, there is no need to transport cache from a failed controller to a replacement controller because the cache is duplicated between the peer controllers (subject to volume write optimization setting).

If the controller has failed or does not start, is the Cache Status LED on/blinking?

  • No, the Cache Status LED is off, and the controller does not boot - If valid data is thought to be in flash, see Transporting cache below; otherwise, replace the controller module.

  • No, the Cache Status LED is off, and the controller boots - The system has flushed data to disks. If the problem persists, replace the controller module.

  • Yes, at a strobe 1:10 rate (1 Hz), and the controller does not boot - See Transporting cache below.

  • Yes, at a strobe 1:10 rate (1 Hz), and the controller boots - The system is flushing data to CompactFlash. If the problem persists, replace the controller module.

  • Yes, at a blink 1:1 rate (1 Hz), and the controller does not boot - See Transporting cache below.

  • Yes, at a blink 1:1 rate (1 Hz), and the controller boots - The system is in self-refresh mode. If the problem persists, replace the controller module.

Transporting cache

To preserve the existing data stored in the CompactFlash, you must transport the CompactFlash from the failed controller to a replacement controller using the procedure outlined in HPE MSA Controller Module Replacement Instructions, shipped with the replacement controller module. Failure to use this procedure will result in the loss of data stored in the cache module.

NOTE: Remove the controller module only after the copy process is complete, which is indicated by the Cache Status LED being off, or blinking at 1:10 rate.


Isolating a host-side connection fault

During normal operation, when a controller module host port is connected to a data host, the port's host link status/link activity LED is green. If there is I/O activity, the LED blinks green. If data hosts are having trouble accessing the storage system, and you cannot locate a specific fault or cannot access the event logs, use the following procedure. This procedure requires scheduled downtime.

NOTE: Do not perform more than one step at a time. Changing more than one variable at a time can complicate the troubleshooting process.

Host-side connection troubleshooting featuring host ports with SFPs

The procedure below applies to MSA 2050 SAN controller enclosures employing small form factor pluggable (SFP) transceiver connectors in 8/16 Gb FC, 10GbE iSCSI, or 1 Gb iSCSI host interface ports. In the following procedure, "SFP and host cable" is used to refer to any of the qualified SFP options supporting Converged Network Controller ports used for I/O or replication.

NOTE: When experiencing difficulty diagnosing performance problems, consider swapping out one SFP at a time to see if performance improves.
  1. Halt all I/O to the storage system as described in "Stopping I/O" above.

  2. Check the host link status/link activity LED. If there is activity, halt all applications that access the storage system.

  3. Check the Cache Status LED to verify that the controller cached data is flushed to the disk drives.

    • Solid - Cache contains data yet to be written to the disk.

    • Blinking - Cache data is being written to CompactFlash.

    • Flashing at 1/10 second on and 9/10 second off - Cache is being refreshed by the supercapacitor.

    • Off - Cache is clean (no unwritten data).

  4. Remove the SFP and host cable and inspect for damage.

  5. Reseat the SFP and host cable. Is the host link status/link activity LED on?

    • Yes - Monitor the status to ensure that there is no intermittent error present. If the fault occurs again, clean the connections to ensure that a dirty connector is not interfering with the data path.

    • No - Proceed to the next step.

  6. Move the SFP and host cable to a port with a known good link status. This step isolates the problem to the external data path (SFP, host cable, and host-side devices) or to the controller module port. Is the host link status/link activity LED on?

    • Yes - The SFP, host cable, and host-side devices are functioning properly. Return the SFP and cable to the original port. If the link status/link activity LED remains off, you have isolated the fault to the controller module port. Replace the controller module.

    • No - Proceed to the next step.

  7. Swap the SFP with the known good one. Is the host link status/link activity LED on?

    • Yes - You have isolated the fault to the SFP. Replace the SFP.

    • No - Proceed to the next step.

  8. Re-insert the original SFP and swap the cable with a known good one. Is the host link status/link activity LED on?

    • Yes - You have isolated the fault to the cable. Replace the cable.

    • No - Proceed to the next step.

  9. Verify that the switch, if any, is operating properly. If possible, test with another port.

  10. Verify that the HBA is fully seated, and that the PCI slot is powered on and operational.

  11. Replace the HBA with a known good HBA, or move the host side cable and SFP to a known good HBA. Is the host link status/link activity LED on?

    • Yes - You have isolated the fault to the HBA. Replace the HBA.

    • No - It is likely that the controller module needs to be replaced.

  12. Move the cable and SFP back to their original port. Is the host link status/link activity LED on?

    • No - The controller module port has failed. Replace the controller module.

    • Yes - Monitor the connection for a period of time. It may be an intermittent problem, which can occur with damaged SFPs, cables, and HBAs.


Isolating a controller module expansion port connection fault

During normal operation, when a controller module expansion port is connected to a drive enclosure, the expansion port status LED is green. If the connected port's expansion port LED is off, the link is down. Use the following procedure to isolate the fault.

This procedure requires scheduled downtime.

NOTE: Do not perform more than one step at a time. Changing more than one variable at a time can complicate the troubleshooting process.
  1. Halt all I/O to the storage system.

  2. Check the host activity LED. If there is activity, halt all applications that access the storage system.

  3. Check the Cache Status LED to verify that the controller cached data is flushed to the disk drives.

    • Solid - Cache contains data yet to be written to the disk.

    • Blinking - Cache data is being written to CompactFlash.

    • Flashing at 1/10 second on and 9/10 second off - Cache is being refreshed by the supercapacitor.

    • Off - Cache is clean (no unwritten data).

  4. Reseat the expansion cable, and inspect it for damage. Is the expansion port status LED on?

    • Yes - Monitor the status to ensure there is no intermittent error present. If the fault occurs again, clean the connections to ensure that a dirty connector is not interfering with the data path.

    • No - Proceed to the next step.

  5. Move the expansion cable to a port on the controller enclosure with a known good link status. This step isolates the problem to the expansion cable or to the controller module expansion port. Is the expansion port status LED on?

    • Yes - The expansion cable is good. Return the cable to the original port. If the expansion port status LED remains off, you have isolated the fault to the controller module expansion port. Replace the controller module.

    • No - Proceed to the next step.

  6. Move the expansion cable back to the original port on the controller enclosure.

  7. Move the expansion cable on the drive enclosure to a known good expansion port on the drive enclosure. Is the expansion port status LED on?

    • Yes - You have isolated the problem to the drive enclosure port. Replace the expansion module.

    • No - Proceed to the next step.

  8. Replace the cable with a known good cable, ensuring the cable is attached to the original ports used by the previous cable. Is the expansion port status LED on?

    • Yes - Replace the original cable. The fault has been isolated.

    • No - It is likely that the controller module must be replaced.


Isolating remote snap replication faults

See the article "Isolating Remote Snap replication faults" for this procedure.


Resolving voltage and temperature warnings

  1. Check that all of the fans are working by making sure the Voltage/Fan Fault/Service Required LED on each power supply is off, or by using the SMU to check enclosure health status.

  2. In the lower corner of the footer, the overall health status of the enclosure is indicated by a health status icon. For more information, point to the System tab and select View System to see the System panel. You can select front, rear, and table views on the System panel. If you point to a component, its associated metadata and health status display onscreen.

  3. Make sure that all modules are fully seated in their slots with latches locked.

  4. Make sure that no slots are left open for more than two minutes. If you need to replace a module, leave the old module in place until you have the replacement, or use a blank module to fill the slot. Leaving a slot open negatively affects the airflow and can cause the enclosure to overheat.

  5. Make sure there is proper air flow, and no cables or other obstructions are blocking the front or rear of the array.

  6. Try replacing each power supply module one at a time.

  7. Replace the controller modules one at a time.

  8. Replace SFPs one at a time (MSA 2050 SAN).

Sensor locations

The storage system monitors conditions at different points within each enclosure to alert you to problems. Power, cooling fan, temperature, and voltage sensors are located at key points in the enclosure. In each controller module and expansion module, the enclosure management processor (EMP) monitors the status of these sensors to perform SCSI enclosure services (SES) functions.

The following sections describe each element and its sensors.

Power supply sensors

Each enclosure has two fully redundant power supplies with load-sharing capabilities. The power supply sensors described in the following table monitor the voltage, current, temperature, and fans in each power supply. If the power supply sensors report a voltage that is under or over the threshold, check the input voltage.

  • Power supply 1 - Event/Fault ID LED condition: voltage, current, temperature, or fan fault

  • Power supply 2 - Event/Fault ID LED condition: voltage, current, temperature, or fan fault

Cooling fan sensors

Each power supply includes two fans. The normal range for fan speed is 4,000 to 6,000 RPM. When a fan speed drops below 4,000 RPM, the EMP considers it a failure and posts an alarm in the storage system event log. The following table lists the description, location, and alarm condition for each fan. If the fan speed remains under the 4,000 RPM threshold, the internal enclosure temperature may continue to rise. Replace the power supply reporting the fault.

  • Fan 1 (Power supply 1) - Event/Fault ID LED condition: < 4,000 RPM

  • Fan 2 (Power supply 1) - Event/Fault ID LED condition: < 4,000 RPM

  • Fan 3 (Power supply 2) - Event/Fault ID LED condition: < 4,000 RPM

  • Fan 4 (Power supply 2) - Event/Fault ID LED condition: < 4,000 RPM

During a shutdown, the cooling fans do not shut off. This allows the enclosure to continue cooling.
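The EMP's fan-speed rule described above (alarm below 4,000 RPM, normal range up to 6,000 RPM) can be sketched as a simple threshold check. This is illustrative only, not firmware code:

```python
# Thresholds from the text: normal fan speed is 4,000-6,000 RPM, and a
# reading below 4,000 RPM is treated as a failure (the EMP posts an
# alarm in the storage system event log).
FAN_MIN_RPM = 4000
FAN_MAX_RPM = 6000

def fan_status(rpm: int) -> str:
    """Classify a fan speed reading against the documented thresholds."""
    if rpm < FAN_MIN_RPM:
        return "fault"          # EMP would log an alarm
    if rpm > FAN_MAX_RPM:
        return "above normal"   # outside the stated normal range
    return "ok"

print(fan_status(3500))  # fault
```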

Temperature sensors

Extreme high and low temperatures can cause significant damage if they go unnoticed. When a temperature fault is reported, it must be remedied as quickly as possible to avoid system damage. This can be done by warming or cooling the installation location.

  • CPU temperature (internal digital thermal sensor) - Normal: 2°C to 98°C; Warning: 0°C to 1°C, 99°C to 104°C; Critical: None; Shutdown: ≤ 0°C, ≥ 104°C

  • SAS2008 internal digital sensor - Normal: 3°C to 112°C; Warning: 0°C to 2°C, 113°C to 115°C; Critical: None; Shutdown: ≤ 0°C, ≥ 115°C

  • Supercapacitor pack thermistor - Normal: 0°C to 50°C; Warning: None; Critical: None; Shutdown: None

  • On board temperature 1 - Normal: 0°C to 70°C; Warning: None; Critical: None; Shutdown: None

  • On board temperature 2 - Normal: 0°C to 70°C; Warning: None; Critical: None; Shutdown: None

  • On board temperature 3 - Normal: 0°C to 70°C; Warning: None; Critical: None; Shutdown: None
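The CPU temperature sensor ranges listed above can be expressed as a classifier. Boundary handling where adjacent ranges touch is an assumption, and the function itself is illustrative, not firmware code:

```python
# Classify a CPU temperature reading against the documented ranges:
# normal 2-98 C, warning 0-1 C and 99-104 C, shutdown at <= 0 C or
# >= 104 C. Exact boundary behavior is an assumption.
def cpu_temp_state(temp_c: float) -> str:
    if temp_c <= 0 or temp_c >= 104:
        return "shutdown"
    if 2 <= temp_c <= 98:
        return "normal"
    return "warning"  # the 0-1 C or 99-104 C bands

print(cpu_temp_state(45))   # normal
print(cpu_temp_state(100))  # warning
```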

When a power supply sensor goes out of range, the Fault/ID LED illuminates amber and an event is logged.

  • Power Supply 1 temperature - Normal: −10°C to 80°C

  • Power Supply 2 temperature - Normal: −10°C to 80°C

Power supply module voltage sensors

Power supply voltage sensors ensure that the enclosure power supply voltage is within normal ranges. There are three voltage sensors per power supply.

  • Power supply 1 voltage, 12 V - Event/Fault LED condition: < 11.00 V or > 13.00 V

  • Power supply 1 voltage, 5 V - Event/Fault LED condition: < 4.00 V or > 6.00 V

  • Power supply 1 voltage, 3.3 V - Event/Fault LED condition: < 3.00 V or > 3.80 V
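The per-rail limits above amount to a simple in-range check: a rail is healthy only while its reading stays between the low and high thresholds. A minimal sketch (illustrative, not firmware code):

```python
# Voltage limits from the table above: a fault condition is reported
# when a rail reading falls below the low limit or rises above the
# high limit.
RAIL_LIMITS = {
    "12V": (11.00, 13.00),
    "5V": (4.00, 6.00),
    "3.3V": (3.00, 3.80),
}

def rail_ok(rail: str, volts: float) -> bool:
    """Return True while the reading is within the documented limits."""
    low, high = RAIL_LIMITS[rail]
    return low <= volts <= high

print(rail_ok("12V", 12.1))  # True
print(rail_ok("5V", 3.9))    # False
```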


Legal Disclaimer: Products sold prior to the November 1, 2015 separation of Hewlett-Packard Company into Hewlett Packard Enterprise Company and HP Inc. may have older product names and model numbers that differ from current models.

Document title: HPE MSA 2050 Storage - Troubleshooting
Document ID: emr_na-a00022126en_us-5