High availability overview

The High Availability (HA) feature has three components:
  • Redundant management

  • OVSDB synchronization

  • Filesystem replication

Key goals of HA include:
  • Achieve five-nines (99.999%) availability of switching traffic by minimizing unplanned network outages.

  • Provide fault tolerance: no failure of a single active component causes an outage.

  • Allow live replacement of hardware with minimal or no disruption.
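To make the five-nines goal concrete, the following minimal sketch converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions chosen for illustration; they are not part of the product.

```python
# Downtime budget implied by an availability target.
# Assumption: a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


five_nines = downtime_minutes_per_year(99.999)
print(f"{five_nines:.2f} minutes per year")  # about 5.26 minutes
```

Under these assumptions, a five-nines target leaves a little over five minutes of unplanned outage per year.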

Terminology:
  • MM: Abbreviation for management module

  • MM-to-MM link: The 10GbE-KR Ethernet link between two MMs

  • OVSDB: Abbreviation for Open vSwitch Database

  • Active MM: Management module that has control of the chassis

  • Standby MM: Backup management module for the active management module

  • JSON-RPC: Remote procedure call protocol encoded in JSON
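As an illustration of the JSON-RPC term, the sketch below builds a request and parses a matching response. It assumes the message shape of the OVSDB management protocol (RFC 7047), in which "list_dbs" is a defined method; the helper name make_request and the database name "ExampleDb" are hypothetical, and nothing here is specific to the MM-to-MM link.

```python
import json


def make_request(method: str, params: list, request_id: int) -> str:
    """Encode a JSON-RPC request as a JSON string (hypothetical helper)."""
    return json.dumps({"method": method, "params": params, "id": request_id})


# Ask the server which databases it hosts.
request = make_request("list_dbs", [], 0)

# A matching response echoes the request id and carries a result and a null error.
# "ExampleDb" is a made-up database name used only for illustration.
response = json.loads('{"result": ["ExampleDb"], "error": null, "id": 0}')

if response["id"] == 0 and response["error"] is None:
    print("databases:", response["result"])
```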

Key parts of the HA feature include:
  • Network redundancy: Protocols and redundant network paths allow traffic to continue flowing if a network link or network switch fails.

  • Hardware redundancy: Redundant hardware components (power supplies, fabric cards, management modules) allow continued switching traffic or system management in the event of a hardware failure or hardware maintenance. This functionality is supported through:
    • Fast failover (management failover)

    • Hot insert and removal (all field-replaceable hardware components)
    Redundancy of specific, field-replaceable hardware components includes:
    • Redundancy management (management modules), which is in charge of:

      • HA infrastructure

      • File synchronization

      • OVSDB synchronization

      • MM failover

      • Standby MM configuration

      • Software version update

      The Active MM controls infrastructure, files, and the database. If the Active MM is removed, all management passes to the Standby MM.

    • Fabric redundancy (fabric cards)

    • Network interface redundancy (line cards)

    • Power management (power supplies)

  • Software redundancy: Software components (including daemons) provide redundancy by supporting one or more of the following restart methods:
    • Nonstop switching restart:
      • The daemon reads its last known state or the current hardware state from OVSDB.

      • The daemon adjusts its internal state to match the last known state.

      • There is no traffic interruption, and at no point is the last known configuration not in effect.

      • The daemon restarts fast enough to respond to protocols that require peer communication without timing out.

      • Examples include LACP, ACLs, TCAM entries, and MSTP.

    • Graceful restart:
      • Current state is still read from OVSDB. Traffic follows the rules of this state until the protocol has fully recovered.

      • Connections to other switches are re-established.

      • Current state is republished to peers, which can then respond back with adjustments.

      • Examples include routing protocols.

    • Full state reset:
      • Any non-default runtime state the daemon has in hardware or OVSDB is forced back to the default state.

      • Any connections are closed and must be restarted manually.

      • This is primarily for user-facing daemons and features for which the default state does not have a large impact on traffic.

      • Examples include SSH, web server, TFTP, and CLI.
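The three restart methods described under software redundancy can be contrasted in a small model. This is a sketch under stated assumptions, not the switch software: a plain dictionary stands in for OVSDB, and the Daemon class, its fields, and the restart function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class RestartMethod(Enum):
    NONSTOP = "nonstop switching restart"
    GRACEFUL = "graceful restart"
    FULL_RESET = "full state reset"


@dataclass
class Daemon:
    """Hypothetical daemon model; not an actual switch component."""
    name: str
    method: RestartMethod
    defaults: dict
    state: dict = field(default_factory=dict)
    peers_notified: bool = False
    connections_open: bool = True


def restart(daemon: Daemon, db: dict) -> Daemon:
    """Restart `daemon` against `db`, a plain dict standing in for OVSDB."""
    if daemon.method is RestartMethod.FULL_RESET:
        # Force runtime state back to defaults, in the daemon and in the database.
        db[daemon.name] = dict(daemon.defaults)
        daemon.state = dict(daemon.defaults)
        daemon.connections_open = False  # connections must be restarted manually
        return daemon
    # Nonstop and graceful restarts both begin from the state stored in the database.
    daemon.state = dict(db.get(daemon.name, daemon.defaults))
    if daemon.method is RestartMethod.GRACEFUL:
        # Model re-establishing peer connections and republishing current state.
        daemon.peers_notified = True
    return daemon


db = {"lacp": {"lag1": "up"}, "bgp": {"routes": 12}, "sshd": {"sessions": 3}}
lacp = restart(Daemon("lacp", RestartMethod.NONSTOP, defaults={}), db)
bgp = restart(Daemon("bgp", RestartMethod.GRACEFUL, defaults={}), db)
sshd = restart(Daemon("sshd", RestartMethod.FULL_RESET, defaults={"sessions": 0}), db)
```

In this model, only the full state reset discards the stored state and closes connections. The nonstop and graceful paths both resume from the database, and the graceful path additionally republishes state to peers.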