High availability overview

The High Availability (HA) feature has three components: network redundancy, hardware redundancy, and software redundancy.

Key goals of HA include:

Minimize unplanned network outages: five 9s (99.999%) uptime for switching traffic.
Fault tolerance: The failure of any single active component does not cause an outage.
Live replacement of hardware with minimal or no disruption.
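
To make the five 9s goal concrete, the availability figure can be converted into a yearly downtime budget. The following sketch is plain arithmetic, not part of the HA feature; the function name is illustrative and a 365-day year is assumed.

```python
# Convert an availability percentage into an allowed-downtime budget.
# Assumes a 365-day year; the function name is illustrative only.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year that the availability allows."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for label, pct in [("three 9s", 99.9), ("four 9s", 99.99), ("five 9s", 99.999)]:
        print(f"{label}: {downtime_minutes_per_year(pct):.2f} minutes per year")
```

Under these assumptions, five 9s leaves a budget of roughly 5.26 minutes of unplanned outage per year.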

Key parts of the HA feature include:

Network redundancy: Protocols and redundant network paths provide redundancy in the network, enabling traffic to continue flowing if a network link or network switch fails.
Hardware redundancy: Redundant hardware components (power supplies, fabric cards, management modules) allow traffic switching and system management to continue during a hardware failure or hardware maintenance. This functionality is supported through:
- Fast failover (management failover)
- Hot insert and removal (all field-replaceable hardware components)
Redundancy of specific, field-replaceable hardware components includes:
- Redundancy management (management modules, or MMs), which is responsible for:
  - HA infrastructure
  - File synchronization
  - OVSDB synchronization
  - MM failover
  - Standby MM configuration
  - Software version update
  The Active MM controls infrastructure, files, and the database. If the Active MM is removed, all management passes to the Standby MM.
- Fabric redundancy (fabric cards)
- Network interface redundancy (line cards)
- Power management (power supplies)
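
The Active/Standby management module relationship described above can be pictured as a small state model. This is a conceptual sketch only: the class, method, and slot names are hypothetical and do not correspond to any switch API. It illustrates state being synchronized to the Standby MM and management passing to it when the Active MM is removed.

```python
# Conceptual model of Active/Standby management module (MM) failover.
# All names here are hypothetical; this is not a switch API.
from dataclasses import dataclass, field


@dataclass
class ManagementModule:
    slot: str
    role: str  # "active" or "standby"
    database: dict = field(default_factory=dict)  # stand-in for synchronized state


class Chassis:
    def __init__(self, active: ManagementModule, standby: ManagementModule):
        self.modules = {active.slot: active, standby.slot: standby}

    def active(self) -> ManagementModule:
        """Return the module that currently holds the Active role."""
        return next(m for m in self.modules.values() if m.role == "active")

    def write(self, key: str, value: str) -> None:
        """Apply a change and synchronize it to every installed MM."""
        for module in self.modules.values():
            module.database[key] = value

    def remove(self, slot: str) -> None:
        """Remove an MM; if it was Active, the Standby takes over management."""
        removed = self.modules.pop(slot)
        if removed.role == "active" and self.modules:
            survivor = next(iter(self.modules.values()))
            survivor.role = "active"


if __name__ == "__main__":
    chassis = Chassis(
        ManagementModule("mm-1", "active"),
        ManagementModule("mm-2", "standby"),
    )
    chassis.write("hostname", "core-switch")
    chassis.remove("mm-1")
    print(chassis.active().slot, chassis.active().database)
```

Because every write in this model is copied to the Standby before any failover, the surviving module already holds the same state when it becomes Active.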

Software redundancy: Software, including daemons, provides redundancy by supporting one or more of the following restart approaches:
- Nonstop switching restart:
  - The daemon reads its last known state or the current hardware state from OVSDB.
  - The daemon adjusts its internal state to match the last known state.
  - Traffic is not interrupted, and there is no moment at which the last known configuration is not in effect.
  - The daemon restarts quickly enough to respond to any peer communication without timing out.
  - Examples include LACP, ACLs, TCAM entries, and MSTP.
- Graceful restart:
  - Current state is still read from OVSDB. Traffic follows the rules of this state until the protocol has fully recovered.
  - Connections to other switches are re-established.
  - Current state is republished to peers, which can then respond back with adjustments.
  - Examples include routing protocols.
- Full state reset:
  - Any non-default runtime state the daemon has in hardware or OVSDB is forced back to the default state.
  - All connections are closed and must be restarted manually.
  - This is primarily for user-facing daemons and features for which the default state does not have a large impact on traffic.
  - Examples include SSH, web server, TFTP, and CLI.
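
The difference between the three restart approaches can be sketched as follows. This is a simplified model under stated assumptions: a plain dictionary stands in for OVSDB, and the enum, function, and daemon names are hypothetical rather than taken from the switch software.

```python
# Conceptual model of the three daemon restart approaches.
# A plain dict stands in for OVSDB; every name below is hypothetical.
import copy
from enum import Enum


class RestartMode(Enum):
    NONSTOP = "nonstop switching restart"
    GRACEFUL = "graceful restart"
    FULL_RESET = "full state reset"


def default_state() -> dict:
    """Default runtime state: no connections or sessions."""
    return {"sessions": []}


def restart_daemon(mode: RestartMode, db: dict, name: str) -> dict:
    """Return the daemon's internal state after a restart in the given mode."""
    if mode is RestartMode.FULL_RESET:
        # Non-default runtime state is forced back to the default state,
        # so existing connections are gone and must be restarted manually.
        db[name] = default_state()
        return {"state": default_state(), "republish_to_peers": False}

    # Nonstop and graceful restarts both recover the last known state.
    recovered = copy.deepcopy(db.get(name, default_state()))
    # Only a graceful restart reconnects to peers and republishes state.
    return {
        "state": recovered,
        "republish_to_peers": mode is RestartMode.GRACEFUL,
    }


if __name__ == "__main__":
    db = {
        "lacp-daemon": {"sessions": ["peer-a"]},
        "routing-daemon": {"sessions": ["peer-b"]},
        "ssh-daemon": {"sessions": ["admin"]},
    }
    print(restart_daemon(RestartMode.NONSTOP, db, "lacp-daemon"))
    print(restart_daemon(RestartMode.GRACEFUL, db, "routing-daemon"))
    print(restart_daemon(RestartMode.FULL_RESET, db, "ssh-daemon"))
```

In this model, the first two modes leave the stored state untouched and differ only in whether peers are contacted again, while the full state reset overwrites the stored state with defaults.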