High availability overview
Redundant Management
OVSDB synchronization
Filesystem replication
Minimize unplanned network outages: target "five 9s" (99.999%) uptime for switching traffic.
Fault tolerant: No single active component failure will cause an outage.
Live replacement of hardware with minimal or no disruption.
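To put the five 9s goal in concrete terms, the short sketch below converts an uptime percentage into a yearly downtime budget. It is plain arithmetic, not a vendor tool; the function name `allowed_downtime_minutes` is made up for this illustration, and a 365-day year is assumed.

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes = 525,600.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

At 99.999%, the budget is 525,600 x 0.00001 = 5.256 minutes per year, which is why every failover and restart path has to complete within seconds.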
MM: Abbreviation for management module
MM to MM link: Refers to the 10GbE-KR Ethernet link between two MMs
OVSDB: Abbreviation for Open vSwitch Database
Active MM: Management module that has control of the chassis
Standby MM: Backup management module for the active management module
JSON-RPC: Remote procedure call protocol encoded in JSON
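As an illustration of the JSON-RPC term, the sketch below builds and parses JSON-RPC 1.0 messages of the kind the OVSDB management protocol (RFC 7047) exchanges. The `list_dbs` method comes from that RFC; whether the management modules use this exact call is an assumption, and the helper names and the canned reply (including the database name `ExampleDb`) are hypothetical.

```python
import json
from typing import Any


def build_request(method: str, params: list, request_id: int) -> str:
    """Serialize a JSON-RPC 1.0 request: method name, positional params, and an id."""
    return json.dumps({"method": method, "params": params, "id": request_id})


def parse_response(raw: str, expected_id: int) -> Any:
    """Return the result of a JSON-RPC 1.0 response, or raise if it reports an error."""
    message = json.loads(raw)
    if message.get("id") != expected_id:
        raise ValueError(
            f"response id {message.get('id')!r} does not match request id {expected_id!r}"
        )
    if message.get("error") is not None:
        raise RuntimeError(f"remote call failed: {message['error']}")
    return message.get("result")


if __name__ == "__main__":
    # Request the list of databases served by an OVSDB server.
    print(build_request("list_dbs", [], 1))
    # Canned reply for illustration only; a real reply arrives over the connection.
    reply = '{"result": ["ExampleDb"], "error": null, "id": 1}'
    print(parse_response(reply, 1))
```

The `id` field is what pairs a response with its request, so the parser rejects a reply whose id does not match.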
- Network redundancy: Protocols and redundant network paths provide redundancy in the network, enabling traffic to continue flowing if a network link or network switch fails.
- Hardware redundancy: Redundant hardware components (power supplies, fabric cards, management modules) allow continued switching traffic or system management in the event of a hardware failure or hardware maintenance. This functionality is supported through:
- Fast failover (management failover)
- Hot insert and removal (all field-replaceable hardware components)
Redundancy of specific, field-replaceable hardware components includes:
Redundancy management (management modules), which is in charge of:
HA infrastructure
File synchronization
OVSDB synchronization
MM failover
Standby MM configuration
Software version update
The Active MM controls infrastructure, files, and the database. If the Active MM is removed, all management passes to the Standby MM.
Fabric redundancy (fabric cards)
Network interface redundancy (line cards)
Power management (power supplies)
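The management module behavior described above (the Active MM owns infrastructure, files, and the database, and the Standby MM takes over when the Active MM is removed) can be sketched as a toy model. This is not the switch's implementation: the `ManagementModule` and `Chassis` classes, their fields, and the slot numbers are all hypothetical, and synchronization is reduced to copying two dictionaries.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ManagementModule:
    slot: int
    role: str = "standby"                        # "active" or "standby"
    ovsdb: dict = field(default_factory=dict)    # stand-in for the database
    files: dict = field(default_factory=dict)    # stand-in for replicated files


class Chassis:
    def __init__(self, active: ManagementModule, standby: ManagementModule):
        active.role = "active"
        standby.role = "standby"
        self.active: Optional[ManagementModule] = active
        self.standby: Optional[ManagementModule] = standby

    def synchronize(self) -> None:
        """Copy the Active MM's database and files to the Standby MM."""
        if self.active is None or self.standby is None:
            return
        self.standby.ovsdb = copy.deepcopy(self.active.ovsdb)
        self.standby.files = copy.deepcopy(self.active.files)

    def remove_active(self) -> ManagementModule:
        """Simulate pulling the Active MM: the Standby MM becomes active."""
        if self.active is None or self.standby is None:
            raise RuntimeError("no Standby MM available to take over")
        removed = self.active
        self.active, self.standby = self.standby, None
        self.active.role = "active"
        return removed


if __name__ == "__main__":
    chassis = Chassis(ManagementModule(slot=5), ManagementModule(slot=6))
    chassis.active.ovsdb["hostname"] = "core-1"
    chassis.synchronize()
    chassis.remove_active()
    print(chassis.active.slot, chassis.active.role, chassis.active.ovsdb)
```

The point of the model is the ordering: because the Standby MM already holds a synchronized copy of the database and files, taking over does not require rebuilding state.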
- Software redundancy: Software components, including daemons, provide redundancy by supporting one or more of the following restart approaches:
- Nonstop switching restart:
The daemon reads its last known state or the current hardware state from OVSDB.
The daemon adjusts its internal state to match the last known state.
Traffic is not interrupted, and the last known configuration remains in effect at all times.
The daemon restarts fast enough to respond to any peer communication without timing out.
Examples include LACP, ACLs, TCAM entries, and MSTP.
- Graceful restart:
Current state is still read from OVSDB. Traffic follows the rules of this state until the protocol has fully recovered.
Connections to other switches are re-established.
Current state is republished to peers, which can then respond back with adjustments.
Examples include routing protocols.
- Full state reset:
Any non-default runtime state the daemon has in hardware or OVSDB is forced back to the default state.
Any open connections are closed and must be re-established manually.
This is primarily for user-facing daemons and features for which the default state does not have a large impact on traffic.
Examples include SSH, web server, TFTP, and CLI.
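The three restart approaches differ mainly in what a daemon does with the state stored in OVSDB. The sketch below contrasts them in a toy model; it is an illustration under stated assumptions, not how any actual daemon is written. A plain dictionary stands in for OVSDB, peers are modeled as callables that receive the republished state and return adjustments, and the `Daemon` class, `DEFAULT_STATE`, and all field names are hypothetical.

```python
import copy

# Hypothetical default runtime state for any daemon in this model.
DEFAULT_STATE = {"enabled": False, "sessions": []}


class Daemon:
    def __init__(self, name: str, ovsdb: dict):
        self.name = name
        self.ovsdb = ovsdb      # dict standing in for OVSDB
        self.state: dict = {}   # the daemon's internal state, lost on restart

    def nonstop_restart(self) -> None:
        """Adopt the last known state; OVSDB (and so forwarding) is not touched."""
        self.state = copy.deepcopy(self.ovsdb[self.name])

    def graceful_restart(self, peers) -> None:
        """Read state from OVSDB, republish it to peers, then apply their adjustments."""
        self.state = copy.deepcopy(self.ovsdb[self.name])
        for peer in peers:
            # Each peer sees the republished state and may answer with changes.
            self.state.update(peer(copy.deepcopy(self.state)))
        self.ovsdb[self.name] = copy.deepcopy(self.state)

    def full_state_reset(self) -> None:
        """Force both OVSDB and internal state back to defaults; sessions are dropped."""
        self.ovsdb[self.name] = copy.deepcopy(DEFAULT_STATE)
        self.state = copy.deepcopy(DEFAULT_STATE)


if __name__ == "__main__":
    db = {
        "lacp": {"enabled": True, "sessions": ["lag1"]},
        "routing": {"enabled": True, "routes": 10},
        "ssh": {"enabled": True, "sessions": ["admin"]},
    }
    lacp = Daemon("lacp", db)
    lacp.nonstop_restart()
    routing = Daemon("routing", db)
    routing.graceful_restart([lambda state: {"routes": state["routes"] + 2}])
    ssh = Daemon("ssh", db)
    ssh.full_state_reset()
    print(lacp.state, routing.state, ssh.state)
```

In this model only the full state reset discards what was in the database, which matches the guidance that it suits daemons whose default state has little impact on traffic.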