
SQL Server FCI Part 8 of 13: Testing the Failover (Break Things on Purpose)

Eight parts of work get tested in 30 seconds. We create some baseline data on the active node, kill the active node hard, watch the cluster move SQL to the other node, and verify the data is still there + writeable. Always test failover BEFORE you need it. Finding it broken at 03:00 is the worst possible discovery.

Phase 1 — create baseline data on Node-01

Phase 1 baseline. SSMS > AOFCI > new Database (Students) > new Table (Employees) > INSERT a few rows. This is the data we’ll verify after failover.

Node-01 is currently the active node. Open SSMS, connect to AOFCI:

CREATE DATABASE Students;
GO
USE Students;
GO
CREATE TABLE Employees (
    EmpID INT IDENTITY(1,1) PRIMARY KEY,
    Name NVARCHAR(100),
    Department NVARCHAR(50),
    HireDate DATE
);
GO
INSERT INTO Employees (Name, Department, HireDate) VALUES
    ('jdoe', 'Engineering', '2024-01-15'),
    ('Alice Smith', 'Marketing', '2024-03-22'),
    ('Bob Wilson', 'Sales', '2024-05-08');
GO
SELECT * FROM Employees;

Three rows. This is what we expect to see after failover.
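Optionally, capture a cheap fingerprint of the baseline now. Re-running the same query after failover (before any new inserts) and getting identical values confirms the rows came through the node switch untouched. A sketch using standard T-SQL aggregates:

```sql
-- Fingerprint the baseline: row count plus an aggregate checksum.
-- Re-run this exact query after failover, before inserting anything new;
-- matching values mean the data survived the move intact.
SELECT
    COUNT(*)                  AS [RowCount],
    CHECKSUM_AGG(CHECKSUM(*)) AS [TableChecksum]
FROM Students.dbo.Employees;
```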

Phase 2 — verify the starting state

Pre-failover state — FCM Storage > Disks: Cluster Disk 1 owned by Node-01.

Roles: SQL Server (MSSQLSERVER) — Owner Node: Node-01. Starting position.

Cross-check: SSMS on Node-02 connected to AOFCI shows the same Employees data.

Same row counts on both views. Data consistency confirmed.

Cross-check from Node-02: open SSMS, connect to AOFCI, query Employees. Same three rows. (You’re hitting the SQL service on Node-01 over the network from Node-02 here — that proves the AOFCI virtual name resolves correctly.)
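If you want proof of which physical node is actually serving you, rather than inferring it from FCM, SQL Server can tell you directly. For an FCI, `@@SERVERNAME` returns the virtual network name, while the physical host comes from `SERVERPROPERTY`:

```sql
-- For an FCI, @@SERVERNAME returns the virtual network name (AOFCI),
-- while ComputerNamePhysicalNetBIOS exposes the node that currently
-- owns the instance. Run before and after failover to watch it change.
SELECT
    @@SERVERNAME                                  AS VirtualName,
    SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS PhysicalNode;
```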

Phase 3 — kill Node-01

Phase 3: kill Node-01. Start > Power > Shut Down.

Sign in to Node-01. Start > Power > Shut Down.

For a more brutal test, instead power off the VM hard from the hypervisor — this simulates a real crash without graceful shutdown. The cluster behaviour is the same, but the heartbeat loss detection is faster (no graceful drain).

Node-01 going offline. Cluster begins detecting heartbeat loss.

Phase 4 — observe failover on Node-02

Quickly switch to Node-02. Open Failover Cluster Manager. Watch the Roles pane.

Within ~10-30 seconds: FCM on Node-02 shows the SQL role briefly Pending, then Running. Owner Node: Node-02. The cluster has automatically failed over.

Within ~10-30 seconds:

  1. Cluster detects Node-01 heartbeat loss (~5 sec)
  2. Quorum vote confirms Node-01 is gone
  3. Cluster forcibly takes ownership of the SAN disks (~10 sec)
  4. SQL Server service starts on Node-02 (~10-30 sec)
  5. AOFCI virtual IP and DNS get associated with Node-02

Roles status: briefly Pending, then Running. Owner Node: Node-02.

Nodes pane: Node-01 Down (red icon), Node-02 Up.

Disks: Cluster Disk 1 now owned by Node-02. The SAN ownership transferred atomically — no two-node simultaneous-write window.

File Explorer on Node-02 shows the data drive mounted — the SAN LUNs that Node-01 owned 30 seconds ago are now Node-02’s.

Phase 5 — verify data availability + writability

Phase 5: data verification. SSMS on Node-02 > AOFCI > SELECT * FROM Employees. All baseline rows present.

Open SSMS on Node-02. Connect to AOFCI (same name — that’s the whole point). SELECT * FROM Students.dbo.Employees; all three rows present. Read works.

Write test: INSERT INTO Employees ....

Success. SQL is fully writeable on the failover node. The whole point of FCI proven.

Now the write test: INSERT INTO Employees (Name, Department, HireDate) VALUES ('Failover Test', 'IT', '2026-05-10'); returns 1 row affected. Write works.

SQL is fully operational on Node-02 even though Node-01 is dead. This is the entire promise of FCI being delivered.
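The Phase 5 read and write checks, collected into one script you can keep alongside the runbook (a sketch; it assumes the Phase 1 schema and baseline data):

```sql
USE Students;
GO
-- Read check: all baseline rows survived the failover.
SELECT COUNT(*) AS BaselineRows FROM dbo.Employees;   -- expect 3
GO
-- Write check: the database is writeable on the new owner node.
INSERT INTO dbo.Employees (Name, Department, HireDate)
VALUES ('Failover Test', 'IT', '2026-05-10');
GO
-- Confirm the write landed.
SELECT * FROM dbo.Employees WHERE Name = 'Failover Test';
```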

Phase 6 — recover Node-01

Phase 6: recover Node-01. Power on the VM. Wait for boot.


Node-01 boots, rejoins the domain, and the cluster service starts on it automatically. About 60 seconds total.

FCM Nodes: both Up. Cluster healthy with redundancy restored.

Phase 7 — failback (or don’t)

Roles: still owned by Node-02, even after Node-01 came back.

Default behaviour: no auto-failback. Windows Cluster intentionally avoids flapping — the admin verifies the recovered node is genuinely healthy before moving production traffic back. Imagine Node-01 crashing in a loop: auto-failback would trigger an outage every 5 minutes as SQL bounced between nodes.

Verification on Node-01: SSMS > AOFCI > SELECT — the “Failover Test” row inserted while Node-01 was down is visible. Sync is perfect because Node-02’s writes went to the same shared storage that Node-01 now sees again.

Optional: manual failback. Right-click the SQL role > Move > Select Node.

Move Clustered Role dialog: pick Node-01 as the target, click OK.

Optional manual failback: right-click the SQL role > Move > Select Node > Node-01 > OK. Cluster moves SQL back in 30-60 seconds. Or leave it on Node-02 — both nodes are equal.

RTO measurement

Recovery Time Objective — how long is the outage during failover? Typical components:

  • Heartbeat loss detection: ~5 seconds (configurable)
  • Quorum vote + disk reservation steal: ~10 seconds
  • SQL service start on new node: ~10-30 seconds (depends on database recovery)
  • DNS TTL for AOFCI: 60 seconds default (clients with cached DNS may briefly hit dead IP)

Total: 30-90 seconds typical for FCI failover. Tune for production SLA — e.g., shorten DNS TTL to 5 seconds if your application can’t tolerate 60-second blips.

Things that bite people in this part

SQL takes forever to come up on the failover node

If your databases have lots of unrecovered transactions (large in-flight workload at crash time), SQL spends time rolling forward / rolling back during recovery. Big databases = longer failover RTO. Mitigation: smaller transactions, more frequent checkpoints.
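One concrete knob for the "more frequent checkpoints" mitigation: indirect checkpoints (SQL Server 2012+) let you set a per-database crash-recovery target, which bounds the slowest part of failover. A sketch, applied here to the test database; the 60-second value is illustrative and should be validated under your real write workload:

```sql
-- Indirect checkpoints: ask the engine to keep the dirty-page backlog
-- small enough that crash recovery for this database targets ~60 seconds.
-- A shorter target means more background checkpoint I/O, so test first.
ALTER DATABASE Students SET TARGET_RECOVERY_TIME = 60 SECONDS;
```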

Failover happens but clients don’t reconnect

Most modern client libraries auto-retry on connection loss. Older drivers (e.g., really old ODBC) may need manual retry. Test with your actual application before declaring victory.
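For .NET clients specifically, SqlClient has built-in connection resiliency you can enable from the connection string. The keyword names below are real SqlClient options; the values are illustrative:

```
Server=AOFCI;Database=Students;Integrated Security=true;
ConnectRetryCount=3;ConnectRetryInterval=10
```

Note this covers broken idle connections being transparently re-established; a failover mid-transaction still surfaces an error the application has to catch and retry.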

DNS cached

Clients with cached DNS may keep trying to reach the old IP for the duration of the TTL. Fix: shorten DNS TTL (60s default) or use a load-balanced VIP.

Auto-failback enabled by accident

FCM > Roles > right-click > Properties > Failover tab > “Allow failback”. If ticked, the cluster will move SQL back to its preferred owner immediately. Combined with a flapping node, this causes repeated outages. Default off — keep it that way unless you have a specific reason.

Testing only graceful shutdown, not power-off

Graceful shutdown lets the cluster drain the role first — failover is faster and cleaner. Power-off simulates a real crash and tests the heartbeat-loss path. Test both. Production crashes don’t come with graceful shutdown.

Storage failure not tested

Killing a compute node tests compute failover. It does NOT test storage failure (SAN going down). Test that separately: simulate iSCSI Target VM going down and observe cluster behaviour. (Spoiler: cluster goes red, SQL stops; FCI doesn’t protect against shared-storage failure — that’s a different HA tier.)

What’s next

Two-node FCI proven to work. But what about scaling up? Part 9 prepares Node-03 (network config, OS prep, joining the cluster). See the full series at the SQL Server Clustering pathway.
