
SQL Always On Availability Groups Part 7: Manual Failover Testing

The AG is healthy and replicating (verified in Part 6). The whole point of building this is failover — so we exercise it now. Three tests in this part:

  1. Manual failover via the Start Failover Wizard — SQL-NODE-01 to SQL-NODE-02, then back. Zero-data-loss role swap.
  2. Restart the primary VM — watch the cluster auto-promote the secondary. Verify the role does NOT swing back when the original primary returns.
  3. Hard shutdown the primary VM — same automatic promote, faster, more abrupt.

Failover testing is the only way to know your AG actually works. Don’t skip it — finding out at 3am that automatic failover never worked is a bad day.

Test 1: manual failover SQL-NODE-01 → SQL-NODE-02

[Screenshot: Start Failover Wizard launched from the AG-DEMO context menu on SQL-NODE-02]

On SQL-NODE-02, in SSMS Object Explorer expand AG-DEMO > right-click > Failover…

Why run the wizard from the target side? You can run it from either node, but starting from the node you want to BECOME primary makes the wizard simpler — it pre-selects that node.

[Screenshot: Failover Wizard introduction page]

Welcome > Next.

[Screenshot: Select New Primary Replica page with SQL-NODE-02 selected]

Select New Primary Replica: pick SQL-NODE-02. Status should say No data loss. This is because we’re in synchronous commit mode — the secondary has every transaction the primary has committed.

If you see Data loss or Synchronizing, do NOT failover. The secondary is behind on log records and a failover would lose data. Wait for sync to catch up first.

[Screenshot: Connect to Replica step]

Connect to Replica: click Connect. The wizard verifies it can authenticate to the target replica. Next.

[Screenshot: Summary page of the Failover Wizard]

Summary > Finish.

[Screenshot: Results page showing the failover succeeded]

Results: success. SQL-NODE-02 is now PRIMARY. The listener IP moved with it — any apps connected via AG-LISTENER,1433 are now talking to SQL-NODE-02 transparently.

Swap back: manual failover SQL-NODE-02 → SQL-NODE-01

[Screenshot: Start Failover Wizard launched again, this time on SQL-NODE-01]

On SQL-NODE-01, run the same wizard to swap back: AG-DEMO > Failover…

[Screenshot: Failover Wizard introduction page]

Welcome > Next.

[Screenshot: Select New Primary Replica page with SQL-NODE-01 selected]

Select New Primary Replica: SQL-NODE-01. Next.

[Screenshot: Summary page]

Summary > Finish.

[Screenshot: Results page confirming the swap-back succeeded]

Success. SQL-NODE-01 is primary again. Bidirectional manual failover confirmed.

[Screenshot: AG dashboard on SQL-NODE-01]

The dashboard from SQL-NODE-01 confirms Primary = SQL-NODE-01.

Confirm writes work after failover

[Screenshot: New Query window on the new primary running an INSERT against demodb]

On the now-primary SQL-NODE-01, confirm writes work:

USE demodb;
INSERT INTO dbo.Employees VALUES (5, 'Frank', 'Cloud');
[Screenshot: New Query window running a SELECT against demodb]

SELECT confirms the row landed.

Test 2: restart the primary VM

Now we test the automatic failover path — what happens when the primary disappears unexpectedly.

[Screenshot: Hyper-V Manager restarting the SQL-NODE-01 VM]

In Hyper-V Manager on the host, right-click SQL-NODE-01 > Restart. This sends an OS-level restart command (graceful shutdown + boot back). Watch what happens to the AG.

[Screenshot: SSMS on SQL-NODE-02 showing the connection error, then automatic failover]

Switch to SQL-NODE-02’s SSMS. Within ~10 seconds:

  1. Connection to SQL-NODE-01 fails
  2. The cluster detects the primary node is unreachable
  3. SQL-NODE-02 is auto-promoted to primary
  4. The listener IP moves to SQL-NODE-02

You may see a brief popup error in SSMS as it loses its current connection — reconnect.

[Screenshot: SQL-NODE-01 back up with all SQL services running]

Eventually SQL-NODE-01 finishes its restart cycle and all its SQL services come back up.

[Screenshot: AG dashboard showing SQL-NODE-02 still primary after SQL-NODE-01 rejoins]

Important behaviour: SQL-NODE-02 stays primary. The role does NOT auto-swing back to SQL-NODE-01. SQL-NODE-01 rejoins the AG as a secondary.

This is by design. AG won’t fail back automatically because doing so would cause an immediate second outage as soon as the original primary recovers. If you want to return SQL-NODE-01 to primary, you do it manually with the Failover Wizard.

Verify replication continues across the failover

[Screenshot: New Query on SQL-NODE-02 running an INSERT against demodb]

INSERT on the new primary (SQL-NODE-02) succeeds:

INSERT INTO dbo.Employees VALUES (6, 'Grace', 'Sec');
[Screenshot: SELECT on SQL-NODE-02 showing the inserted rows]

SELECT on SQL-NODE-02 confirms.

[Screenshot: SELECT on the SQL-NODE-01 readable secondary showing the replicated row]

On SQL-NODE-01 (now secondary), SELECT returns the new row too — replication continued seamlessly through the failover and is now flowing in the other direction.

Test 3: hard shutdown the primary VM

[Screenshot: Hyper-V Manager shutting down the SQL-NODE-02 VM]

Hyper-V Manager > right-click SQL-NODE-02 > Shut Down (not Restart). This simulates a node hard-down event — the OS gets a shutdown signal, but the VM is going down for real, not coming back on its own.

[Screenshot: SQL-NODE-01 auto-promoted back to primary after the shutdown]

The cluster detects the loss faster than with a restart (no graceful shutdown wait). SQL-NODE-01 auto-promotes back to primary.

[Screenshot: starting the SQL-NODE-02 VM back up]

Manually start the SQL-NODE-02 VM in Hyper-V Manager.

[Screenshot: SQL services starting on SQL-NODE-02]

SQL services start up automatically as the OS boots.

[Screenshot: AG dashboard with SQL-NODE-01 primary and SQL-NODE-02 back as secondary]

AG dashboard: SQL-NODE-01 stays primary. SQL-NODE-02 rejoins as secondary and catches up on the log records that accumulated during its absence. Sync state goes from Synchronizing back to Synchronized within seconds.

[Screenshot: dashboard on SQL-NODE-02 confirming SQL-NODE-01 is primary]

On SQL-NODE-02 (now secondary again), the dashboard confirms the topology and we’re back to a steady state. Three failover tests passed: manual round-trip, restart auto-failover, shutdown auto-failover.

Things that bite people

Failover wizard greys out “No data loss” option

The secondary is behind on log records. Wait for synchronization to catch up. If it never catches up, the network or the secondary disk is too slow — investigate.

Automatic failover doesn’t happen on primary loss

Replica is configured with Failover Mode = Manual, not Automatic. Or the secondary is in asynchronous commit mode (auto failover requires sync mode). Or both nodes lost quorum (e.g. the witness died at the same time). Check Failover Cluster Manager > cluster events.

Failover triggers but listener doesn’t move

Cluster network resource for the listener is misconfigured, or the secondary node can’t bind to the listener IP (subnet mismatch). Check the Listener resource in FCM.

Apps don’t pick up the new primary

Apps are connecting to SQL-NODE-01 directly instead of AG-LISTENER. Or the connection string is missing MultiSubnetFailover=True (required for cross-subnet AGs). Always use the listener.

Manual failover causes Synchronizing forever

You manually failed over before sync caught up. SQL allows manual failover with data loss but warns you. If you proceed, the new primary’s log diverges from the old primary’s and reseeding may be needed.

Tested failover in production with real workload

Don’t test failover for the first time on Friday afternoon with users on the system. Build the lab (this series), test failover here. When you replicate to production, schedule a maintenance window for the first failover test.

You’ve got a working AG — now what?

You now have a fully functional SQL Always On Availability Group, end-to-end:

  • Two SQL Server 2022 nodes on Windows Server 2025
  • Backed by a Windows Failover Cluster with file share witness
  • One database (demodb) replicating synchronously between both nodes
  • Readable secondary (Enterprise / Evaluation feature)
  • Listener for transparent application connectivity
  • Manual + automatic failover both verified working

From here, the things to learn next:

  • Backup strategy — AG changes how backups work; you can offload to the secondary
  • Multi-database AGs — group related databases together so they fail over as a unit
  • Distributed AGs — AG of AGs, for cross-region DR
  • Read-only routing — route reporting connections to the secondary automatically via connection-string ApplicationIntent
  • Monitoring — sys.dm_hadr_* DMVs, Extended Events, alerting on dashboard yellow/red

The full pathway is at SQL Availability Groups. Thanks for following along.
