Two-Node Hyper-V Failover Cluster Part 14 of 15: Test Failover (Live Migration + Auto-Failover)

Cluster built, HA VM running. Now we prove it works under both planned and unplanned conditions. Phase 1: Live Migration — planned move from Node-01 to Node-02 with zero downtime. Phase 2: kill Node-02 deliberately and watch the cluster auto-fail the VM back to Node-01. Always test failover BEFORE you need it.

Setup — manage both nodes from one Hyper-V Manager

Hyper-V Manager on Node-01 showing the HA VM created in Part 13, the starting state before the failover test
Starting state. Node-01 owns the HA VM created in Part 13. Hyper-V Manager confirms.

Hyper-V Manager right-click on the root with Connect to Server menu item to add Node-02 to the management view
Right-click Hyper-V Manager root > Connect to Server. Add Node-02 so you can see both hosts in one console.

Connect to Computer dialog with Node-02 entered as the second host to manage
Enter Node-02 > OK.

Hyper-V Manager showing Node-02 in the inventory with no VMs running on it currently
Hyper-V Manager now shows both nodes. Node-02 currently has no VMs — the HA VM lives on Node-01.
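
Prefer PowerShell? The same view is available from an elevated prompt on either node. A minimal sketch ("HA-VM" stands in for whatever you named the VM in Part 13):

    # List both cluster nodes and their state (FailoverClusters module)
    Get-ClusterNode

    # Show which node owns each clustered role, including the HA VM
    Get-ClusterGroup

    # See VMs on both hosts at once, the PowerShell version of the two-node console view
    Get-VM -ComputerName Node-01, Node-02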

Phase 1 — Live Migration (planned move, zero downtime)

Failover Cluster Manager Roles pane with right-click on the HA VM showing Move > Live Migration > Select Node menu
Phase 1 (planned): FCM > Roles > right-click HA VM > Move > Live Migration > Select Node.

Move Clustered Role dialog with Node-02 selected as the live migration target
Pick Node-02 > OK.

FCM showing the VM in Live Migrating state during the memory copy phase
Status: Live Migrating. Memory + state being copied to Node-02. Connections preserved.

Status: Live Migrating. Behind the scenes:

  1. Cluster Service starts copying VM memory pages to Node-02
  2. Memory copy iterates as the VM keeps running on Node-01 — only changed pages re-copy
  3. When delta is small enough, brief pause (<1 sec): final memory state + CPU state copied
  4. VM resumes on Node-02 — clients don’t notice
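
The GUI move has a one-line PowerShell equivalent, handy for scripting maintenance windows. A minimal sketch, assuming "HA-VM" as the role name (use whatever you named the VM in Part 13):

    # Live-migrate the clustered VM role to Node-02 (run on a cluster node)
    Move-ClusterVirtualMachineRole -Name "HA-VM" -Node "Node-02" -MigrationType Live
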
After live migration completed showing Owner Node now Node-02, and Hyper-V Manager on Node-02 showing the HA VM running
Done. Owner: Node-02. Hyper-V Manager on Node-02 shows the HA VM running. Zero downtime: active TCP connections and in-flight transactions are preserved. Live migration is the gold standard for planned moves (patching, maintenance).
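
To confirm the result without opening Hyper-V Manager on Node-02, a quick check (again, "HA-VM" is a placeholder name):

    # OwnerNode should now read Node-02 and State should be Online
    Get-ClusterGroup -Name "HA-VM" | Format-Table Name, OwnerNode, State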

Hyper-V Manager on Node-01 after migration showing no VMs (the HA VM has moved to Node-02)
Hyper-V Manager on Node-01: no VMs. The HA VM successfully migrated.

Phase 2 — auto-failover (crash test)

Now the unplanned scenario. The VM is on Node-02. We kill Node-02 deliberately and verify the cluster reacts correctly.

Node-02 being shut down to simulate a node failure while the HA VM is running on it
Phase 2 (crash test): shut down Node-02 while the HA VM is running on it. Cluster will detect and react.

Shut down Node-02. (For a more brutal test, hard power-off Node-02 from the hypervisor — that simulates a real crash instead of a graceful shutdown.)
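
From PowerShell, the graceful version of the test is a remote shutdown; the harsher version only applies if your lab nodes are themselves VMs on a parent host, as the note above implies. Both lines are a sketch, adjust names to your environment:

    # Graceful test: shut down Node-02 remotely
    Stop-Computer -ComputerName Node-02 -Force

    # Harsher test (nested lab only): hard power-off the Node-02 guest from the parent hypervisor
    Stop-VM -Name "Node-02" -TurnOff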

FCM showing the VM auto-detecting Node-02 outage and entering Live Migrating state to fail back to Node-01
FCM detects Node-02 down. VM enters Live Migrating state — cluster auto-fails the VM back to Node-01.

Cluster Service detects Node-02 down within ~5-10 seconds (heartbeat loss). The VM’s cluster role is now “orphaned” — needs a new owner. Cluster picks Node-01 (the only surviving node) and starts the VM there.

After auto-failover completed showing the VM running again on Node-01
Auto-failover complete. VM running on Node-01 again. (Auto-failover incurs ~30-60 sec downtime — not zero like Live Migration — because the cluster has to detect failure first and then cold-start the VM.)

Important difference vs Live Migration: auto-failover is a cold start — the VM was in the middle of running on Node-02 when Node-02 died. The VM’s memory state was lost. Cluster cold-starts the VM from its last on-disk state on Node-01. Active connections drop, in-flight transactions roll back. Typical downtime: 30-60 seconds.
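
You can watch the whole thing from Node-01 instead of FCM. A sketch, with "HA-VM" as the placeholder role name:

    # Node-02 should show as Down, Node-01 as Up
    Get-ClusterNode

    # The VM role should come back Online with OwnerNode = Node-01
    Get-ClusterGroup -Name "HA-VM" | Format-Table Name, OwnerNode, State

    # Optional: pull the last 15 minutes of cluster log if you want the detection/restart timeline
    Get-ClusterLog -Node Node-01 -TimeSpan 15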

Phase 3 — recover Node-02

Node-02 being powered back on to restore cluster redundancy
Power Node-02 back on. Wait for boot.

FCM Nodes pane showing both Node-01 and Node-02 Up after recovery, cluster fully healthy with redundancy restored
Both nodes Up. Cluster redundancy restored. End state: HA VM survived a deliberate node loss with brief automatic recovery. The cluster works as designed.
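
A quick PowerShell check confirms redundancy is back, and you can optionally live-migrate the VM over to Node-02 if you want to rebalance (sketch; "HA-VM" is a placeholder):

    # Both nodes should report Up
    Get-ClusterNode

    # Optional: move the VM to Node-02 now that it has rejoined
    Move-ClusterVirtualMachineRole -Name "HA-VM" -Node "Node-02" -MigrationType Live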

Live Migration vs Auto-Failover — what’s the difference?

Aspect | Live Migration (planned) | Auto-Failover (crash)
Trigger | Admin clicks Move | Node failure detected
Memory state | Copied to target before switch | Lost — cold start on new node
Downtime | Zero (sub-second pause) | 30-60 seconds typical
TCP connections | Preserved | Dropped — clients reconnect
Use for | Patching, maintenance, load balancing | Crashes, hardware failure

Use Live Migration whenever you have a planned reason to move a VM — patching, hardware maintenance, balancing load. Use auto-failover only when you have to (which is the entire point of clustering).

Things that bite people in this part

Live Migration fails: “Operation timed out”

The VM is dirtying memory faster than the cluster can copy it, so the copy never converges. Common with memory-heavy workloads (large SQL, busy web servers). Mitigation: schedule the migration during a low-activity window, or use Quick Migration (saves state to disk first — a few seconds of downtime instead of zero).
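
Quick Migration is the same cmdlet with a different migration type: save the VM's state to disk, move ownership, restore on the target. Expect a few seconds of downtime. A sketch, with "HA-VM" as the placeholder name:

    # Quick migration: save state, move the role, restore on Node-02
    Move-ClusterVirtualMachineRole -Name "HA-VM" -Node "Node-02" -MigrationType Quick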

VM doesn’t auto-failover

Check FCM > Roles > VM Properties > Failover tab. Verify "Maximum failures in the specified period" isn't set too low for the period. If the VM failed multiple times in a short window, the cluster may stop trying and leave the role in a Failed state.
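
The same settings are exposed as properties on the cluster group, so you can audit them across every role at once. The property names are the cluster's own; the values below are just an example:

    # FailoverThreshold = max failures allowed within FailoverPeriod (hours)
    Get-ClusterGroup | Format-Table Name, FailoverThreshold, FailoverPeriod, AutoFailbackType

    # Example: allow up to 2 failures in a 6-hour window for the HA VM ("HA-VM" is a placeholder)
    $group = Get-ClusterGroup -Name "HA-VM"
    $group.FailoverThreshold = 2
    $group.FailoverPeriod = 6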

30-second downtime feels too long

That's the cost of a cold start. To reduce it: smaller VMs, faster storage (NVMe), tighter cluster heartbeat settings so failures are detected sooner. To eliminate it: switch to a different HA model (e.g., load balancer + multiple VM replicas, or application-level HA).
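
Heartbeat detection is controlled by cluster-wide properties; tightening them shortens detection time at the cost of more false positives on a flaky network. A sketch, values illustrative rather than a recommendation:

    # Current settings: delay = ms between heartbeats, threshold = missed beats before a node is declared down
    Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold

    # Example: declare a same-subnet node down after 5 missed 1-second heartbeats (~5 sec detection)
    (Get-Cluster).SameSubnetDelay = 1000
    (Get-Cluster).SameSubnetThreshold = 5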

Auto-failover happens but VM doesn’t start

Usually means the VHDX path isn’t accessible from the surviving node. CSV path should be the same on every node. Check that C:\ClusterStorage\Volume1 has the VM files.
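
A quick way to confirm the surviving node can actually reach the VM's files (adjust the volume and folder names to your layout):

    # The CSV should be Online and owned by a surviving node
    Get-ClusterSharedVolume

    # The VM configuration and VHDX should be visible under the same path on every node
    Get-ChildItem -Recurse C:\ClusterStorage\Volume1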

Forgot to test before production

Most common failure pattern across the industry. Cluster looks fine. Patches applied. Failover never tested. Real outage hits at 03:00 — failover doesn’t work. Test failover quarterly.
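
A quarterly drill doesn't need to be elaborate. A scripted live migration there and back plus a fresh validation report catches most drift (sketch; "HA-VM" is a placeholder and the report path is an arbitrary example):

    # Planned-move drill: migrate to Node-02 and back
    Move-ClusterVirtualMachineRole -Name "HA-VM" -Node "Node-02" -MigrationType Live
    Move-ClusterVirtualMachineRole -Name "HA-VM" -Node "Node-01" -MigrationType Live

    # Re-run cluster validation and keep the report
    Test-Cluster -ReportName C:\Reports\ClusterValidation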

What’s next

Failover proven. Part 15 covers expanding cluster storage — adding new LUNs and bringing them under cluster ownership. See the full series at Hyper-V Failover Clustering pathway.
