Multi-Location Active Directory, Part 4: How Replication Actually Works (KCC, ISTG, DSA, ESE, RPC)

The previous three posts in this series stood up the lab and built out the topology — Part 1 the why and the lab, Part 2 the headquarters location, Part 3 the branch DC via Install From Media. With the lab now running three DCs across two locations, this post takes a step back from the configuration walkthrough to cover what is actually happening under the hood when AD replicates: which components produce the topology, which components ship the bytes, and which components store and retrieve them. Knowing the cast lets the next post’s site-link configuration make sense rather than feel like clicking through dialogs.

What replication actually is

Active Directory is a multi-master directory: every DC holds a writable copy of the domain partition, every DC accepts writes, and every write made on one DC has to propagate to every other DC in the same partition. Replication is the process that makes that propagation happen. Without it, a password reset done on the headquarters DC would never reach the branch DC, a new user created in Surat-HO wouldn’t be findable from Delhi-BO, and the directory would split into per-DC islands within minutes of the first divergent write.

The shape of the propagation isn’t arbitrary. AD doesn’t broadcast every change to every DC; that wouldn’t scale past a handful of DCs. Instead, every DC subscribes to a small number of upstream partners (its “inbound neighbours” in repadmin /showrepl output) and pulls changes from them. The total set of these subscriptions is the replication topology — a graph of which DCs talk to which other DCs and in which direction. The topology is built and maintained automatically by two services running across the directory; the rest of this post is about who they are and what each one does.
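
Those subscriptions are readable from any DC. For a quick look at who a given DC pulls from (DEL22-DC03 here is just the lab's branch DC), run from an elevated prompt:

# Inbound partners of the local DC, per partition
repadmin /showrepl
# Or target a specific DC by name
repadmin /showrepl DEL22-DC03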

The five components in the replication stack

Replication isn’t one thing. It’s a stack of components, each with a narrow responsibility, that compose into the end-to-end “a write here lands there” behaviour. From top to bottom:

  • KCC — Knowledge Consistency Checker. Builds the intra-location replication topology on every DC.
  • ISTG — Inter-Site Topology Generator. One DC per location, elected as the cross-location topology builder. Picks bridgehead servers and creates the cross-location connections.
  • RPC — Remote Procedure Call. The network protocol DCs use to request changes and ship the actual updates.
  • DSA — Directory System Agent. The interface that every other AD component goes through to read and write the directory.
  • ESE — Extensible Storage Engine. The on-disk database engine that stores ntds.dit.

Each layer has a clear input, a clear output, and a clear failure mode. Knowing them means “replication is broken” gets decomposed into “the KCC can’t build the topology because…” or “the RPC layer is failing because…” instead of remaining a vague black box.

Knowledge Consistency Checker (KCC)

The KCC runs on every DC. Its job is to build the intra-location replication topology — the connections between DCs in the same AD location. It runs every 15 minutes by default; it can be forced manually with:

repadmin /kcc

What the KCC does in detail: it reads the configuration partition from the local AD database (via the DSA, covered below) to learn the current location and DC layout. It computes a topology that satisfies AD’s rules — every DC has at least two upstream neighbours where possible, no DC is more than three hops from any other DC in the same location, and the topology forms a ring rather than a chain. It then writes the resulting topology to AD as connection objects — objects that name a source DC and a destination DC and represent “DC X pulls from DC Y.” The connection objects are what repadmin /showrepl reads back when it shows you the inbound neighbours.
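
To read those connection objects back directly, a minimal sketch; the second command assumes the RSAT ActiveDirectory PowerShell module is installed, while repadmin works on any DC:

# Connection objects on a specific DC, as the KCC wrote them
repadmin /showconn DEL22-DC03
# The same class of object via PowerShell; the AutoGenerated flag marks KCC-built connections
Get-ADReplicationConnection -Filter *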

The KCC is per-DC and reactive. If a DC fails, the KCC on the surviving DCs notices within 15 minutes (or sooner if forced) and rebuilds connection objects to route around the failure. If a DC is added, the KCC detects it on the next run and adds it to the topology. Manual intervention is rarely needed; the connection objects you see in dssite.msc are usually all KCC-generated, and editing them by hand is the wrong move 95% of the time (the KCC will reset them on its next run).

When to manually run repadmin /kcc

The 15-minute default is fine for steady-state operation. Force a run when you’ve just made a topology change (added a location, added a DC, moved a DC between locations) and don’t want to wait for the next scheduled tick — the new connections show up immediately and replication starts on the new path within seconds.
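
repadmin /kcc with no arguments runs against the local DC; pointing it at a named DC (the branch DC here) triggers the run remotely:

# Trigger an immediate KCC pass on a specific DC
repadmin /kcc DEL22-DC03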

Inter-Site Topology Generator (ISTG)

The ISTG is the cross-location counterpart to the KCC. Where the KCC handles intra-location connections, the ISTG handles connections that cross AD locations. One DC per location is elected as the location’s ISTG — by default, the first DC promoted into the location, with automatic failover if that DC goes down.

The ISTG’s job is twofold:

  • Pick bridgehead servers. A bridgehead is the DC in a location that handles the cross-location replication traffic to and from other locations. By concentrating cross-location traffic on one (or a few) DCs per location, AD avoids fan-out across every DC and link — which would saturate the WAN.
  • Build cross-location connection objects. Once bridgeheads are chosen, the ISTG creates the connection objects that wire bridgehead-to-bridgehead. Those connection objects are what actually carry replication updates between Surat-HO and Delhi-BO.

The ISTG considers the cost values you set on site links, the schedule windows you configured, and the available DCs in each location. It computes the least-cost routes between locations from those site-link costs and then picks bridgeheads from the eligible DCs in each location. In a two-location lab the choices are trivial — SRT22-DC01 is the Surat-HO ISTG and bridgehead, and DEL22-DC03 is the Delhi-BO ISTG and bridgehead because it’s the only DC there. In a 50-location enterprise, the ISTG’s graph computation is the difference between “replication converges in minutes” and “replication converges in days.”
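
Both decisions can be read back instead of inferred; repadmin reports the elected ISTG per location and the bridgeheads currently in use:

# The elected ISTG for each location
repadmin /istg
# The bridgehead servers currently carrying cross-location replication
repadmin /bridgeheads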

KCC vs ISTG — the scope distinction

Component | Scope | What it produces | How often
KCC | One per DC | Intra-location connection objects | Every 15 min, on every DC
ISTG | One per location | Cross-location connection objects + bridgehead selection | Every 15 min, on the elected DC only

Directory System Agent (DSA)

The DSA is the API every other AD component uses when it needs to read or write the directory. It runs as ntds.dll, loaded inside the LSA process (lsass.exe) on every DC. The KCC, the ISTG, Active Directory Users and Computers (ADUC), the LDAP client library, the authentication subsystem — all of them go through the DSA when they need to read a configuration partition object, write a user-attribute change, or enumerate the contents of a container.

From the administrator’s perspective, the DSA is invisible. You don’t configure it, monitor it, or restart it independently of the AD DS service. Its name shows up in two places: the DSA Options line in repadmin /showrepl output (which reports global flags like IS_GC) and the DSA object GUID line on the same output (which is the immutable identifier for this DC’s nTDSDSA configuration object). When AD logs replication errors, the source identifier is usually the DSA object GUID, not the friendly DC name — useful to know when reading event logs.

Extensible Storage Engine (ESE)

The ESE is the database engine that stores the AD database file (%SystemRoot%\NTDS\ntds.dit). It’s the same engine Microsoft uses for Exchange, the Windows Search index, and a handful of other Windows components — a transactional, indexed, B-tree-based store with crash recovery via write-ahead logs.

From the replication perspective, the ESE matters because it’s where every change ultimately lands. A password reset on a Surat DC isn’t real until the ESE on that DC has committed the corresponding row update to ntds.dit. The transaction log files (edb.log and the numbered edb*.log files) in the same NTDS folder are how the ESE recovers from a power loss without corrupting the database; because AD’s ESE instance uses circular logging by default, those logs cover crash recovery but not the point-in-time roll-forward recovery that Exchange’s ESE configuration supports.
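
Seeing the pieces on disk makes the layering concrete; this just lists the default NTDS folder (adjust the path if the database was relocated during promotion):

# ntds.dit plus the edb*.log write-ahead logs, side by side in the default location
Get-ChildItem "$env:SystemRoot\NTDS" | Select-Object Name, Length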

The ESE is the layer where database-level corruption shows up — the symptoms are the “NTDS Database” event-log entries with messages like “The database engine is unable to read the database” or “Inconsistent log files detected.” Recovery options at that point go through ntdsutil (offline defragmentation, semantic database analysis, integrity check). The ESE itself doesn’t expose any direct configuration; you tune through registry values under HKLM\System\CurrentControlSet\Services\NTDS\Parameters if at all.

Directory Replication Service RPC Protocol

RPC (Remote Procedure Call) is the network protocol AD uses for replication. When SRT22-DC01 needs to send updates to DEL22-DC03, it opens an RPC connection to the destination’s DSA endpoint, authenticates via Kerberos using the source DC’s computer account, and streams the updates over the wire. The DSA on the destination side commits them to the local ESE.

RPC isn’t a single port. It uses TCP 135 (the endpoint mapper) for the initial “which port should I actually talk to?” lookup, then a port from the dynamic high range (49152–65535 by default on Windows Server 2008+; 1024–5000 on older releases) for the actual data exchange. This is why “just open TCP 135” on a firewall doesn’t make replication work — you need either the full dynamic range opened or a registry pin that fixes the AD replication port to a single value (HKLM\System\CurrentControlSet\Services\NTDS\Parameters\TCP/IP Port).
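
A minimal sketch of the pin, run on each DC that replicates across the firewall; 50000 is an arbitrary example port, not a recommendation:

# Pin AD replication to a single fixed port (example value 50000)
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" -Name "TCP/IP Port" -Value 50000 -PropertyType DWORD -Force
# Restart the directory service so the new port takes effect (AD DS is restartable on 2008+)
Restart-Service NTDS -Force

The matching firewall rule for that single port still has to exist between the DCs; the pin only removes the need to open the whole dynamic range.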

The other transport AD supports for inter-location replication is SMTP, which the ISTG can use for the configuration and schema partitions over a site link configured for SMTP transport. SMTP transport is rare in modern environments — everyone uses RPC even between locations — but it’s why repadmin /showrepl output sometimes says “via SMTP” instead of “via RPC” on legacy installations.

Putting it together: the path of a single write

A user’s password is reset by a helpdesk admin connected to SRT22-DC01. What happens, layer by layer, until DEL22-DC03 has the new hash:

  1. Write lands at SRT22-DC01. ADUC (or the password-reset cmdlet) calls into the DSA, which validates the change against the schema and ACLs, and hands the row update to the ESE. The ESE writes the new password hash to ntds.dit and assigns the change a new USN (Update Sequence Number) within the DC’s own counter.
  2. Replication notification fires. The DSA records that an attribute on this user has changed since the last replication checkpoint with each replication partner. By default, intra-location partners are notified within 15 seconds; cross-location partners aren’t notified at all by default — the bridgehead pulls on the schedule the ISTG and site link define (Part 5).
  3. SRT22-DC02 (intra-location partner) pulls. The KCC-generated connection object on SRT22-DC02 names SRT22-DC01 as a source. SRT22-DC02’s DSA opens an RPC connection to SRT22-DC01’s DSA, authenticates, asks for “changes since USN N,” receives the password-hash update over RPC, and commits it to its own ESE. End-to-end intra-location convergence: usually under 30 seconds.
  4. DEL22-DC03 (cross-location bridgehead) pulls. The ISTG-generated connection object on DEL22-DC03 names a Surat-HO bridgehead (SRT22-DC01) as a source. The pull happens on the schedule set by the site link — typical default is every 180 minutes (3 hours), tunable down to 15 minutes or up to multiple hours per the deployment’s WAN-traffic budget. When the pull fires, the same RPC-then-DSA-then-ESE flow happens.
  5. Convergence is complete. All three DCs in the lab now hold the new password hash. A subsequent authentication against any of them succeeds.
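
The whole path can be verified from the outside without touching any of the layers directly:

# One-line-per-DC summary: last failure, failure count, and the largest replication delta
repadmin /replsummary
# The branch DC's inbound partners and the timestamp of the last successful pull from each
repadmin /showrepl DEL22-DC03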

Things that bite people in production

The KCC silently rebuilds your manual connection edits

It’s tempting to manually create or modify connection objects in dssite.msc when you’ve got a specific routing in mind. The KCC overwrites edits to its own auto-generated connections within 15 minutes; a connection you create from scratch is flagged as manual and survives, but the KCC will happily generate additional connections alongside it, defeating the layout you were aiming for. Lean on site links + costs to influence the topology, not direct connection edits.

ISTG election failures look like a stuck topology

If the elected ISTG goes offline and the failover doesn’t happen cleanly (a bugcheck on the bridgehead, a network partition that takes out the ISTG without taking down the rest of the location), cross-location connection objects don’t get rebuilt and replication stalls between locations until the ISTG election completes. repadmin /istg reports the current ISTG per location; if it’s wrong, force the election by promoting another DC in the same location or by running repadmin /options +DISABLE_INBOUND_REPL on the broken ISTG to take it out of the running.

RPC dynamic-port range vs firewall is the most common replication failure

Both DCs can reach each other on TCP 135, then the actual replication call fails with error 1722 (“The RPC server is unavailable”) because the dynamic port allocated by the endpoint mapper isn’t open on the firewall between the DCs. Either pin the port via HKLM\System\CurrentControlSet\Services\NTDS\Parameters\TCP/IP Port (DWORD value, set to a fixed port number, restart NTDS) and open just that, or open the full 49152–65535 range. Half-configured firewalls are responsible for the textbook “intra-location replication works, cross-location doesn’t” failure mode.
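
A quick way to tell the two layers apart from either side of the firewall (assuming PowerShell 4.0 or later for Test-NetConnection, and using 50000 as a stand-in for whatever dynamic or pinned port is in play):

# The endpoint mapper: usually reachable even on a half-configured firewall
Test-NetConnection -ComputerName DEL22-DC03 -Port 135
# The port that actually carries the replication data; blocked here means error 1722 on the replication call
Test-NetConnection -ComputerName DEL22-DC03 -Port 50000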

ESE-level corruption is recoverable but not automatic

If ntds.dit shows ESE-level corruption (event 1168 or 1142), the recovery sequence is: stop NTDS, run ntdsutil > activate instance ntds > files > integrity, then esentutl /p if the integrity check fails. Get a backup before any of this; esentutl /p is destructive (drops corrupted records). For a DC that’s part of a healthy multi-location replica set, the better recovery is usually to demote the corrupted DC, clean up its metadata, and re-promote with IFM from a healthy partner.
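
The same sequence as commands, for the case where repair is unavoidable. This is a sketch only, and every line assumes a verified backup was taken first:

# Stop AD DS so the database can be opened offline
Stop-Service NTDS -Force
# Offline integrity check of ntds.dit
ntdsutil "activate instance ntds" files integrity quit quit
# Last resort, destructive repair: drops corrupted records
esentutl /p "$env:SystemRoot\NTDS\ntds.dit"
# Bring AD DS back up
Start-Service NTDS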

The DSA object GUID never changes; the invocation ID does

Both appear in repadmin /showrepl. The DSA object GUID identifies the DC; it’s baked into the nTDSDSA object at promotion and persists for the life of the DC. The invocation ID identifies the database instance and changes whenever the DC is restored from backup. Confusing the two is a common mistake when reading replication failure logs — an invocation-ID change is a serious signal (USN rollback risk); a DSA object GUID stays static.
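
Both values sit in the header of the per-DC output, so a side-by-side check across the lab is one command per DC:

# Header lines "DSA object GUID" and "DSA invocationID" are the two identifiers in question
repadmin /showrepl DEL22-DC03
# The invocation ID is also exposed via PowerShell (RSAT ActiveDirectory module assumed)
Get-ADDomainController -Identity DEL22-DC03 | Select-Object Name, InvocationId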

Where this fits in the series

This was the conceptual interlude. Part 5 returns to the configuration walkthrough — tuning the site link cost and schedule between Surat-HO and Delhi-BO, configuring bridgehead servers explicitly where the auto-pick isn’t the right answer, and verifying cross-location convergence under realistic conditions. Part 1 covered the why; Part 2 the headquarters; Part 3 the branch with IFM. For the per-DC health check, see the guide to reading repadmin /showrepl output; the Active Directory pathway covers the rest of the surface area.
