EIGRP in Clos Topologies

Link state protocols or BGP are typically used for underlay routing in data center fabrics, and deployment of these protocols in Clos (aka spine and leaf) topologies is well documented. I decided to experiment with EIGRP to get a better understanding of its operation in such topologies. Although I don’t expect to ever see a production network using an EIGRP underlay, I think it’s totally usable in moderate scale 3 stage fabrics. I did, however, find a scalability concern with 5 stages of routers.

Demo Network

I built the network below in CML for testing. Each fabric link is configured with a /30 subnet and each leaf has a /32 loopback address to act as a VTEP for an overlay. This is a standard spine and leaf, aka 3 stage network.
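
For reference, the per-leaf addressing looks roughly like the sketch below. The interface numbers and exact addresses are illustrative (chosen to be consistent with the routing table output shown later, not copied from the lab); the point is simply a /30 on each fabric link and a /32 loopback for the VTEP.

interface Loopback0
 description VTEP address for the overlay
 ip address 172.16.128.1 255.255.255.255
!
interface Ethernet0/0
 description Uplink to a spine
 ip address 172.16.0.2 255.255.255.252
!
interface Ethernet0/1
 description Uplink to a spine
 ip address 172.16.0.6 255.255.255.252
!
interface Ethernet0/2
 description Uplink to a spine
 ip address 172.16.0.10 255.255.255.252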

What is the Underlay?

There is a “need” for layer 2 mobility within data center fabrics. The most popular solution right now is a routed network as the “underlay” and a MAC-in-IP “overlay” tunnel, allowing ethernet frames to be tunneled over the IP network. VXLAN is the most common overlay tunneling protocol. Usually, each leaf switch (the place where endpoints connect) will have a loopback interface that serves as the tunnel source and destination IP for MAC-in-IP tunneled traffic going to or coming from other leaf switches. This loopback is called a Virtual Tunnel Endpoint or VTEP. The underlay routing protocol needs to provide reachability between all VTEP addresses. Reachability for anything other than VTEP addresses is usually extraneous information.
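
To make this concrete, the VTEP role of the loopback typically shows up in configuration as the source interface of the NVE (tunnel) interface. The snippet below is an illustrative IOS-XE style sketch, not part of the lab configuration; the overlay itself (VNIs, EVPN, and so on) is out of scope for this article.

interface Loopback0
 ip address 172.16.128.1 255.255.255.255
!
interface nve1
 no ip address
 source-interface Loopback0
 ! VNI membership and the overlay control plane are omitted; only the
 ! VTEP source address matters for the underlay discussion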

EIGRP Stability and Scalability Techniques

In data center fabrics you typically do not want leaf nodes to be transit devices. The EIGRP stub feature not only helps achieve this goal, but also limits the query domain during network convergence. The default stub behavior is to advertise only connected and summary routes, which goes a long way toward preventing transit traffic through the leaf layer. Another rule of EIGRP stub is that queries are not sent to stub routers. Making every leaf a stub means that the spines will not send queries downstream: leaves can query spines, but spines cannot query leaf switches. This provides rapid convergence. A spine may receive a query, but it can respond immediately without having to query any other router.
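
On the spine side, the stub relationship can be verified per neighbor. The output below is an abbreviated illustration of what show ip eigrp neighbors detail reports for a stub peer (device name and counters assumed, not captured from the lab); the interesting lines are the stub advertisement and the query suppression.

Spine-1#show ip eigrp neighbors detail
EIGRP-IPv4 VR(CLOS) Address-Family Neighbors for AS(14)
<snip>
   Stub Peer Advertising (CONNECTED SUMMARY ) Routes
   Suppressing queries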

Distribute lists are another technique to increase the scalability and stability of the underlay. Recall that we only need to provide reachability between VTEPs; a distribute list that permits only VTEP advertisements achieves this nicely. The exact same distribute list can be used on every node in the fabric. This way, leaf switches learn only VTEP addresses, and no transit link prefixes are announced in the fabric.

A config example is shown below. Every leaf has an identical EIGRP configuration; the only difference on the spines is that they are not configured as stub routers.

Leaf-1#show running-config | section eigrp
router eigrp CLOS
 !
 address-family ipv4 unicast autonomous-system 14
  !
  topology base
   distribute-list TEP_ONLY out 
  exit-af-topology
  network 172.16.0.0
  eigrp stub connected summary
 exit-address-family 
 
Leaf-1#show ip access-lists TEP_ONLY
Standard IP access list TEP_ONLY
    10 permit 172.16.128.0, wildcard bits 0.0.0.255 (6 matches)
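
For completeness, the access list referenced by the distribute list can be defined with a single entry covering the whole VTEP range:

ip access-list standard TEP_ONLY
 permit 172.16.128.0 0.0.0.255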

The routing table from a leaf is shown below; three equal-cost multipath routes to each VTEP are available for forwarding.

Leaf-1#show ip route eigrp | begin Gateway
Gateway of last resort is not set

      172.16.0.0/16 is variably subnetted, 14 subnets, 2 masks
D        172.16.128.2/32 [90/1536640] via 172.16.0.9, 01:11:11, Ethernet0/2
                         [90/1536640] via 172.16.0.5, 01:11:11, Ethernet0/1
                         [90/1536640] via 172.16.0.1, 01:11:11, Ethernet0/0
D        172.16.128.3/32 [90/1536640] via 172.16.0.9, 01:11:11, Ethernet0/2
                         [90/1536640] via 172.16.0.5, 01:11:11, Ethernet0/1
                         [90/1536640] via 172.16.0.1, 01:11:11, Ethernet0/0
D        172.16.128.4/32 [90/1536640] via 172.16.0.9, 01:11:11, Ethernet0/2
                         [90/1536640] via 172.16.0.5, 01:11:11, Ethernet0/1
                         [90/1536640] via 172.16.0.1, 01:11:11, Ethernet0/0
D        172.16.128.5/32 [90/1536640] via 172.16.0.9, 01:11:11, Ethernet0/2
                         [90/1536640] via 172.16.0.5, 01:11:11, Ethernet0/1
                         [90/1536640] via 172.16.0.1, 01:11:11, Ethernet0/0
D        172.16.128.6/32 [90/1536640] via 172.16.0.9, 01:11:11, Ethernet0/2
                         [90/1536640] via 172.16.0.5, 01:11:11, Ethernet0/1
                         [90/1536640] via 172.16.0.1, 01:11:11, Ethernet0/0
D        172.16.128.254/32 [90/1536640] via 172.16.0.9, 01:11:11, Ethernet0/2
                           [90/1536640] via 172.16.0.5, 01:11:11, Ethernet0/1
                           [90/1536640] via 172.16.0.1, 01:11:11, Ethernet0/0
D        172.16.128.255/32 [90/1536640] via 172.16.0.9, 01:11:11, Ethernet0/2
                           [90/1536640] via 172.16.0.5, 01:11:11, Ethernet0/1
                           [90/1536640] via 172.16.0.1, 01:11:11, Ethernet0/0

Convergence

To demonstrate how the fabric converges around link failure, I shut a link between Border-1 and one of the spines.
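
For reference, the failure was induced with a simple interface shutdown; the device and interface in the snippet below are illustrative, since the exact port is not important.

Border-1(config)#interface Ethernet0/1
Border-1(config-if)#shutdown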

The spine on the other end of the down link announces that the delay to the Border-1 VTEP is infinite, declaring that it cannot reach the address. This update is sent to all leaf switches. The update below was sent to Leaf-1. This particular update is sequence number 94.

Leaf-1 acknowledges the update by setting Ack # 94 in its next hello packet. The other leaf switches must also acknowledge.

The other event that occurred when Border-1 lost one of its uplinks is that it sent a query to the other two spines, hunting for the /30 prefix of its now shut down interface. The link subnet in this example is 172.16.0.80/30.

Both spines acknowledge the query. Each spine then realizes that all of its other neighbors are stubs and cannot be queried, so it sends back a query reply with infinite delay. The /30 transit prefix is deemed unreachable, and the network is fully converged around the link failure.

Query Acknowledgement in the Hello packet:  

Query reply marking the /30 as unreachable:

Not shown is the query reply being acknowledged by Border-1 via a Hello packet.

This minimal amount of update and query traffic should easily scale to accommodate the largest 3 stage fabrics.

The 5 Stage Fabric Problem

I also experimented with EIGRP in a 5 stage fabric. I thought I would be clever and use EIGRP stub with leak maps on the transit nodes to enable a query-free, huge scale EIGRP routed fabric. I found that in this configuration, stub routers would query other stub routers. This could pose a theoretical scale limitation in larger EIGRP fabrics. I may write a separate article about this condition in the future. A rough sketch of the kind of configuration I mean is below.
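
The transit node configuration I was experimenting with looks roughly like this. The route map and prefix list names are illustrative; the intent is to keep the transit node a stub (so it is never queried) while still leaking the VTEP routes it has learned to its own neighbors.

ip prefix-list TEP_PREFIXES seq 10 permit 172.16.128.0/24 le 32
!
route-map LEAK_TEPS permit 10
 match ip address prefix-list TEP_PREFIXES
!
router eigrp CLOS
 !
 address-family ipv4 unicast autonomous-system 14
  !
  eigrp stub connected summary leak-map LEAK_TEPS
 exit-address-family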