Back when I was a young engineer, "stretching networks" mostly meant either a Layer 2 circuit, or something like Cisco OTV. Nowadays, we have NSX-T. In this blog, I'm going to talk about the 5 things you absolutely should be thinking about before deploying (or even designing) NSX-T Federation.
What is NSX-T Federation?
In a nutshell, Federation is a feature within NSX-T that allows you to stretch networks and security policies across physical locations - and manage them under one umbrella, or more specifically, one NSX-T install (sort of - we'll talk about this later). Take a look at the diagram below.
As you can see, we have two locations: Miami and Atlanta. With Federation, there are NSX-T components at both sites, and we've got some networks stretched across sites, while others we've opted to keep local to their respective sites. The same is possible with security policy: we can implement some firewall rules in one location, some in the other, and some across both.
OK, so you get the idea about what federation is - but why would you want to do it? Some of the advantages of this design include:
Simplified Disaster Recovery
One of the biggest challenges with Disaster Recovery (DR) is having to re-IP workloads when you fail over. This undermines the whole idea of having a truly automated DR plan. By using Federation, you can make sure that the same subnet - say 192.168.5.0/24 - is available at both Site-A and Site-B. That way, when you have a failover event, your VMs can migrate to (or be spun up at) your standby site, and they'll be on the same subnet they were on originally. Taking this a step further, you can use SRM to orchestrate the movement of VMs, while NSX makes sure that your VMs land on the proper networks.
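To make the re-IP point concrete, here is a minimal Python sketch. It doesn't talk to NSX at all; it only checks whether a VM's existing address still fits the subnet available at the recovery site. The helper name `needs_reip`, the VM address, and the 10.20.5.0/24 subnet are hypothetical and made up for illustration. Only 192.168.5.0/24 comes from the example above.

```python
import ipaddress

def needs_reip(vm_ip: str, recovery_site_subnet: str) -> bool:
    """Return True if the VM's current IP does not fit the recovery site's subnet."""
    ip = ipaddress.ip_address(vm_ip)
    subnet = ipaddress.ip_network(recovery_site_subnet)
    return ip not in subnet

# No stretched segment: the standby site has its own (hypothetical) subnet.
print(needs_reip("192.168.5.25", "10.20.5.0/24"))    # True

# Stretched segment: the same subnet exists at both sites.
print(needs_reip("192.168.5.25", "192.168.5.0/24"))  # False
```

If the answer is False for every protected VM, the failover plan has no re-IP step left to automate.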
Running Active/Active workloads in both datacenters
This is very similar to the DR scenario - but in this case, we want VMs to be active at both sites. In many cases this means there'll be active workloads on the same network stretched across both sites, but with NSX you have the choice of what networks you stretch, so you're not forced into a corner here. Either way, the point is that you can now do an Active/Active setup across datacenters, which isn't super easy without NSX.
Simplified Firewall Rule management
While many organizations choose to implement Federation for the above reasons, some opt to implement it to simplify their firewall policies. If you have multiple locations and/or a public cloud presence (such as native AWS or native Azure), you're probably managing at least two or three different firewall rule sets and vendors. With NSX, you can implement Federation so that you can write a firewall rule once, and it'll apply everywhere. An example of this could be "Allow port 80 to all of the servers in the 'web' group." If you have a VM in Site A that is in the "web" group, it inherits this firewall rule. Likewise, if you have another at Site B, it inherits the same firewall rule.
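The "write once, apply everywhere" idea can be sketched in a few lines of Python. This is a toy model, not the NSX API: the `VM` and `Rule` classes, the `rules_for` helper, and the VM names are all hypothetical. The point is only that the rule is matched on group membership, so the site a VM lives in never enters the decision.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VM:
    name: str
    site: str
    groups: frozenset

@dataclass(frozen=True)
class Rule:
    name: str
    action: str
    port: int
    applies_to_group: str

def rules_for(vm, rules):
    """Return the rules a VM inherits, based only on its group membership."""
    return [r for r in rules if r.applies_to_group in vm.groups]

# One rule, written once: allow port 80 to everything in the "web" group.
allow_http = Rule("allow-http-to-web", "ALLOW", 80, "web")

vms = [
    VM("web-01", "Site A", frozenset({"web"})),
    VM("web-02", "Site B", frozenset({"web"})),
    VM("db-01", "Site A", frozenset({"db"})),
]

for vm in vms:
    inherited = [r.name for r in rules_for(vm, [allow_http])]
    print(vm.name, vm.site, inherited)
```

Both web VMs pick up the rule regardless of site, while the database VM does not.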
5 Things to know before you implement Federation
Now that we understand the "what" and "why" of Federation, let's get into my top 5 tips. By the way, if you find this helpful, I actually made an entire course on Federation where I walk you through all of these design decisions AND deploy Federation from scratch. You can check it out here.
Tip #1: Understand the difference between Federation and Multisite
Some people get confused with terminology here, so I want to clarify it a bit. Multisite is a design that NSX-T has supported for a while. In this design, we have NSX-T Management components at one site, but it is managing NSX logical routers/segments/security policy for two sites.
With Federation, it's very similar, but in this case, we have NSX-T Management components at both sites (see diagram in the next section), although we still manage everything from one interface. With this design, we can truly lose either site, and keep our network functional.
Tip #2: Learn "Global Manager" vs. "Local Manager"
With a standard NSX-T implementation, you have an NSX-T Manager cluster. In a federation design, there's some new terminology used here, and a new component. With federation, you deploy a cluster of 3 NSX-T Managers at every site. These are just standard NSX-T managers, but we now call them Local Managers.
Federation also introduces the concept of a Global Manager (GM), which is also just an NSX-T Manager cluster, but it has a special role: managing the Local Managers, or LMs. From a deployment standpoint, whether you're deploying LMs or GMs, they're all deployed from the same OVA - you just specify the role in the OVA settings.
The first thing you may be wondering is "geez, that's a lot of appliances!" Yes, it is! At a minimum, you need 3x LMs per site, and 3-6x GMs (depending on latency between sites).
In normal operation, the LMs can function independently, and you can apply site-specific configuration on the LM cluster within each site. If you want to make global configuration changes (such as stretching networks), you do that via the GM.
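To put the appliance count from this tip into numbers, here is a rough Python sketch. The function name `minimum_manager_appliances` is hypothetical, and it assumes my reading of "3-6x GMs": three appliances for a single GM cluster, six when you run two GM clusters. Treat it as back-of-the-napkin sizing, not an official calculator.

```python
def minimum_manager_appliances(site_count: int, gm_clusters: int = 1) -> dict:
    """Count manager appliances: 3 LMs per site plus 3 GMs per GM cluster."""
    if site_count < 1:
        raise ValueError("site_count must be at least 1")
    if gm_clusters not in (1, 2):
        # Assumption: 3-6 GMs means either one or two 3-node GM clusters.
        raise ValueError("gm_clusters must be 1 or 2")
    local_managers = 3 * site_count
    global_managers = 3 * gm_clusters
    return {
        "local_managers": local_managers,
        "global_managers": global_managers,
        "total": local_managers + global_managers,
    }

# Two sites (for example Miami and Atlanta) with one GM cluster.
print(minimum_manager_appliances(2, gm_clusters=1))
# Same two sites with two GM clusters.
print(minimum_manager_appliances(2, gm_clusters=2))
```

For the two-site example in this post, that works out to 9 appliances with one GM cluster and 12 with two, before you count a single Edge node.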
Tip #3: Don't forget latency (and MTU)
A common question I get is "what is the latency restriction for Federation?" Technically, the only latency requirement you need to be concerned with is the one between your Global Manager clusters. If you want them Active/Active, they'll need to be within a few milliseconds of each other. If your Global Manager clusters are going to be Active/Standby (one GM cluster at one site, one GM cluster at the other), the limit is 150ms between sites. All of this said, the bigger question is: what can your applications tolerate? If they can handle 100ms, then NSX will be fine with that as well!
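Here is a small Python sketch of that decision. The 150ms limit comes from the paragraph above. The 10ms value is purely my stand-in for "a few milliseconds" and is an assumption, so check the current VMware documentation for the real number. The function name `gm_options` is also hypothetical.

```python
from typing import List

ACTIVE_STANDBY_MAX_RTT_MS = 150.0  # limit quoted in the text above
ACTIVE_ACTIVE_MAX_RTT_MS = 10.0    # ASSUMPTION: placeholder for "a few milliseconds"

def gm_options(rtt_ms: float) -> List[str]:
    """Return the Global Manager layouts that fit a measured inter-site round-trip time."""
    if rtt_ms < 0:
        raise ValueError("rtt_ms cannot be negative")
    options = []
    if rtt_ms <= ACTIVE_ACTIVE_MAX_RTT_MS:
        options.append("active/active")
    if rtt_ms <= ACTIVE_STANDBY_MAX_RTT_MS:
        options.append("active/standby")
    return options

print(gm_options(4))    # ['active/active', 'active/standby']
print(gm_options(100))  # ['active/standby']
print(gm_options(200))  # []
```

An empty list means the link is outside both limits, at which point the latency conversation is no longer about NSX.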
Now, as far as MTU goes, this is an interesting one. With NSX-T, you're required to run a larger MTU (1600+) inside of a single site. With Federation, you're not required to run jumbo MTU between sites. After all, many times the inter-site connectivity is via the internet, where we can't enable jumbo MTU. That said, it is strongly recommended that you have a link you can enable jumbo MTU on, to avoid performance issues due to excessive packet fragmentation. So what's the takeaway? If you're doing stretched networks, aim for an MTU of 1700+ between sites.
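As a quick sanity check, the two MTU numbers from this tip can be turned into a tiny Python helper. The function name `check_mtu` and the message wording are hypothetical; the 1600 and 1700 thresholds are the ones quoted above. A low intra-site MTU is treated as an error because it is a requirement, while a low inter-site MTU is only a warning because it is a recommendation.

```python
INTRA_SITE_MIN_MTU = 1600     # required inside a single site, per the text
INTER_SITE_TARGET_MTU = 1700  # recommended between sites for stretched networks

def check_mtu(intra_site_mtu: int, inter_site_mtu: int) -> list:
    """Return a list of findings; an empty list means both values look fine."""
    findings = []
    if intra_site_mtu < INTRA_SITE_MIN_MTU:
        findings.append(
            f"ERROR: intra-site MTU {intra_site_mtu} is below the required {INTRA_SITE_MIN_MTU}"
        )
    if inter_site_mtu < INTER_SITE_TARGET_MTU:
        findings.append(
            f"WARNING: inter-site MTU {inter_site_mtu} is below the recommended "
            f"{INTER_SITE_TARGET_MTU}; expect fragmentation on stretched networks"
        )
    return findings

# Typical case: sites are fine internally, but the link between them is a plain 1500.
for finding in check_mtu(intra_site_mtu=1600, inter_site_mtu=1500):
    print(finding)
```

Running it with 1600 and 1500 prints a single warning, which is the "works, but expect fragmentation" situation described above.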
Tip #4: Spend some time learning Federation specifics
It might seem like a no-brainer, but a lot of people walk into federation assuming it's all the same. While a lot of configuration is very similar to regular NSX-T, there are new concepts that you should understand, especially as it relates to routing. My NSX Federation course covers most of this - which you can get here.
Another concept you'll want to become familiar with is RTEPs, or Remote Tunnel End Points. These are just special IPs assigned to NSX-T Edge nodes that are used exclusively for Federation inter-site communication, but you should understand their role well!
If you also want an excellent free resource, check out the NSX-T Multisite/Federation design guide here. Just be careful: the guide has both Multisite and Federation design recommendations in it, and you'll want to ignore the Multisite sections.
Tip #5: Understand the license requirements for NSX Federation
This one is pretty basic. If you want to deploy Federation, you'll need the NSX-T Enterprise Plus license. While you absolutely can deploy it solo, I do recommend working with VMware directly or a trusted partner - these deployments do carry lots of caveats, and having someone who has done it before is a wise choice.