VMware NSX-T Design Guide: Designing Environments with NSX-T

1 Introduction

This design guide provides guidance and best practices for designing environments that leverage the capabilities of VMware NSX-T®. It is targeted at virtualization and network architects interested in deploying NSX solutions.

1.1 How to Use This Document

This document is broken into several sections. The first section explains the architectural building blocks of NSX-T, providing a base understanding for installing and operating an NSX-T environment. For further details, review the complete NSX-T installation and administration guides.

The middle sections examine detailed use cases of network virtualization – sharing details of logical switching, logical routing, and traffic flows. They describe components and functionality utilized in the security use cases. These sections lay the groundwork to help understand and implement the design guidance described in the final section.

The final section details best practices for deploying an NSX-T environment. It offers guidance for a variety of factors including physical infrastructure considerations, compute node requirements, and variably-sized environments.

A list of additional resources is included in Appendix 2 at the end of this document. A glossary of terms is provided in Appendix 3.

1.2 Networking and Security Today

In the digital business era, organizations are increasingly building custom applications to drive core business and gain competitive advantages. The speed with which development teams deliver new applications and capabilities directly impacts organizational success and bottom line. This has placed increasing pressure on organizations to innovate quickly and has made developers central to this critical mission. As a result, the way developers create apps, and the way IT provides services for those apps, has been evolving.

Application Proliferation

With applications quickly emerging as the new business model, developers are under immense pressure to deliver apps in record time. This increasing need to deliver more apps in less time can drive developers to use public clouds or open source technologies. These solutions allow them to write and provision apps in a fraction of the time required with traditional methods.

Heterogeneity

This application proliferation has given rise to heterogeneous environments, with application workloads running inside VMs, containers, clouds, and bare metal. IT departments must maintain governance, security, and visibility for application workloads regardless of whether they reside on premises, in public clouds, or in clouds managed by third parties.

Cloud-centric Architectures

Cloud-centric architectures and approaches to building and managing applications are increasingly common because of their efficient development environments and fast delivery of applications. These cloud architectures can put pressure on networking and security infrastructure to integrate with private and public clouds. Logical networking and security must be highly extensible to adapt and keep pace with ongoing change.

Against this backdrop of increasing application needs, greater heterogeneity, and complexity of environments, IT must still protect applications and data while addressing the reality of an attack surface that is continuously expanding.

1.3 NSX-T Architecture Value and Scope

VMware NSX-T is designed to address application frameworks and architectures that have heterogeneous endpoints and technology stacks. In addition to vSphere, these environments may include other hypervisors, containers, bare metal, and public clouds. NSX-T allows IT and development teams to choose the technologies best suited for their particular applications. NSX-T is also designed for management, operations, and consumption by development organizations in addition to use by IT.

Figure 1‑1: NSX-T Anywhere Architecture

The NSX-T architecture is designed around four fundamental attributes. Figure 1‑1 depicts the universality of those attributes, which span any site, any cloud, and any endpoint device. This enables greater decoupling, not just at the infrastructure level (e.g., hardware, hypervisor), but also at the public cloud (e.g., AWS, Azure) and container level (e.g., Kubernetes, Pivotal), all while maintaining the four key attributes of the platform across those domains. Key value concepts and characteristics of the NSX-T architecture include:

  • Policy and Consistency: Allows policy to be defined once and realized as an end state via RESTful API, addressing the requirements of today’s automated environments. NSX-T maintains inventory and controls to enumerate desired outcomes across diverse domains.
  • Networking and Connectivity: Allows consistent logical switching and distributed routing across multiple vSphere and KVM nodes, without being tied to a compute manager/domain. The connectivity is further extended across containers and clouds via domain-specific implementations while still providing connectivity across heterogeneous endpoints.
  • Security and Services: Allows a unified security policy model, as with networking connectivity. This enables implementation of services such as load balancer, NAT, Edge firewall, and distributed firewall (DFW) across multiple compute domains. Providing consistent security between VM and container workloads is essential to assuring the integrity of the overall framework set forth by security operations.
  • Visibility: Allows consistent monitoring, metric collection, and flow tracing via a common toolset across compute domains. This is essential for operationalizing mixed workloads – VM-centric and container-centric – which typically use drastically different tools for completing similar tasks.

These attributes enable the heterogeneity, app-alignment, and extensibility required to support diverse requirements. Additionally, NSX-T supports DPDK libraries that offer line-rate stateful services (e.g., load balancers, NAT).

Heterogeneity

In order to meet the needs of heterogeneous environments, a fundamental requirement of NSX-T is to be compute-manager agnostic. As this approach mandates support for multi-hypervisor and/or multi-workloads, there is no 1:1 mapping of NSX to vCenter. When designing the management plane, control plane, and data plane components of NSX-T, special considerations were taken to enable flexibility, scalability, and performance.

The management plane was designed to be independent of any compute manager, including vSphere. The VMware NSX-T® Manager™ is fully independent; management of NSX functions happens directly – either programmatically or through the GUI.

The control plane architecture has been separated into two components – a centralized cluster and an endpoint-specific local component. This allows the control plane to scale as the localized implementation – both data plane implementation and security enforcement – is more efficient and allows for heterogeneous environments.

The data plane was designed to be normalized across various environments. NSX-T introduces a host switch that normalizes connectivity among various compute domains, including multiple VMware vCenter® instances, KVM, containers, and other off-premises implementations. This switch is referred to as the N-VDS.

App-aligned

NSX-T was built with the application as the key construct. Whether the app was built in a traditional waterfall model or developed in a newer microservices application framework, NSX-T treats networking and security consistently. This consistency extends across containers and multi-hypervisors on premises, then further into the public cloud. This functionality is first available for Amazon Web Services (AWS) and will extend to other clouds as well as on-premises connectivity solutions. This enables developers to focus on the platform that provides the most benefit while providing IT operational consistency across networking and security platforms.

Containers and Cloud Native Application Integration with NSX-T

The current era of digital transformation challenges IT in addressing directives to normalize security of applications and data, increase speed of delivery, and improve application availability. IT administrators realize that a new approach must be taken to maintain relevancy. Architecturally solving the problem by specifically defining connectivity, security, and policy as a part of application lifecycle is essential. Programmatic and automatic creation of network and switching segments based on application driven infrastructure is the only way to meet the requirements of these newer architectures.

NSX-T is designed to address the needs of these emerging application frameworks and architectures with heterogeneous endpoints and technology stacks. NSX allows IT and development teams to choose the technologies best suited for their particular applications. It provides a common framework to manage and increase visibility of environments that contain both VMs and containers. As developers embrace newer technologies like containers and the percentage of workloads running in public clouds increases, network virtualization must expand to offer a full range of networking and security services (e.g., LB, NAT, DFW, etc.) native in these environments. By providing seamless network virtualization for workloads running on either VMs or containers, NSX is now supporting multiple CaaS and PaaS solutions where container based applications exist.

Figure 1‑2: Programmatic Integration with Various PaaS and CaaS

The NSX-T Container Plug-in (NCP) is built to provide direct integration with a number of environments where container-based applications could reside. Container orchestrators (sometimes referred to as CaaS), such as Kubernetes (i.e., k8s), are ideal for NSX-T integration. Solutions that contain enterprise distributions of k8s (e.g., Red Hat OpenShift, Pivotal Container Service) are also supported with NSX-T. Additionally, NSX-T supports integration with PaaS solutions like Pivotal Cloud Foundry.

The primary component of NCP runs in a container and communicates with both NSX Manager and the Kubernetes API server (in the case of k8s/OpenShift). NCP monitors changes to containers and other resources. It also manages networking resources such as logical ports, switches, routers, and security groups for the containers through the NSX API.


Figure 1‑3: NCP Architecture

NSX Container Plugin: NCP is a software component in the form of a container image, typically run as a Kubernetes pod.

Adapter layer: NCP is built in a modular manner so that individual adapters can be added for a variety of CaaS and PaaS platforms.

NSX Infrastructure layer: Implements the logic that creates topologies, attaches logical ports, etc.

NSX API Client: Implements a standardized interface to the NSX API.
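To make the adapter layer and API client concrete, the following is a minimal, purely conceptual sketch. The class and method names (ContainerEvent, NsxApiClient, adapter_loop) are hypothetical stand-ins rather than the actual NCP code; the /api/v1/logical-ports endpoint follows the NSX-T Manager API, but the payload shown is a simplified assumption.

```python
import requests
from dataclasses import dataclass


@dataclass
class ContainerEvent:
    """Hypothetical event emitted by a CaaS/PaaS adapter (e.g., a k8s pod add/delete)."""
    kind: str        # "ADDED" or "DELETED"
    pod_name: str
    switch_id: str   # NSX logical switch backing the pod's namespace


class NsxApiClient:
    """Thin illustration of the 'NSX Manager API Client' layer."""
    def __init__(self, manager, user, password):
        self.base = f"https://{manager}/api/v1"
        self.auth = (user, password)

    def create_logical_port(self, switch_id, name):
        body = {"logical_switch_id": switch_id, "display_name": name, "admin_state": "UP"}
        r = requests.post(f"{self.base}/logical-ports", json=body,
                          auth=self.auth, verify=False)   # verify=False only for lab use
        r.raise_for_status()
        return r.json()["id"]


def adapter_loop(events, nsx: NsxApiClient):
    """Skeleton of the adapter layer: translate orchestrator events into NSX API calls."""
    for ev in events:
        if ev.kind == "ADDED":
            port_id = nsx.create_logical_port(ev.switch_id, ev.pod_name)
            print(f"attached {ev.pod_name} to logical port {port_id}")
```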

Multi-Cloud Architecture and NSX-T

When extended to workloads in public cloud, NSX-T provides a single pane of glass for networking and security policy management across private and multiple public clouds. NSX-T also provides full topology control over switching and routing in overlay mode and abstracts the limitations of underlying cloud provider networks.

From a visibility perspective, it provides a view into public cloud inventory such as VMs (e.g., instances) and networks (e.g., VPCs, VNETs). Since the same NSX-T deployment manages workloads in the public cloud, the entire infrastructure can be operated consistently on day two.

Figure 1‑4: Multi-cloud Architecture with NSX-T

The Cloud Service Manager provides an inventory view across multiple clouds, multiple workload accounts, and multiple VPCs/VNETs. The NSX-T Manager and Controllers manage policy consistency across multiple cloud deployments, including private cloud. The Public Cloud Gateway provides a localized NSX control plane in each VPC/VNET and is responsible for pushing policies down to each public cloud instance. In addition, the Public Cloud Gateway provides NSX-T services such as DHCP and NAT, similar to an on-premises NSX-T Edge. The NSX-T agent inside each instance provides a distributed data plane which includes distributed firewall enforcement, logical switching, and logical routing.

For further information on how NSX-T benefits cloud workloads, please visit NSXCLOUD and explore.

Extensible

The key architectural tenets of heterogeneity and app-alignment are inherently properties of extensibility, but full extensibility requires more. Extensibility also means the ability to support multi-tenant and domain environments along with integration into the DevOps workflow.

2 NSX-T Architecture Components

NSX-T reproduces the complete set of networking services (e.g., switching, routing, firewalling, QoS) in software. These services can be programmatically assembled in arbitrary combinations to produce unique, isolated virtual networks in a matter of seconds. NSX-T works by implementing three separate but integrated planes: management, control, and data. The three planes are implemented as sets of processes, modules, and agents residing on three types of nodes: manager, controller, and transport.

Figure 2‑1: NSX-T Architecture and Components

2.1 Management Plane

The management plane provides a single API entry point to the system. It is responsible for maintaining user configuration, handling user queries, and performing operational tasks on all management, control, and data plane nodes.

NSX-T Manager implements the management plane for the NSX-T ecosystem. It provides an aggregated system view and is the centralized network management component of NSX-T. NSX-T Manager is delivered in a virtual machine form factor and provides the following functionality:

  • Serves as a unique entry point for user configuration via RESTful API or NSX-T user interface.
  • Responsible for storing desired configuration in its database. The NSX-T Manager stores the final configuration request by the user for the system. This configuration will be pushed by the NSX-T Manager to the control plane to become a realized configuration (i.e., a configuration effective in the data plane).
  • Retrieves the desired configuration in addition to system information (e.g., statistics).

All the NSX-T components run a management plane agent (MPA) that connects them to the NSX-T Manager.
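As a concrete illustration of this single API entry point, the sketch below reads the desired configuration and inventory back from the Manager over the RESTful API. It assumes an NSX-T 2.x style Manager API (paths under /api/v1) with basic authentication; the manager address and credentials are placeholders.

```python
import requests

NSX_MANAGER = "nsx-mgr.example.com"   # hypothetical Manager address
AUTH = ("admin", "VMware1!")          # placeholder credentials

def nsx_get(path):
    """Issue a read against the NSX-T Manager API - the single entry point to the system."""
    url = f"https://{NSX_MANAGER}/api/v1/{path}"
    resp = requests.get(url, auth=AUTH, verify=False)   # verify=False only for lab use
    resp.raise_for_status()
    return resp.json()

# Desired configuration and system information are both retrieved from the same API.
for tz in nsx_get("transport-zones").get("results", []):
    print(tz["display_name"], tz.get("transport_type"))
for ls in nsx_get("logical-switches").get("results", []):
    print(ls["display_name"], ls.get("replication_mode"))
```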

2.2 Control Plane

The control plane computes the runtime state of the system based on configuration from the management plane. It is also responsible for disseminating topology information reported by the data plane elements and pushing stateless configuration to forwarding engines.

NSX-T splits the control plane into two parts:

  • Central Control Plane (CCP) – The CCP is implemented as a cluster of virtual machines called CCP nodes. The cluster form factor provides both redundancy and scalability of resources. The CCP is logically separated from all data plane traffic, meaning any failure in the control plane does not affect existing data plane operations. User traffic does not pass through the CCP cluster.
  • Local Control Plane (LCP) – The LCP runs on transport nodes. It is adjacent to the data plane it controls and is connected to the CCP. The LCP is responsible for programming the forwarding entries of the data plane.

2.3 Data Plane

The data plane performs stateless forwarding or transformation of packets based on tables populated by the control plane. It reports topology information to the control plane and maintains packet level statistics.

The transport nodes are the hosts running the local control plane daemons and forwarding engines implementing the NSX-T data plane. These are represented in Figure 2‑1 as an N-VDS. The N-VDS is responsible for switching packets according to the configuration of available network services. There are several types of transport nodes available in NSX-T:

  • N-VDS: NSX Virtual Distributed Switch is a generic software defined switch platform that is hypervisor independent. It is the primary component involved in the data plane of the transport nodes. The N-VDS forwards traffic between components running on the transport node (e.g., between virtual machines) or between internal components and the physical network.
  • Hypervisor Transport Nodes: Hypervisor transport nodes are hypervisors prepared and configured for NSX-T. The N-VDS provides network services to the virtual machines running on those hypervisors. NSX-T currently supports VMware ESXi™ and KVM hypervisors. The N-VDS implemented for KVM is based on the Open vSwitch (OVS) and is platform independent. It could be ported to other hypervisors and serves as the foundation for the implementation of NSX-T in other environments (e.g., cloud, containers, etc.).
  • Edge Nodes: VMware NSX® Edge™ nodes are service appliances dedicated to running network services that cannot be distributed to the hypervisors. They are grouped in one or several clusters, representing a pool of capacity.
  • L2 Bridge: The L2 bridge is an appliance responsible for bridging traffic between the NSX-T overlay and VLAN backed physical networks. An L2 bridge is implemented through a cluster of two ESXi hypervisors (active/standby on a per-VLAN basis) dedicated to bridging.

3 NSX-T Logical Switching

This section details how NSX-T creates virtual L2 networks, called logical switches (LS), to provide connectivity between its services and the different virtual machines in the environment.

3.1 The N-VDS

The primary component involved in the data plane of the transport nodes is the N-VDS. The N-VDS forwards traffic between components running on the transport node (e.g., between virtual machines) or between internal components and the physical network. In the latter case, the N-VDS must own one or more physical interfaces (pNICs) on the transport node. As with other virtual switches, an N-VDS cannot share a physical interface with another N-VDS. It may coexist with another N-VDS when using a separate set of pNICs.

On ESXi hypervisors, the N-VDS implementation is derived from VMware vSphere® Distributed Switch™ (VDS). The N-VDS is mandatory for overlay traffic; however, it can co-exist with a VDS in certain configurations. With KVM hypervisors, the N-VDS implementation is derived from the Open vSwitch (OVS). While N-VDS behavior in realizing connectivity is identical regardless of the specific implementation, data plane realization and enforcement capabilities differ based on the compute manager and associated hypervisor capability.

 3.1.1 Uplink vs. pNIC

The N-VDS introduces a clean differentiation between the pNICs of the transport node and the uplinks of the N-VDS. The uplinks of the N-VDS are logical constructs that can be mapped to a single pNIC or to multiple pNICs bundled into a link aggregation group (LAG). Figure 3‑1 illustrates the difference between an uplink and a pNIC:

 

Figure 3‑1: N-VDS Uplinks vs. Hypervisor pNICs

In this example, an N-VDS with two uplinks is defined on the hypervisor transport node. One of the uplinks is a LAG, bundling physical port p1 and p2, while the other uplink is only backed by a single physical port p3. Both uplinks look the same from the perspective of the N-VDS; there is no functional difference between the two.

 3.1.2 Teaming Policy

The teaming policy defines how the N-VDS uses its uplink for redundancy and traffic load balancing. There are two options for teaming policy configuration:

  • Failover Order – An active uplink is specified along with an optional list of standby uplinks. Should the active uplink fail, the next uplink in the standby list takes its place immediately.
  • Load Balanced Source – A list of active uplinks is specified, and each interface on the transport node is pinned to one active uplink. This allows use of several active uplinks at the same time.

The teaming policy only defines how the N-VDS balances traffic across its uplinks. The uplinks can be individual pNICs or LAGs. LAGs have their own load-balancing options, but they are configured as part of the uplink definition and thus do not show up in the N-VDS teaming policy.
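A minimal sketch of the two teaming behaviors, using hypothetical data structures (illustrative logic only, not the actual N-VDS implementation): failover order always uses the first healthy uplink in its ordered list, while load balanced source pins each source interface to one of the active uplinks.

```python
import zlib

def failover_order(ordered_uplinks, is_up):
    """Failover order: always use the first healthy uplink in [active, standby, ...]."""
    for uplink in ordered_uplinks:
        if is_up(uplink):
            return uplink
    return None

def load_balanced_source(active_uplinks, vnic_id, is_up):
    """Load balanced source: pin each source interface (vNIC) to one active uplink."""
    healthy = [u for u in active_uplinks if is_up(u)]
    if not healthy:
        return None
    return healthy[zlib.crc32(vnic_id.encode()) % len(healthy)]

# An uplink that is a LAG is considered down only when all of its members are down.
lag1_members = {"p1": False, "p2": True}
def is_up(uplink):
    return any(lag1_members.values()) if uplink == "lag1" else True

print(failover_order(["lag1", "u3"], is_up))                 # -> lag1 (one member still up)
print(load_balanced_source(["u1", "u2"], "web1-vnic0", is_up))
```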

Figure 3‑2: N-VDS Teaming Policies

Figure 3‑2 presents both teaming policy options, detailing how the traffic from two different VMs is distributed across uplinks. The uplinks of the N-VDS could be any combination of single pNICs or LAGs; this would not impact the way traffic is balanced. When an uplink is a LAG, it is only considered down when all the physical members of the LAG are down.

KVM hypervisor transport nodes can only have a single LAG and only support the failover order teaming policy; the load balance source teaming policy is not available for KVM. In order for more than one physical uplink to be active on an N-VDS on a KVM hypervisor, a LAG must be configured. Using the load balance source teaming policy, ESXi hypervisors can load balance traffic across several active uplinks simultaneously.

 3.1.3 Uplink Profile

The uplink profile is a template that defines how an N-VDS connects to the physical network. It specifies:

  • format of the uplinks of an N-VDS
  • teaming policy applied to those uplinks
  • transport VLAN used for overlay traffic
  • MTU of the uplinks

When an N-VDS is attached to the network, an uplink profile as well as the list of local interfaces corresponding to the uplink profile must be specified. Figure 3‑3 shows how a transport node “TN1” is attached to the network using the uplink profile “UPP1”.

Figure 3‑3: Uplink Profile

The uplink profile specifies a failover order teaming policy applied to the two uplinks “U1” and “U2”. Uplinks “U1” and “U2” are defined as LAGs consisting of two ports each. The profile also defines the transport VLAN for overlay traffic as “VLAN 100” as well as an MTU of 1700.

The designations “U1”, “U2”, “port1”, “port2”, etc. are simply variables in the template for uplink profile “UPP1”. When applying “UPP1” on the N-VDS of “TN1”, a value must be specified for those variables. In this example, the physical port “p1” on “TN1” is assigned to the variable “port1” on the uplink profile, pNIC “p2” is assigned to variable “port2”, and so on. By leveraging an uplink profile for defining the uplink configuration, it can be applied to several N-VDS spread across different transport nodes, giving them a common configuration. Applying the same “UPP1” to all the transport nodes in a rack enables a user to change the transport VLAN or the MTU for all those transport nodes by simply editing the uplink profile “UPP1”. This flexibility is critical for brownfield or migration cases.
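For illustration, uplink profile “UPP1” could be expressed roughly as the Manager API payload below. This is a hedged sketch: the endpoint (/api/v1/host-switch-profiles), the UplinkHostSwitchProfile resource type, and the field names are recalled from the NSX-T 2.x API and should be verified against the API reference for the release in use.

```python
import requests

uplink_profile_upp1 = {
    "resource_type": "UplinkHostSwitchProfile",
    "display_name": "UPP1",
    # Uplinks U1/U2 are variables in the template; each is a LAG of two member ports here.
    "lags": [
        {"name": "U1", "number_of_uplinks": 2, "mode": "ACTIVE", "load_balance_algorithm": "SRCDESTIPVLAN"},
        {"name": "U2", "number_of_uplinks": 2, "mode": "ACTIVE", "load_balance_algorithm": "SRCDESTIPVLAN"},
    ],
    "teaming": {
        "policy": "FAILOVER_ORDER",
        "active_list": [{"uplink_name": "U1", "uplink_type": "LAG"}],
        "standby_list": [{"uplink_name": "U2", "uplink_type": "LAG"}],
    },
    "transport_vlan": 100,   # VLAN carrying overlay (TEP) traffic
    "mtu": 1700,             # sized to accommodate GENEVE encapsulation overhead
}

resp = requests.post("https://nsx-mgr.example.com/api/v1/host-switch-profiles",
                     json=uplink_profile_upp1, auth=("admin", "VMware1!"), verify=False)
resp.raise_for_status()
print(resp.json()["id"])
```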

The uplink profile model also offers flexibility in N-VDS uplink configuration. Several profiles can be applied to different groups of switches. In the example in Figure 3‑4, uplink profiles “ESXi” and “KVM” have been created with different teaming policies. With these two separate uplink profiles, two different hypervisor transport nodes can be attached to the same network leveraging different teaming policies.

Figure 3‑4: Leveraging Different Uplink Profiles

In this example, “TN1” can leverage the load balance source teaming policy through its dedicated uplink profile. If it had to share its uplink configuration with “TN2” it would have to use the only teaming policy common to KVM and ESXi - in this instance “Failover Order”. The uplink profile model also allows for different transport VLANs on different hosts. This can be useful when the same VLAN ID is not available everywhere in the network.

 3.1.4 Transport Zones, Host Switch Name

In NSX-T, virtual layer 2 domains are called logical switches. Logical switches are defined as part of a transport zone. There are two types of transport zones:

  • Overlay transport zones
  • VLAN transport zones

The type of the logical switch is derived from the transport zone it belongs to. Connecting an N-VDS to the network is achieved by attaching it to one or more transport zones. Based on this attachment, the N-VDS has access to the logical switches defined within the transport zone(s). A transport zone defines the span of the virtual network, as logical switches only extend to N-VDS on the transport nodes attached to the transport zone.

Figure 3‑5: NSX-T Transport Zone

Figure 3‑5 represents a group of N-VDS - numbered 2 through N - attached to an overlay transport zone. “N-VDS1” is not attached to the same transport zone and thus has no access to the logical switches created within it.

Both the transport zones and N-VDS also have a “Host Switch Name” or “Edge Switch Name” field, depending on the kind of transport node on which the N-VDS is located; it is represented in the example as the “Name” field. This name allows grouping of N-VDS. An N-VDS can only attach to a transport zone with a matching name. In the diagram, “N-VDS2” through “N-VDSN”, as well as the overlay transport zone, are configured with the same name “Prod”. “N-VDS1”, with a name of “LAB”, cannot be attached to the transport zone with a name of “Prod”.

Other restrictions related to transport zones, illustrated in the sketch after this list:

  • An N-VDS can only attach to a single overlay transport zone.
  • An N-VDS can only attach to a single VLAN transport zone; however, a transport node can attach to multiple VLAN transport zones using multiple N-VDS.
  • An N-VDS can attach to both an overlay transport zone and a VLAN transport zone at the same time. In that case, those transport zones must have the same N-VDS name.
  • A transport node can only attach to a single overlay transport zone. Therefore, on a transport node, only a single N-VDS can attach to an overlay transport zone.
  • Multiple N-VDS and VDS can coexist on a transport node; however, a pNIC can only be associated with a single N-VDS or VDS.
  • Only one teaming policy can be applied to an entire N-VDS, so multiple N-VDS must be deployed if traffic segregation across uplinks is desired.
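A small sketch of the name-matching and attachment rules above, using hypothetical data structures (illustrative only, not an NSX-T API):

```python
from dataclasses import dataclass, field

@dataclass
class TransportZone:
    name: str       # "Host Switch Name" / "Edge Switch Name"
    tz_type: str    # "OVERLAY" or "VLAN"

@dataclass
class NVDS:
    name: str
    zones: list = field(default_factory=list)

    def attach(self, tz: TransportZone):
        if tz.name != self.name:
            raise ValueError("an N-VDS can only attach to a transport zone with a matching name")
        if any(z.tz_type == tz.tz_type for z in self.zones):
            raise ValueError(f"an N-VDS can attach to only one {tz.tz_type} transport zone")
        self.zones.append(tz)

prod_overlay = TransportZone("Prod", "OVERLAY")
prod_vlan = TransportZone("Prod", "VLAN")

nvds2 = NVDS("Prod")
nvds2.attach(prod_overlay)   # allowed
nvds2.attach(prod_vlan)      # allowed: one overlay plus one VLAN zone, same name

try:
    NVDS("LAB").attach(prod_overlay)
except ValueError as err:
    print(err)               # name mismatch - the "N-VDS1" case in Figure 3-5
```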

3.2 Logical Switching

This section on logical switching focuses on overlay backed logical switches due to their ability to create isolated logical L2 networks with the same flexibility and agility that exists with virtual machines. This decoupling of logical switching from the physical network infrastructure is one of the main benefits of adopting NSX-T.

 3.2.1 Overlay Backed Logical Switches

Figure 3‑6 presents logical and physical network views of a logical switching deployment.

 

Figure 3‑6: Overlay Networking – Logical and Physical View

In the upper part of the diagram, the logical view consists of five virtual machines that are attached to the same logical switch, forming a virtual broadcast domain. The physical representation, at the bottom, shows that the five virtual machines are running on hypervisors spread across three racks in a data center. Each hypervisor is an NSX-T transport node equipped with a tunnel endpoint (TEP). The TEPs are configured with IP addresses, and the physical network infrastructure provides IP connectivity - leveraging layer 2 or layer 3 - between them. The VMware® NSX-T Controller™ (not pictured) distributes the IP addresses of the TEPs so they can set up tunnels with their peers. The example shows “VM1” sending a frame to “VM5”. In the physical representation, this frame is transported via an IP point-to-point tunnel between transport node “HV1” and transport node “HV5”.

The benefit of this NSX-T overlay model is that it allows direct connectivity between transport nodes irrespective of the specific underlay inter-rack connectivity (i.e., L2 or L3). Virtual logical switches can also be created dynamically without any configuration of the physical network infrastructure.

 3.2.2 Flooded Traffic

The NSX-T logical switch behaves like a virtual Ethernet segment, providing the capability of flooding traffic to all the devices attached to this segment; this is a cornerstone capability of layer 2. NSX-T does not differentiate between the different kinds of frames replicated to multiple destinations. Broadcast, unknown unicast, or multicast traffic will be flooded in a similar fashion across a logical switch. In the overlay model, the replication of a frame to be flooded on a logical switch is orchestrated by the different NSX-T components. NSX-T provides two different methods for flooding traffic described in the following sections. They can be selected on a per logical switch basis.
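Since the flooding mode is chosen per logical switch at creation time, a creation call might look like the sketch below. It assumes the NSX-T 2.x Manager API, where the replication_mode field takes MTEP (two-tier hierarchical, the default) or SOURCE (head-end); the UUID and credentials are placeholders and the exact values should be checked against the API reference.

```python
import requests

body = {
    "display_name": "LS1",
    "transport_zone_id": "<overlay-tz-uuid>",   # placeholder: UUID of the overlay transport zone
    "admin_state": "UP",
    "replication_mode": "MTEP",   # "MTEP" = two-tier hierarchical, "SOURCE" = head-end
}
resp = requests.post("https://nsx-mgr.example.com/api/v1/logical-switches",
                     json=body, auth=("admin", "VMware1!"), verify=False)
resp.raise_for_status()
print(resp.json()["id"])
```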

 3.2.2.1 Head-End Replication Mode

In the head end replication mode, the transport node at the origin of the frame to be flooded on a logical switch sends a copy to each and every other transport node that is connected to this logical switch.

Figure 3‑7 offers an example of virtual machine “VM1” on hypervisor “HV1” attached to logical switch “LS1”. “VM1” sends a broadcast frame on “LS1”. The N-VDS on “HV1” floods the frame to the logical ports local to “HV1”, then determines that there are remote transport nodes part of “LS1”. The NSX-T Controller advertised the TEPs of those remote interested transport nodes, so “HV1” will send a tunneled copy of the frame to each of them.

Figure 3‑7: Head-end Replication Mode

The diagram illustrates the flooding process from the hypervisor transport node where “VM1” is located. “HV1” sends a copy of the frame that needs to be flooded to every peer that is interested in receiving this traffic. Each green arrow represents the path of a point-to-point tunnel through which the frame is forwarded. In this example, hypervisor “HV6” does not receive a copy of the frame. This is because the NSX-T Controller has determined that there is no recipient for this frame on that particular hypervisor.

In this mode, the burden of the replication rests entirely on the source hypervisor. Seven copies of the tunnel packet carrying the frame are sent over the uplink of “HV1”. This should be taken into account when provisioning the bandwidth on this uplink. A typical use case for this type of replication is when the underlay is an L2 fabric.

 3.2.2.2 Two-tier Hierarchical Mode

The two-tier hierarchical mode achieves scalability through a divide-and-conquer approach. In this mode, hypervisor transport nodes are grouped according to the subnet of the IP address of their TEP. Hypervisor transport nodes in the same rack typically share the same subnet for their TEP IPs, though this is not mandatory. Based on this assumption, Figure 3‑8 shows hypervisor transport nodes classified in three groups: subnet 10.0.0.0, subnet 20.0.0.0, and subnet 30.0.0.0. In this example, the IP subnets have been chosen to be easily readable; they are not intended as a real addressing plan.

Figure 3‑8: Two-tier Hierarchical Mode

Assume that “VM1” on “HV1” needs to send the same broadcast on “LS1” as in the previous section on head-end replication. Instead of sending an encapsulated copy of the frame to each remote transport node attached to “LS1”, the following process occurs:

  1. “HV1” sends a copy of the frame to all the transport nodes within its group (i.e., with a TEP in the same subnet as its TEP). In this case, “HV1” sends a copy of the frame to “HV2” and “HV3”.
  2. “HV1” sends a copy to a single transport node on each of the remote groups. For the two remote groups - subnet 20.0.0.0 and subnet 30.0.0.0 – “HV1” selects an arbitrary member of those groups and sends a copy there. In this example, “HV1” selected “HV5” and “HV7”.
  3. Transport nodes in the remote groups perform local replication within their respective groups. “HV5” relays a copy of the frame to “HV4” while “HV7” sends the frame to “HV8” and “HV9”. Note that “HV5” does not relay to “HV6” as it is not interested in traffic from “LS1”.

The source hypervisor transport node knows about the groups based on the information it has received from the NSX-T Controller. It does not matter which transport node is selected to perform replication in the remote groups so long as the remote transport node is up and available. If this were not the case (e.g., “HV7” was down), the NSX-T Controller would update all transport nodes attached to “LS1”. “HV1” would then choose “HV8” or “HV9” to perform the replication local to group 30.0.0.0.

In this mode, as with the head-end replication example, seven copies of the flooded frame have been made in software, though the cost of the replication has been spread across several transport nodes. It is also interesting to understand the traffic pattern on the physical infrastructure. The benefit of the two-tier hierarchical mode is that only two tunnel packets were sent between racks, one for each remote group. This is a significant improvement in inter-rack fabric utilization - where bandwidth is typically more constrained than within a rack - compared to the head-end mode, which sent five packets. That number could be higher still if there were more transport nodes interested in flooded traffic for “LS1” on the remote racks. Note also that this traffic optimization benefit of the two-tier hierarchical mode only applies to environments where TEPs have their IP addresses in different subnets. In a flat layer 2 network, where all the TEPs have their IP addresses in the same subnet, the two-tier hierarchical replication mode would lead to the same traffic pattern as the source replication mode.
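The fan-out difference between the two modes can be worked through with a small sketch over a hypothetical TEP inventory mirroring Figure 3‑8 (grouping by /8 purely for readability; this is not NSX-T code):

```python
import ipaddress
from collections import defaultdict

# TEPs of the remote transport nodes interested in LS1 (HV6 excluded, as in the example).
remote_teps = ["10.0.0.2", "10.0.0.3",               # same group as HV1 (10.0.0.1)
               "20.0.0.1", "20.0.0.2",               # rack 2
               "30.0.0.1", "30.0.0.2", "30.0.0.3"]   # rack 3
source_tep = "10.0.0.1"

def group_of(tep):
    """Group TEPs by subnet (assumed /8 here to match the readable example addresses)."""
    return ipaddress.ip_interface(f"{tep}/8").network

groups = defaultdict(list)
for tep in remote_teps:
    groups[group_of(tep)].append(tep)

# Head-end: the source sends one encapsulated copy per interested remote TEP.
head_end_copies = len(remote_teps)                               # 7

# Two-tier: one copy per local peer, plus one copy to a single proxy per remote group.
local_peers = groups[group_of(source_tep)]
remote_groups = {g: t for g, t in groups.items() if g != group_of(source_tep)}
two_tier_copies_from_source = len(local_peers) + len(remote_groups)   # 2 + 2 = 4
inter_rack_packets = len(remote_groups)                               # 2 (vs 5 for head-end)

print(head_end_copies, two_tier_copies_from_source, inter_rack_packets)
```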

The default two-tier hierarchical flooding mode is recommended as a best practice as it typically performs better in terms of physical uplink bandwidth utilization and reduces CPU utilization.

3.2.3 Unicast Traffic

At layer 2 when a frame is destined to an unknown MAC address, it is flooded in the network. Switches typically implement a MAC address table, or filtering database (FDB), that associates MAC addresses to ports in order to prevent flooding. When a frame is destined to a unicast MAC address known in the MAC address table, it is only forwarded by the switch to the corresponding port.

The N-VDS maintains such a table for each logical switch it is attached to. A MAC address can be associated either with the virtual NIC (vNIC) of a locally attached VM or with a remote TEP when the MAC address belongs to a VM on a remote transport node, reached via the tunnel identified by that TEP.
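A minimal sketch of such a per-logical-switch MAC table (hypothetical structures, not the actual N-VDS code): each entry points either at a local vNIC port or at a remote TEP, and a miss results in flooding.

```python
# MAC table for one logical switch on one transport node.
mac_table_ls1 = {
    "Mac1": {"type": "vnic", "port": "Web1-vnic0"},      # locally attached VM
    "Mac3": {"type": "tep",  "tep_ip": "30.30.30.3"},    # VM reachable behind a remote TEP
}

def forward(dst_mac, frame):
    entry = mac_table_ls1.get(dst_mac)
    if entry is None:
        return ("flood", frame)                    # unknown unicast: flood the segment
    if entry["type"] == "vnic":
        return ("deliver-local", entry["port"])    # hand the frame to the local vNIC
    return ("encapsulate", entry["tep_ip"])        # tunnel the frame toward the remote TEP

print(forward("Mac1", b"..."))   # ('deliver-local', 'Web1-vnic0')
print(forward("Mac3", b"..."))   # ('encapsulate', '30.30.30.3')
```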

Figure 3‑9 illustrates virtual machine “Web3” sending a unicast frame to another virtual machine “Web1” on a remote hypervisor transport node. In this example, the MAC address tables of the N-VDS on both the source and destination hypervisor transport nodes are fully populated.

Figure 3‑9: Unicast Traffic between VMs

  1. “Web3” sends a frame to “Mac1”, the MAC address of the vNIC of “Web1”.
  2. The N-VDS on “HV3” receives the frame and performs a lookup for the destination MAC address in its MAC address table. There is a hit. “Mac1” is associated to the “TEP1” on “HV1”.
  3. The N-VDS on “HV3” encapsulates the frame and sends it on the overlay to “TEP1”.
  4. The N-VDS on “HV1” receives the tunnel packet, decapsulates the frame, and performs a lookup for the destination MAC. “Mac1” is also a hit there, pointing to the vNIC of “Web1”. The frame is then delivered to its final destination.

This mechanism allows known unicast traffic to be delivered directly to its destination without flooding.

4 NSX-T Logical Routing

The logical routing capability in the NSX-T platform provides the ability to interconnect both virtual and physical workloads deployed in different logical L2 networks. NSX-T enables the creation of network elements like switches and routers in software as logical constructs and embeds them in the hypervisor layer, abstracted from the underlying physical hardware.

Since these network elements are logical entities, multiple logical routers can be created in an automated and agile fashion.

The previous chapter showed how to create logical switches; this chapter focuses on how logical routers provide connectivity between different logical L2 networks.

Figure 4‑1 shows both logical and physical view of a routed topology connecting logical switches on multiple hypervisors. Virtual machines “Web1” and “Web2” are connected to “Logical Switch 1” while “App1” and “App2” are connected to “Logical Switch 2”.

Figure 4‑1: Logical and Physical View of Routing Services

In a data center, traffic is categorized as east-west (E-W) or north-south (N-S) based on the origin and destination of the flow. When virtual or physical workloads in a data center communicate with the devices external to the datacenter (e.g., WAN, Internet), the traffic is referred to as north-south traffic. The traffic between workloads confined within the data center is referred to as east-west traffic. In modern data centers, more than 70% of the traffic is east-west.

For a multi-tiered application where the web tier needs to talk to the app tier and the app tier needs to talk to the database tier, these different tiers sit in different subnets. Every time a routing decision is made, the packet is sent to a router. Traditionally, a centralized router would provide routing for these different tiers. With VMs hosted on the same ESXi or KVM hypervisor, traffic would leave the hypervisor multiple times to reach the centralized router for a routing decision, then return to the same hypervisor; this is not optimal.

NSX-T is uniquely positioned to solve these challenges as it can bring networking closest to the workload. Configuring a logical router via NSX-T Manager instantiates a logical router on each hypervisor. For the VMs hosted (e.g., “Web 1”, “App 1”) on the same hypervisor, the E-W traffic does not need to leave the hypervisor for routing.

4.1 Logical Router Components

A logical router consists of two components: distributed router (DR) and services router (SR).

4.1.1 Distributed Router (DR)

A DR is essentially a router with logical interfaces (LIFs) connected to multiple subnets. It runs as a kernel module and is distributed in hypervisors across all transport nodes, including Edge nodes.

When a logical router is created through NSX-T Manager or the API, the management plane validates and stores the configuration. It then pushes this configuration via the RabbitMQ message bus to the CCP nodes, which in turn push it to the LCPs on all the transport nodes. A DR instance is instantiated as a kernel module. Figure 4‑2 diagrams the workflow of DR creation.

Figure 4‑2: End-to-end Communication Flow for Management, Control, and Data Planes

The traditional data plane functionality of routing and ARP lookups is performed by the logical interfaces connecting to the different logical switches. Each LIF has a vMAC address and an IP address representing the default IP gateway for its logical L2 segment. The IP address is unique per LIF and remains the same anywhere the logical switch exists. The vMAC associated with each LIF remains constant in each hypervisor, allowing the default gateway IP and MAC to remain the same during vMotion. The workflow for creating a logical interface (LIF) remains the same (i.e., MP->CCP->LCP), and an interface is created on the DR on every transport node.

The left side of Figure 4‑3 shows that the DR has been configured with two LIFs - 172.16.10.1/24 and 172.16.20.1/24. The right side shows that the DR is instantiated as a kernel module – rather than a VM – on two hypervisors, each with the same IP addresses.

Figure 4‑3: E-W Routing with Workload on the same Hypervisor

East-West Routing - Distributed Routing with Workloads on the Same Hypervisor

In this example, VMs “Web1” and “App1” are hosted on the same hypervisor. The logical router and two LIFs, connected to “Web-LS” and “App-LS” with IP addresses of 172.16.10.1/24 and 172.16.20.1/24 respectively, have been created via NSX-T Manager. Since “Web1” and “App1” are both hosted on hypervisor “HV1”, routing between them happens on the DR located on that same hypervisor.

Figure 4‑4 presents the logical packet flow between two VMs on the same hypervisor.

Figure 4‑4: Packet Flow between two VMs on same Hypervisor

  1. “Web1” (172.16.10.11) sends a packet to “App1” (172.16.20.11). The packet is sent to the default gateway interface (172.16.10.1) for “Web1” located on the local DR.
  2. The DR on “HV1” performs a routing lookup which determines that the destination subnet 172.16.20.0/24 is a directly connected subnet on “LIF2”. A lookup is performed in the “LIF2” ARP table to determine the MAC address associated with the IP address for “App1”. If the ARP entry does not exist, the controller is queried. If there is no response from controller, the frame is flooded to learn the MAC address of “App1”.
  3. Once the MAC address of “App1” is learned, the L2 lookup is performed in the local MAC table to determine how to reach “App1” and the packet is sent.
  4. The return packet from “App1” follows the same process and routing would happen again on the local DR.

In this example, neither the initial packet from “Web1” to “App1” nor the return packet from “App1” to “Web1” ever left the hypervisor as part of routing.
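The DR's connected-route and ARP lookups in steps 2-3 can be sketched with the ipaddress module (illustrative only; the LIF names and tables are hypothetical and mirror Figure 4‑3):

```python
import ipaddress

# LIFs of the DR as configured in Figure 4-3; each LIF is a connected route.
dr_lifs = {
    "LIF1": ipaddress.ip_network("172.16.10.0/24"),
    "LIF2": ipaddress.ip_network("172.16.20.0/24"),
}
# Per-LIF ARP tables learned by the DR (a miss triggers a controller query or flooding).
arp_tables = {"LIF2": {"172.16.20.11": "MAC-App1"}}

def dr_route(dst_ip):
    """Return (egress LIF, next-hop MAC) for a destination on a directly connected LIF."""
    dst = ipaddress.ip_address(dst_ip)
    for lif, subnet in dr_lifs.items():
        if dst in subnet:
            mac = arp_tables.get(lif, {}).get(dst_ip)   # None would mean ARP resolution
            return lif, mac
    return None, None   # not connected: fall back to the default route toward the SR

print(dr_route("172.16.20.11"))   # ('LIF2', 'MAC-App1')
```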

East-West Routing - Distributed Routing with Workloads on Different Hypervisor

In this example, the target workload “App2” differs as it rests on a hypervisor named “HV2”. If “Web1” needs to communicate with “App2”, the traffic would have to leave the hypervisor “HV1” as these VMs are hosted on two different hypervisors. Figure 4‑5 shows a logical view of topology, highlighting the routing decisions taken by the DR on “HV1” and the DR on “HV2”.

When “Web1” sends traffic to “App2”, routing is done by the DR on “HV1”. The reverse traffic from “App2” to “Web1” is routed by “HV2”.

Figure 4‑5: E-W Packet Flow between two Hypervisors

Figure 4‑6 shows the corresponding physical topology and packet walk from “Web1” to “App2”.

Figure 4‑6: End-to-end E-W Packet Flow

  1. “Web1” (172.16.10.11) sends a packet to “App2” (172.16.20.12). The packet is sent to the default gateway interface (172.16.10.1) for “Web1” located on the local DR. Its L2 header has the source MAC as “MAC1” and destination MAC as the vMAC of the DR. This vMAC will be the same for all LIFs.
  2. The routing lookup happens on the “ESXi” DR, which determines that the destination subnet 172.16.20.0/24 is a directly connected subnet on “LIF2”. A lookup is performed in “LIF2” ARP table to determine the MAC address associated with the IP address for “App2”. This destination MAC, “MAC2”, is learned via the remote “KVM” TEP 20.20.20.20.
  3. “ESXi” TEP encapsulates the original packet and sends it to the “KVM” TEP with a source IP address of 10.10.10.10 and destination IP address of 20.20.20.20 for the encapsulating packet. The destination virtual network identifier (VNI) in the GENEVE encapsulated packet belongs to “App LS”.
  4. “KVM” TEP 20.20.20.20 decapsulates the packet, removing the outer header upon reception. It performs an L2 lookup in the local MAC table associated with “LIF2”.
  5. Packet is delivered to “App2” VM.

The return packet from “App2” destined for “Web1” goes through the same process. For the return traffic, the routing lookup happens on the “KVM” DR. This represents the normal behavior of the DR, which is to always perform local routing on the DR instance running in the kernel of the hypervisor hosting the workload that initiates the communication. After routing lookup, the packet is encapsulated by the “KVM” TEP and sent to the remote “ESXi” TEP. It decapsulates the transit packet and delivers the original IP packet from “App2” to “Web1”.

4.1.2 Services Router

East-west routing is completely distributed in the hypervisor, with each hypervisor in the transport zone running a DR in its kernel. However, some facets of NSX-T are not distributed, including:

  • Physical infrastructure connectivity
  • NAT
  • DHCP server
  • Metadata Proxy for OpenStack
  • Edge Firewall

A services router (SR) – also referred to as a services component – is instantiated when a service is enabled that cannot be distributed on a logical router.

A centralized pool of capacity is required to run these services in a highly-available and scale-out fashion. The appliances where the centralized services or SR instances are hosted are called Edge nodes. An Edge node is the appliance that provides connectivity to the physical infrastructure.

Figure 4‑7 contains the logical view of a logical router showing both DR and SR components when connected to a physical router.

 

Figure 4‑7: Logical Router Components and Interconnection

The logical router in the figure contains the following interfaces:

  • Uplink (Ext) – Interface connecting to the physical infrastructure/router. Static routing and eBGP are supported on this interface. “Ext” is used to denote external connectivity and to avoid confusion with the term “uplink”, which otherwise refers to an uplink of an N-VDS or VDS.
  • Downlink – Interface connecting to a logical switch.
  • Intra-Tier Transit Link – Internal link between the DR and SR. This link, along with a logical switch prefixed with “transit-bp”, is created automatically and defaults to an IP address in the 169.254.0.0/28 subnet. The address range may be changed if it is used elsewhere in the network.

The management plane configures a default route on the DR with the next hop IP address of the SR’s intra-tier transit link IP. This allows the DR to take care of E-W routing while the SR provides N-S connectivity to all the subnets connected to the DR. The management plane also creates routes on the SR for the subnets connected to the DR with a next hop IP of the DR’s intra-tier transit link.
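These auto-plumbed routes can be summarized as simple table sketches (the transit addresses below are examples drawn from the default 169.254.0.0/28 subnet mentioned above; this is an illustration, not output from the product):

```python
# Intra-tier transit link addresses (examples from the default 169.254.0.0/28 subnet).
DR_TRANSIT_IP = "169.254.0.1"
SR_TRANSIT_IP = "169.254.0.2"

# DR: E-W connected routes plus a default route pointing at the SR for N-S traffic.
dr_routing_table = {
    "172.16.10.0/24": "connected (LIF1)",
    "172.16.20.0/24": "connected (LIF2)",
    "0.0.0.0/0":      f"via {SR_TRANSIT_IP} (SR)",
}

# SR: routes back to every DR-connected subnet, plus whatever it learns via eBGP/static.
sr_routing_table = {
    "172.16.10.0/24":   f"via {DR_TRANSIT_IP} (DR)",
    "172.16.20.0/24":   f"via {DR_TRANSIT_IP} (DR)",
    "192.168.100.0/24": "via 192.168.240.1 (eBGP, uplink)",
}
```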

North-South Routing by SR Hosted on Edge Node

From a physical topology perspective, workloads are hosted on hypervisors and N-S connectivity is provided by Edge nodes. If a device external to the data center needs to communicate with a virtual workload hosted on one of the hypervisors, the traffic would have to come to the Edge nodes first. This traffic will then be sent on an overlay network to the hypervisor hosting the workload. Figure 4‑8 diagrams the traffic flow from a VM in the data center to an external physical infrastructure.

Figure 4‑8: N-S Routing Packet Flow

Figure 4‑9 shows a detailed packet walk from data center VM “Web1” to a device on the L3 physical infrastructure. As discussed in the E-W routing section, routing always happens closest to the source. In this example, eBGP peering has been established between the physical router interface (192.168.240.1) and the uplink of the SR hosted on the Edge node (192.168.240.3). The physical router knows about the 172.16.10.0/24 prefix internal to the data center, and the SR hosted on the Edge node knows how to reach the 192.168.100.0/24 prefix.

Figure 4‑9: End-to-end Packet Flow – Application “Web1” to External

  1. “Web1” (172.16.10.11) sends a packet to 192.168.100.10. The packet is sent to the “Web1” default gateway interface (172.16.10.1) located on the local DR.
  2. The packet is received on the local DR. The destination 192.168.100.10 is external to data center, so the packet needs to go to the Edge node that has connectivity to the physical infrastructure. The DR has a default route with the next hop as its corresponding SR, which is hosted on the Edge node. After encapsulation, the DR sends this packet to the SR.
  3. The Edge node is also a transport node. It will encapsulate/decapsulate the traffic sent to or received from compute hypervisors. The “ESXi” TEP encapsulates the original packet and sends it to the Edge node TEP with a source IP address of 10.10.10.10 and destination IP address of 30.30.30.30.
  4. The Edge node TEP decapsulates the packet, removing the outer header prior to sending it to the SR.
  5. The SR performs a routing lookup and determines that the route 192.168.100.0/24 is learned via the uplink port with a next hop IP address 192.168.240.1.
  6. The packet is sent on an external VLAN to the physical router and is finally delivered to 192.168.100.10.

Observe that routing and ARP lookup happened on the DR hosted on the ESXi hypervisor in order to determine that packet must be sent to the SR. No such lookup was required on the DR hosted on the Edge node. After removing the tunnel encapsulation on the Edge node, the packet was sent directly to SR.

Figure 4‑10 follows the packet walk for the reverse traffic from an external device to “Web1”.

Figure 4‑10: End-to-end Packet Flow – External to Application “Web1”

  1. An external device (192.168.100.10) sends a packet to “Web1” (172.16.10.11). The packet is routed by the physical router and sent to the uplink port of the SR.
  2. Routing lookup happens on the SR which determines that 172.16.10.0/24 is reachable via the DR. Traffic is sent to the local DR via the intra-tier transit link between the SR and DR.
  3. The DR performs a routing lookup which determines that the destination subnet 172.16.10.0/24 is a directly connected subnet on “LIF1”. A lookup is performed in the “LIF1” ARP table to determine the MAC address associated with the IP address for “Web1”. This destination MAC “MAC1” is learned via the remote TEP (10.10.10.10), which is the “ESXi” host where “Web1” is located.
  4. The Edge TEP encapsulates the original packet and sends it to the remote TEP with an outer packet source IP address of 30.30.30.30 and destination IP address of 10.10.10.10. The destination VNI in this GENEVE-encapsulated packet is of “Web LS”.
  5. The “ESXi” host decapsulates the packet and removes the outer header upon receiving the packet. An L2 lookup is performed in the local MAC table associated with “LIF1”.
  6. The packet is delivered to Web1.

This time, routing and ARP lookup happened on the DR hosted on the Edge node. No such lookup was required on the DR hosted on the “ESXi” hypervisor, and the packet was sent directly to the VM after removing the tunnel encapsulation header.

4.2 Two-Tier Routing

In addition to providing optimized distributed and centralized routing functions, NSX-T supports a multi-tiered routing model with logical separation between provider router function and tenant routing function. The concept of multi-tenancy is built into the routing model. The top-tier logical router is referred to as tier-0 while the bottom-tier logical router is tier-1. This structure gives both provider and tenant administrators complete control over their services and policies. The provider administrator controls and configures tier-0 routing and services, while the tenant administrators control and configure tier-1.

Configuring two-tier routing is not mandatory; it can be single-tiered, as shown in the previous N-S routing section. There are several advantages of a multi-tiered design, which will be discussed in later parts of the design guide. Figure 4‑11 presents an NSX-T two-tier routing architecture.

Figure 4‑11: Two Tier Routing and Scope of Provisioning

Northbound, the tier-0 LR connects to one or more physical routers/L3 switches and serves as an on/off ramp to the physical infrastructure. Southbound, the tier-0 LR connects to one or more tier-1 LRs or directly to one or more logical switches, as shown in the north-south routing section. In the north-south routing section, the LR used to connect to the physical router is a tier-0 LR. Northbound, the tier-1 LR connects to a tier-0 LR. Southbound, it connects to one or more logical switches.

This model also eliminates the dependency on a physical infrastructure administrator to configure or change anything on the physical infrastructure when a new tenant is configured in the data center. For a new tenant, the tier-0 LR simply advertises the new tenant routes learned from the tenant tier-1 LR on the established routing adjacency with the physical infrastructure. Concepts of DR/SR discussed in the logical router section remain the same for multi-tiered routing. When a user creates a tier-1 or a tier-0 LR, a DR instance is instantiated on all the hypervisors and Edge nodes.

If a centralized service is configured on either a tier-0 or tier-1 LR, a corresponding SR is instantiated on the Edge node. When a tier-0 LR is connected to physical infrastructure, a tier-0 SR is instantiated on the Edge node. Similarly, when a centralized service like NAT is configured on a tier-1 LR, a tier-1 SR is instantiated on the Edge node.

4.2.1 Interface Types on Tier-1 and Tier-0 Logical Routers

Uplink and downlink interfaces were previously introduced in the services router section. Figure 4‑12 shows these interface types along with a new routerlink type in a two-tiered topology.

Figure 4‑12: Anatomy of Components with Logical Routing

  • Uplink (Ext): Interface connecting to the physical infrastructure/router. Static routing and eBGP are supported on this interface. Bidirectional Forwarding Detection (BFD) is supported for both static routing and eBGP. ECMP is also supported on this interface.
  • Downlink: Interface connecting logical switches on tier-0 or tier-1 LRs.
  • Routerlink: Interface connecting tier-0 and tier-1 LRs. Each tier-0-to-tier-1 peer connection is provided a /31 subnet within the 100.64.0.0/10 reserved address space (RFC 6598). This link is created automatically when the tier-0 and tier-1 routers are connected; a sketch of this allocation follows the list.
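The sketch below shows how /31 routerlink subnets can be carved out of the RFC 6598 100.64.0.0/10 block. It is illustrative allocation logic only, not the actual NSX-T allocator:

```python
import ipaddress

# Successive /31s carved from the reserved block used for routerlinks.
routerlink_pool = ipaddress.ip_network("100.64.0.0/10").subnets(new_prefix=31)

def allocate_routerlink():
    """Return (subnet, tier-0 side IP, tier-1 side IP) for a new tier-0 to tier-1 connection."""
    subnet = next(routerlink_pool)
    t0_ip, t1_ip = tuple(subnet)     # the two usable addresses of a /31 (RFC 3021)
    return subnet, t0_ip, t1_ip

print(allocate_routerlink())   # (100.64.0.0/31, 100.64.0.0, 100.64.0.1)
print(allocate_routerlink())   # (100.64.0.2/31, 100.64.0.2, 100.64.0.3)
```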

4.2.2 Route Types on Tier-1 and Tier-0 Logical Routers

There is no dynamic routing between tier-1 and tier-0 LRs. The NSX-T platform takes care of the auto-plumbing between LRs at different tiers. The following list details route types on tier-1 and tier-0 LRs.

  • Tier-1
    • Static – Static routes configured by user via NSX-T Manager.
    • NSX Connected – Directly connected LIF interfaces on the LR. 172.16.10.0/24 and 172.16.20.0/24 are NSX Connected routes for the tier-1 LR in Figure 4-12.
    • Tier-1 NAT – NAT IP addresses owned by the tier-1 LR discovered from NAT rules configured on the tier-1 LR. These are advertised to the tier-0 LR as /32 routes.
  • Tier-0
    • Static – Static routes configured by user via NSX-T Manager.
    • NSX Connected – Directly connected LIF interfaces on the LR. 172.16.30.0/24 is an NSX Connected route for the tier-0 LR in the Figure 4-12.
    • NSX Static – There is no dynamic routing on the routerlink between the tier-1 and tier-0 LR. To provide reachability to subnets connected to the tier-1 LR, the management plane configures static routes for all the LIFs connected to the tier-1 LR with a next hop IP address of the tier-1 LR routerlink IP. 172.16.10.0/24 and 172.16.20.0/24 are seen as NSX Static routes on the tier-0 LR. The next hop for these two routes would be the routerlink IP address of the tier-1 LR.
    • Tier-0 NAT – NAT IP addresses owned by the tier-0 LR discovered from NAT rules configured on tier-0 LR.
    • eBGP – Routes learned via an eBGP neighbor.

Route Advertisement on the Tier-1 and Tier-0 Logical Router

As discussed, the tier-0 LR provides N-S connectivity and connects to the physical routers. The tier-0 LR could use static routing or eBGP to connect. The tier-1 LR cannot connect to physical routers directly; it has to connect to a tier-0 LR to provide N-S connectivity to the subnets attached to it. Figure 4‑13 explains the route advertisement on both the tier-1 and tier-0 LR.

Figure 4‑13: Routing Advertisement

“Tier-0 LR” sees 172.16.10.0/24 and 172.16.20.0/24 as NSX Static routes and 172.16.30.0/24 as an NSX Connected route. As soon as “Tier-1 LR” is connected to “Tier-0 LR”, the management plane configures a default route on “Tier-1 LR” with the next hop IP address set to the routerlink IP of “Tier-0 LR”.

Northbound, “Tier-0 LR” redistributes the NSX Connected and NSX Static routes into eBGP and advertises them to its eBGP neighbor, the physical router.
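The resulting routing and advertisement state from Figure 4‑13 can be summarized with a small sketch (illustrative only; route types and prefixes follow the figure):

```python
# Routes as seen on "Tier-0 LR" (Figure 4-13).
tier0_routes = {
    "172.16.10.0/24": {"type": "NSX Static",    "next_hop": "tier-1 routerlink IP"},
    "172.16.20.0/24": {"type": "NSX Static",    "next_hop": "tier-1 routerlink IP"},
    "172.16.30.0/24": {"type": "NSX Connected", "next_hop": "directly connected LIF"},
}

# Northbound, NSX Connected and NSX Static routes are redistributed into eBGP.
advertised_to_physical_router = [
    prefix for prefix, route in tier0_routes.items()
    if route["type"] in ("NSX Connected", "NSX Static")
]
print(advertised_to_physical_router)

# Southbound, the management plane gives "Tier-1 LR" only a default route toward tier-0.
tier1_routes = {"0.0.0.0/0": {"type": "NSX Static", "next_hop": "tier-0 routerlink IP"}}
```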

4.2.3 Fully Distributed Two Tier Routing

NSX-T provides a fully distributed routing architecture. The motivation is to provide routing functionality closest to the source. NSX-T leverages the same distributed routing architecture discussed in distributed router section and extends that to multiple tiers.

Figure 4‑14 shows both logical and per-transport-node views of two tier-1 LRs serving two different tenants and a tier-0 LR. The per-transport-node view shows that the tier-1 DRs for both tenants and the tier-0 DR have been instantiated on two hypervisors.

Figure 4‑14: Logical Routing Instances

If “VM1” in tenant 1 needs to communicate with “VM3” in tenant 2, routing happens locally on hypervisor “HV1”. This eliminates the need to send traffic to a centralized location in order to route between different tenants.

Multi-Tier Distributed Routing with Workloads on the same Hypervisor

The following list provides a detailed packet walk between workloads residing in different tenants but hosted on the same hypervisor.

  1. “VM1” (172.16.10.11) in tenant 1 sends a packet to “VM3” (172.16.201.11) in tenant 2. The packet is sent to its default gateway interface located on tenant 1, the local tier-1 DR.
  2. Routing lookup happens on the tenant 1 tier-1 DR and the packet is routed to the tier-0 DR following the default route. This default route has the routerlink interface IP address (100.64.224.0/31) as a next hop.
  3. Routing lookup happens on the tier-0 DR. It determines that the 172.16.201.0/24 subnet is learned via the tenant 2 tier-1 DR (100.64.224.3/31) and the packet is routed there.
  4. Routing lookup happens on the tenant 2 tier-1 DR. This determines that the 172.16.201.0/24 subnet is directly connected. L2 lookup is performed in the local MAC table to determine how to reach “VM3” and the packet is sent.

The reverse traffic from “VM3” follows a similar process. A packet from “VM3” to destination 172.16.10.11 is sent to the tenant 2 tier-1 DR, then follows the default route to the tier-0 DR. The tier-0 DR routes this packet to the tenant 1 tier-1 DR and the packet is delivered to “VM1”. During this process, the packet never leaves the hypervisor while being routed between tenants.
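To make the lookup chain above concrete, the following minimal Python sketch models each DR as a simple longest-prefix-match table. The tables, the tenant 1 routerlink address (100.64.224.1), and the next-hop labels are illustrative assumptions based on the figures above, not output from an actual NSX-T forwarding table.

```python
# Conceptual sketch of the two-tier distributed lookup described above.
# Only the routes needed for this packet walk are modeled.
import ipaddress

def lookup(table, dst):
    """Longest-prefix match over a {prefix: next_hop} routing table."""
    dst = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(p), nh) for p, nh in table.items()
               if dst in ipaddress.ip_network(p)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Tenant 1 tier-1 DR: connected subnets plus an auto-plumbed default route.
tenant1_t1_dr = {
    "172.16.10.0/24": "connected",
    "172.16.20.0/24": "connected",
    "0.0.0.0/0":      "tier-0 DR (100.64.224.0)",
}
# Tier-0 DR: NSX Static routes for tenant subnets via each tier-1 routerlink.
tier0_dr = {
    "172.16.10.0/24":  "tenant 1 tier-1 DR (100.64.224.1)",
    "172.16.201.0/24": "tenant 2 tier-1 DR (100.64.224.3)",
}
# Tenant 2 tier-1 DR: the destination subnet is directly connected.
tenant2_t1_dr = {"172.16.201.0/24": "connected"}

dst = "172.16.201.11"                      # VM3 in tenant 2
hop1 = lookup(tenant1_t1_dr, dst)          # default route toward tier-0 DR
hop2 = lookup(tier0_dr, dst)               # NSX Static route toward tenant 2
hop3 = lookup(tenant2_t1_dr, dst)          # connected; L2 lookup follows
print(hop1, "->", hop2, "->", hop3)
```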

Multi-Tier Distributed Routing with Workloads on different Hypervisors

Figure 4‑15 shows the packet flow between workloads in different tenants which are also located on different hypervisors.

Figure 4‑15: Logical Routing End-to-End Packet Flow between Hypervisors

The following list provides a detailed packet walk between workloads residing in different tenants and hosted on different hypervisors.

  1. “VM1” (172.16.10.11) in tenant 1 sends a packet to “VM2” (172.16.200.11) in tenant 2. The packet is sent to its default gateway interface located on the local tier-1 DR.
  2. Routing lookup happens on the tenant 1 tier-1 DR and the packet is routed to the tier-0 DR. It follows the default route to the tier-0 DR with a next hop IP of 100.64.224.0/31.
  3. Routing lookup happens on the tier-0 DR which determines that the 172.16.200.0/24 subnet is learned via the tenant 2 tier-1 DR (100.64.224.3/31) and the packet is routed accordingly.
  4. Routing lookup happens on the tenant 2 tier-1 DR, which determines that the 172.16.200.0/24 subnet is directly connected. A lookup is performed in the ARP table to determine the MAC address associated with the “VM2” IP address. This destination MAC is learned via the remote TEP on hypervisor “HV2”.
  5. The “HV1” TEP encapsulates the packet and sends it to the “HV2” TEP.
  6. The “HV2” TEP decapsulates the packet and a L2 lookup is performed in the local MAC table associated to the LIF where “VM2” is connected.
  7. The packet is delivered to “VM2”.

The return packet follows the same process. A packet from “VM2” gets routed to the local hypervisor tier-1 DR and is sent to the tier-0 DR. The tier-0 DR routes this packet to the tenant 1 tier-1 DR, which performs the L2 lookup to find that the MAC associated with “VM1” is on remote hypervisor “HV1”. The packet is encapsulated by “HV2” and sent to “HV1”, where it is decapsulated and delivered to “VM1”.

4.3 Edge Node

Edge nodes are service appliances with pools of capacity, dedicated to running network services that cannot be distributed to the hypervisors. Edge nodes can be viewed as empty containers when they are first deployed.

An Edge node is the appliance that provides physical NICs to connect to the physical infrastructure. Previous sections mentioned that centralized services will run on the SR component of logical routers. These features include:

  • Connectivity to physical infrastructure
  • NAT
  • DHCP server
  • Metadata proxy
  • Edge firewall

As soon as one of these services is configured or an uplink is defined on the logical router to connect to the physical infrastructure, an SR is instantiated on the Edge node. The Edge node is also a transport node, just like compute nodes in NSX-T, and similar to a compute node it can connect to more than one transport zone – one for overlay and the other for N-S peering with external devices.

There are two transport zones on the Edge:

  • Overlay Transport Zone: Any traffic that originates from a VM participating in the NSX-T domain may require reachability to external devices or networks. This is typically described as external north-south traffic. The Edge node is responsible for decapsulating the overlay traffic received from compute nodes as well as encapsulating the traffic sent to compute nodes.
  • VLAN Transport Zone: In addition to the encapsulate/decapsulate traffic function, Edge nodes also need a VLAN transport zone to provide uplink connectivity to the physical infrastructure.

Redundancy is recommended for the uplinks. There is complete flexibility in placing both uplinks in the same VLAN, sharing the same subnet, or creating two or more VLAN transport zones with each uplink as part of a different VLAN.

Types of Edge Nodes

Edge nodes are available in two form factors – VM and bare metal. Both leverage the Data Plane Development Kit (DPDK) for faster packet processing and high performance.

4.3.1 Bare Metal Edge

NSX-T bare metal Edge runs on a physical server and is installed using an ISO file or PXE boot. The bare metal Edge is recommended for production environments where services like NAT, firewall, and load balancer are needed in addition to L3 unicast forwarding. A bare metal Edge differs from the VM form factor Edge in terms of performance. It provides sub-second convergence, faster failover, and throughput greater than 10Gbps. Hardware requirements including CPU specifics and supported NICs can be found in the NSX Edge and Bare-Metal NSX Edge Physical Hardware Requirements section of the NSX-T Administration Guide.

When a bare metal Edge node is installed, a dedicated interface is retained for management. If redundancy is desired, two NICs can be used for management plane high availability. For each NIC on the server, an internal interface is created following the naming scheme “fp-ethX”. These internal interfaces are assigned to the DPDK fastpath and are allocated for overlay tunneling traffic or uplink connectivity to top of rack (TOR) switches. There is complete flexibility in assigning fp-eth interfaces for overlay or uplink connectivity. As there are four fp-eth interfaces on the bare metal Edge, a maximum of four physical NICs are supported for overlay and uplink traffic. In addition to primary interfaces, a 1G out-of-band NIC can also be used as a management interface.

There can only be one teaming policy per N-VDS. To develop the desired connectivity (e.g., explicit availability and traffic engineering), more than one N-VDS per Edge node may be required. Each N-VDS instance can have a unique teaming policy, allowing for flexible design choices.

Figure 4‑16 shows two of several variations of an NSX-T bare metal Edge node. The left side of the diagram shows the bare metal Edge with three physical NICs. Management traffic, overlay tunnel traffic, and uplink traffic each have a dedicated physical NIC. As noted above, there can be only one teaming policy per N-VDS. Since overlay traffic and uplink traffic use different physical NICs, two different N-VDS – “Overlay N-VDS” and “Ext N-VDS” – are used.

“Overlay N-VDS” uses the teaming policy that defines pNIC “P2” as active, while “Ext N-VDS” has pNIC “P3” as its unique active uplink.

The right side of the diagram shows the bare metal Edge with two physical NICs. One NIC is dedicated for management traffic. Overlay and uplink traffic share the same teaming policy and same physical NIC. Hence, one N-VDS is sufficient.

Figure 4‑16: Bare Metal pNIC and Traffic Control

Both topology options shown are susceptible to a single point of failure. If either the management or overlay link goes down, the bare metal node becomes unusable. A failure of any bare metal component will either force a standby node to take over forwarding (in an active/standby services design) or reduce the total forwarding capacity (in an active/active ECMP design). Using a LAG is an alternative option. These design recommendations are discussed in detail in the Edge Node and Service Design section.

Figure 4‑17 shows another variation of NSX-T bare metal Edge with five physical NICs.

Management traffic has a dedicated NIC, overlay tunnel traffic has two NICs dedicated in active/standby, and one NIC is dedicated for each uplink. This variation provides redundancy for both overlay and uplink traffic. Another link could be added to provide redundancy for management traffic, but no additional NICs can be used for overlay or uplink traffic, as this setup already has the maximum of four NICs assigned to the fastpath. Three N-VDS have been used in this example: “Overlay N-VDS” has a teaming policy that defines “P2” as the active uplink and “P3” as standby for overlay traffic; “Ext 1 N-VDS” has pNIC “P4” as the unique active uplink; and “Ext 2 N-VDS” has pNIC “P5” as the unique active uplink.

Figure 4‑17: Bare Metal with Dedicated pNICs for each Traffic Type

Depending upon the network design, the uplinks (Ext) may be configured in the same or different VLANs. Figure 4-18 shows a logical and physical topology where a tier-0 LR has two uplinks. These two uplinks are provided by bare metal Edge nodes “EN1” and “EN2”. Both Edge nodes are in the same rack and connect to the TOR switches in that rack. “EN1” provides “Ext 1” while “EN2” provides “Ext 2” connectivity to the TORs. The TORs have a trunk link carrying VLAN 100 to extend the L2 domain between the two Edge nodes. On each Edge node, one of the NICs is dedicated to send and receive overlay tunnel traffic to and from compute hypervisors. In the event of a link, Edge node “EN1”, or TOR failure, connectivity to the physical infrastructure is not impacted because of the active Edge node “EN2”.

Figure 4‑18: Single External VLAN with Dual-pNIC Connectivity

Figure 4‑19 shows a logical and physical topology where VLANs used for uplink connectivity on Edge nodes are different. Similar to the previous example, there are two Edge nodes providing uplink connectivity to the TORs. “EN1” provides “Ext 1” connectivity in VLAN 100 and “EN2” provides “Ext 2” connectivity in VLAN 200. The same uplink N-VDS serves both VLANs. On “EN1”, tier-0 LR connects to the uplink N-VDS via a VLAN backed logical switch in VLAN 100. On “EN2”, tier-0 LR connects to the same uplink N-VDS via a VLAN backed logical switch in VLAN 200. Similar to the previous topology, on each Edge node, one of the NICs is dedicated to send and receive overlay tunnel traffic to and from compute hypervisors.

Figure 4‑19: Two External VLANs with Two pNICs

Figure 4‑20 details another topology where the Edge node has point-to-point connectivity to the TORs. Each Edge node has four physical NICs. “EN1” provides “Ext 1” and “Ext 2” in VLAN 100 and VLAN 200, respectively. Similarly, “EN2” provides “Ext 1” and “Ext 2” for VLANs 300 and 400. Both “EN1” and “EN2” have one dedicated NIC for management and one for overlay traffic. In the event of a NIC failure on an Edge node or TOR, or a complete TOR failure, the Edge node does not lose connectivity to the physical infrastructure and both Edge nodes actively forward north-south traffic.

Figure 4‑20: Two External VLANs with Four pNICs

4.3.2 VM Form Factor

The NSX-T Edge in VM form factor can be installed using an OVA, OVF, or ISO file.

Depending on the required functionality, there are deployment-specific VM form factors. These are detailed in the table below.

| Size | Memory | vCPU | Disk | Specific Usage Guidelines |
|---|---|---|---|---|
| Small | 4GB | 2 | 120 GB | PoC only; LB functionality is not available. |
| Medium | 8GB | 4 | 120 GB | Suitable for production with centralized services like NAT and Edge firewall. Load balancer functionality can be leveraged for PoC as well. |
| Large | 16GB | 8 | 120 GB | Suitable for production with centralized services like NAT, Edge firewall, and load balancer. |
| Bare Metal Edge | 128GB | 8 | 120 GB | Suitable for production with centralized services like NAT, Edge firewall, and load balancer where performance in excess of 10Gbps is required. |

Table 4‑1: Edge VM Form Factors and Usage Guideline

When NSX-T Edge is installed as a VM, vCPUs are allocated to the Linux IP stack and DPDK. The number of vCPUs assigned to the Linux IP stack or DPDK depends on the size of the Edge VM. A medium Edge VM has two vCPUs for the Linux IP stack and two vCPUs dedicated to DPDK. This changes to four vCPUs for the Linux IP stack and four vCPUs for DPDK in a large size Edge VM.

An NSX-T Edge VM has four internal interfaces: eth0, fp-eth0, fp-eth1, and fp-eth2. Eth0 is reserved for management, while the rest of the interfaces are assigned to DPDK fastpath. These interfaces are allocated for uplinks to TOR switches and for NSX-T overlay tunneling. The interface assignment is flexible for either uplink or overlay. As an example, fp-eth0 could be assigned for overlay traffic with fp-eth1, fp-eth2, or both for uplink traffic.

There can only be one teaming policy per N-VDS. To develop desired connectivity (e.g., explicit availability and traffic engineering), more than one N-VDS per Edge node may be required. Each N-VDS instance can have a unique teaming policy, allowing for flexible design choices. Since an Edge VM runs on ESXi, it connects to a VDS; this provides flexibility in assigning a variety of teaming policies. Each vNIC may be mapped to a dedicated port group in the ESXi hypervisor, offering maximum flexibility in the assignment of the different kinds of traffic engineering.

In Figure 4‑21 there are two ESXi hosts, each with two physical NICs. Edge VMs “VM1” and “VM2” are hosted on different ESXi hosts, each connected to both TOR switches.

Four port groups have been defined on the VDS to connect the Edge VM; these are named “Mgmt PG”, “Transport PG”, “Ext1 PG” (VLAN 100), and “Ext2 PG” (VLAN 200). While this example uses a VDS, it would be the same if a VSS were selected. Use of a VSS is highly discouraged due to the support and flexibility benefits provided by the VDS. “Uplink 1” on the VDS is mapped to the first pNIC that connects to the TOR switch on the left and “Uplink 2” is mapped to the remaining pNIC that connects to the TOR switch on the right.

This figure also shows three N-VDS, named “Overlay N-VDS”, “Ext 1 N-VDS”, and “Ext 2 N-VDS”. Three N-VDS are used in this design because there can be only one teaming policy per N-VDS and the goal is to control which uplink of the ESXi VDS each type of traffic exits.

Overlay N-VDS uses the teaming policy that defines “Uplink1” as active and “Uplink2” as standby on both Edge VMs. “Ext-1 N-VDS” has “Uplink1” as unique active uplink. “Ext1 PG” (VLAN 100) is mapped to use “Uplink1”. “Ext-2 N-VDS” has “Uplink2” as unique active uplink. “Ext2 PG” (VLAN 200) is mapped to use “Uplink2”. This configuration ensures that the traffic sent on VLAN 100 always uses “Uplink1” and is sent to the left TOR switch. Traffic sent on VLAN 200 uses “Uplink2” and is sent to the TOR switch on the right.

Figure 4‑21: Edge Node VM with Two pNICs

4.3.3 Edge Cluster

An Edge cluster is a group of homogeneous Edge transport nodes – all VM or all bare metal – with common properties. It provides scale-out, redundant, and high-throughput gateway functionality for logical networks. Scale out from the logical networks to the Edge nodes is achieved using ECMP. There is total flexibility in assigning logical routers to Edge nodes and clusters. Tier-0 and tier-1 routers can be hosted on either the same or different Edge clusters. Note that a tier-1 LR must have centralized services enabled in order to coexist with a tier-0 LR in the same Edge cluster; such a configuration is shown in Figure 4‑22.

Figure 4‑22: Edge Cluster with Tier-0 and Tier 1 Services

Depending upon the services hosted on the Edge node and their usage, an Edge cluster could be dedicated simply for running centralized services (e.g., NAT). Figure 4‑23 shows two clusters of Edge nodes. Edge Cluster 1 is dedicated for tier-0 routers only and provides uplink connectivity to the physical infrastructure. Edge Cluster 2 is responsible for NAT functionality on tier-1 routers.

Figure 4‑23: Multiple Edge Clusters with Dedicated Tier-0 and Tier-1 Services

There can be only one tier-0 LR per Edge node; however, multiple tier-1 LRs can be hosted on one Edge node. Edge VMs and bare metal Edges cannot coexist in the same cluster. Edge VMs of different sizes can be combined in the same cluster; however, this is not recommended.

A maximum of 8 Edge nodes can be grouped in an Edge cluster. A tier-0 LR supports a maximum of eight equal cost paths, thus a maximum of eight Edge nodes are supported for ECMP. Edge nodes in an Edge cluster run Bidirectional Forwarding Detection (BFD) on both tunnel and management networks to detect Edge node failure. The BFD protocol provides fast detection of failure for forwarding paths or forwarding engines, improving convergence. Edge VMs support BFD with minimum BFD timer of one second with three retries, providing a three second failure detection time. Bare metal Edges support BFD with minimum BFD tx/rx timer of 300ms with three retries which implies 900ms failure detection time.
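As a quick worked example of the detection times quoted above, failure detection is simply the negotiated tx/rx interval multiplied by the retry count (a simplification that ignores jitter and timer negotiation details):

```python
# Rough BFD failure-detection time as described in the text above:
# detection time = tx/rx interval x number of retries.
def bfd_detection_time_ms(interval_ms, retries=3):
    return interval_ms * retries

print(bfd_detection_time_ms(300))    # bare metal Edge: 900 ms
print(bfd_detection_time_ms(1000))   # Edge VM: 3000 ms (3 seconds)
```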

A tier-0 LR hosted on an Edge node supports eight equal cost paths northbound. To reduce the number of BFD sessions running in an Edge cluster and provide maximum N-S throughput, eight Edge nodes in an Edge cluster is the recommended configuration.

4.4 Routing Capabilities

NSX-T supports static routing and the dynamic routing protocol eBGP on tier-0 LRs. Tier-1 LRs support static routes but do not support any dynamic routing protocols.

4.4.1 Static Routing

Northbound, a static route can be configured on tier-1 LRs with the next hop IP as the routerlink IP of the tier-0 LR (100.64.0.0/10 range). Tier-0 LRs can be configured with a static route toward external subnets with a next hop IP of the physical router. The management plane configures a default route on “Tier-1 LR” with the next hop IP address set to the routerlink IP of “Tier-0 LR”, 100.64.224.0/31.

ECMP is supported with static routes to provide load balancing, increased bandwidth, and fault tolerance for failed paths or Edge nodes. Figure 4-24 shows a tier-0 LR connected to two physical routers with two equal cost static default routes configured for ECMP. Up to eight paths are supported in ECMP. The current hash algorithm for ECMP is two-tuple, based on source and destination IP of the traffic.
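A rough illustration of how such a configuration might be driven programmatically is sketched below. The REST endpoint path, payload fields, router UUID, next-hop addresses, and credentials are assumptions for illustration only and should be verified against the NSX-T API guide for the release in use.

```python
# Illustrative only: configuring two equal-cost static default routes on a
# tier-0 LR through the NSX-T Manager REST API. Endpoint path, payload
# fields, UUID, next-hop IPs, and credentials are placeholders/assumptions.
import requests

NSX_MGR = "https://nsx-manager.example.com"        # placeholder
TIER0_ID = "<tier-0-logical-router-uuid>"          # placeholder
AUTH = ("admin", "password")                       # placeholder

for next_hop in ("192.168.240.1", "192.168.250.1"):  # two physical routers
    body = {
        "network": "0.0.0.0/0",                      # default route
        "next_hops": [{"ip_address": next_hop, "administrative_distance": 1}],
    }
    r = requests.post(
        f"{NSX_MGR}/api/v1/logical-routers/{TIER0_ID}/routing/static-routes",
        json=body, auth=AUTH, verify=False)
    r.raise_for_status()
```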

 

 

Figure 4‑24: Static Routing Configuration

BFD can also be enabled for faster failure detection of the next hop and is configured per static route. The BFD tx/rx timer can range from a minimum of 300ms to a maximum of 10,000ms. Default BFD tx/rx timers are set to 1000ms with three retries.

4.4.2 Dynamic Routing

BGP is the de facto protocol on the WAN and in most modern data centers. A typical leaf-spine topology has eBGP running between leaf switches and spine switches.

Tier-0 LRs support eBGP on the uplink connection with physical routers. BFD can also be enabled per eBGP neighbor for faster failover. BFD timers depend on the Edge node type. As discussed in the Edge section, bare metal Edge supports a minimum of 300ms tx/rx BFD timer while the VM form factor Edge supports a minimum of 1000ms tx/rx BFD timer.

With the NSX-T 2.0 release, the following eBGP features are supported:

  • Two- and four-byte AS numbers in both asplain and asdot format.
  • eBGP multi-hop support, allowing eBGP peering to be established on loopback interfaces.
  • eBGP multi-hop BFD
  • BGP route aggregation support with the flexibility of advertising a summary route only to the eBGP peer or advertising the summary route along with more specific routes. A more specific route must be present in the routing table to advertise a summary route.
  • Route redistribution in eBGP to advertise NSX Connected and NSX Static routes, learned from the tier-1 LR, to the eBGP peer.
  • Inbound/outbound route filtering with eBGP peer using prefix-lists or route-maps.
  • Influencing eBGP path selection by setting weight, AS Path Prepend, or MED.
  • BGP communities can be set in a route-map to facilitate matching of communities at the upstream router.
  • BGP well-known community names (e.g., no-advertise, no-export, no-export-subconfed) can also be included in the eBGP route updates to the eBGP peer.
  • Graceful restart in eBGP. Further details are provided later in this section.

The logical topology in Figure 4‑25 shows eBGP peering on a tier-0 LR with two physical routers.

Figure 4‑25: eBGP Dynamic Routing Configuration

In this example, the tier-0 LR is hosted on two Edge nodes. The uplink port on “EN1” connects to physical router 1 and the uplink port on “EN2” connects to physical router 2. eBGP is established with both the physical routers.
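For illustration, a configuration like this could be driven through the NSX-T Manager REST API roughly as sketched below; the endpoint path, field names, UUID, AS numbers, addresses, and credentials are assumptions and should be verified against the API documentation for the release in use.

```python
# Illustrative sketch only: adding two eBGP neighbors to a tier-0 logical
# router via the NSX-T Manager REST API. All identifiers are placeholders.
import requests

NSX_MGR = "https://nsx-manager.example.com"      # placeholder
TIER0_ID = "<tier-0-logical-router-uuid>"        # placeholder
AUTH = ("admin", "password")                     # placeholder

neighbors = [
    {"neighbor_address": "192.168.240.1", "remote_as": 64512},  # physical router 1
    {"neighbor_address": "192.168.250.1", "remote_as": 64512},  # physical router 2
]
for neighbor in neighbors:
    r = requests.post(
        f"{NSX_MGR}/api/v1/logical-routers/{TIER0_ID}/routing/bgp/neighbors",
        json=neighbor, auth=AUTH, verify=False)
    r.raise_for_status()
```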

Northbound, the tier-1 LR advertises connected, static, or tier-1 NAT routes to the tier-0 LR. The tier-0 LR sees connected routes on the tier-1 LR as NSX Static routes. Tier-0 redistributes NSX Static and any other routes into eBGP. The management plane configures a default route on “Tier-1 LR” with the next hop IP address set to the routerlink IP of “Tier-0 LR”, 100.64.224.0/31.

Active/active ECMP services support up to eight paths. The ECMP hash algorithm is 2-tuple, based on the source and destination IP addresses of the traffic. The hashing algorithm determines how incoming traffic is forwarded to the next-hop device when there are multiple paths.
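The following toy sketch shows the idea of 2-tuple path selection: a hash over the source and destination IP deterministically picks one of the equal-cost next hops, so all packets of a given flow follow the same path. The hash function used here is purely illustrative and is not the algorithm implemented in the NSX-T data path.

```python
# Toy model of 2-tuple ECMP next-hop selection.
import ipaddress
import zlib

def pick_path(src_ip, dst_ip, paths):
    # Combine source and destination IP, hash, and pick a path index.
    key = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    idx = zlib.crc32(key.to_bytes(4, "big")) % len(paths)
    return paths[idx]

uplinks = ["EN1-uplink", "EN2-uplink"]
# All packets of a given src/dst pair hash to the same uplink.
print(pick_path("172.16.10.11", "10.1.1.50", uplinks))
print(pick_path("172.16.20.12", "10.1.1.50", uplinks))
```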

Figure 4‑26 shows the ECMP routing topology where a tier-0 LR hosted on the Edge nodes “EN1” and “EN2” has equal cost routes from the upstream physical routers.

Figure 4‑26: ECMP and eBGP Peering

Graceful Restart (GR)

Graceful restart in eBGP allows an eBGP speaker to preserve its forwarding table while the control plane restarts. An eBGP control plane restart could happen due to a supervisor switchover in dual supervisor hardware, planned maintenance, or an active routing engine crash. As soon as a GR-enabled router restarts, it preserves its forwarding table, marks the routes as stale, and sets a grace period restart timer for the eBGP session to reestablish. If the eBGP session reestablishes during this grace period, route revalidation is done and the forwarding table is updated. If the eBGP session does not reestablish within this grace period, the router flushes the stale routes.

The eBGP session will not be GR capable if only one of the peers advertises it in the eBGP OPEN message; GR needs to be configured on both ends. GR can be enabled/disabled per logical router. The GR restart timer is 120 seconds.

4.5 Services High Availability

NSX Edge nodes run in an Edge cluster, hosting centralized services and providing connectivity to the physical infrastructure. Since these services run on the SR component of a logical router, the following concepts apply to the SR. An SR runs on an Edge node and has two modes of operation – active/active or active/standby.

4.5.1 Active/Active

Active/Active – This is a high availability mode where an SR is hosted on more than one Edge node at a time. Stateless services such as layer 3 forwarding are destination IP based, so it does not matter which Edge node receives and forwards the traffic. If multiple ECMP paths are available, traffic can be load balanced between the links, so the SRs hosted on the Edge nodes could be in an active/active HA state.

Stateful services typically require tracking of connection state (e.g., sequence number check, connection state), thus traffic for a given session needs to go through the same Edge node. As of NSX-T 2.0, active/active HA mode does not support stateful services such as Edge firewall or stateful NAT. Stateless services, including reflexive NAT and stateless firewall, can leverage the active/active HA model.

Keepalive messages are sent on both management and tunnel interfaces of the Edge node to check the status of the peers. Figure 4‑27 shows the keepalives sent/received between Edge nodes on both management and overlay interfaces. Both the active SRs have different IPs northbound and southbound.

Figure 4‑27: Multiple Keepalive Channels

An active SR on an Edge node is declared down when one of the following conditions is met:

  • Keepalive communication channels with peer SRs are down on both management and overlay tunnel interfaces.
  • All eBGP sessions on the peer SR are down. This is only applicable on tier-0 with dynamic routing.
  • All the overlay tunnels are down to remote Edges and compute hypervisors.

There is no FIB sync between active SRs; the active SR runs stateless ECMP paths.

Graceful Restart and BFD Interaction with Active/Active – Tier-0 SR Only

If an Edge node is connected to a TOR switch that does not have dual supervisors or the ability to keep forwarding traffic while the control plane is restarting, enabling GR in eBGP with that TOR does not make sense. There is no value in preserving the forwarding table on either end or sending traffic to the failed or restarting device. In case of an active SR failure (i.e., the Edge node goes down), physical router failure, or path failure, forwarding will continue using another active SR or another TOR. BFD should be enabled with the physical routers for faster failure detection.

If the Edge node is connected to a dual supervisor system that supports forwarding traffic when the control plane is restarting, then it makes sense to enable GR. This will ensure that forwarding table data is preserved and forwarding will continue through the restarting supervisor or control plane. Enabling BFD with such a system would depend on the device-specific BFD implementation. If the BFD session goes down during supervisor failover, then BFD should not be enabled with this system. If the BFD implementation is distributed such that the BFD session would not go down in case of supervisor or control plane failure, then enable BFD as well as GR.

4.5.2 Active/Standby

Active/Standby – This is a high availability mode where an SR is operational on a single Edge node at a time. This mode is required when stateful services are enabled. Services like NAT are in a constant state of sync between the active and standby SRs on the Edge nodes. This mode is supported on both tier-1 and tier-0 SRs. The conditions to declare an Edge node down are the same as in the active/active scenario and would trigger a failover.

For tier-1, the active and standby SRs have the same IP addresses northbound. Only the active SR will reply to ARP requests, while the standby SR's interfaces have their operational state set to down so that they automatically drop packets.

For tier-0, the active and standby SRs have different IP addresses northbound and have eBGP sessions established on both links. Both SRs receive routing updates from the physical routers and advertise routes to them; however, the standby SR prepends its local AS three times in its eBGP updates so that traffic from the physical routers prefers the active SR. The placement of the active and standby SRs in terms of connectivity to the TOR or northbound infrastructure becomes an important design choice, such that no single component failure results in a failure of both the active and standby service. This means diversity of TOR connectivity for bare metal active/standby pairs, as well as host-specific availability considerations for Edge node VMs, are important design choices. These choices are described in the design chapter.
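The effect of the prepend can be illustrated with a trivial comparison of AS-path lengths; the AS number below is made up, and real BGP best-path selection evaluates several attributes (weight, local preference, MED, and others) before AS-path length.

```python
# Simplified view of why upstream routers prefer the active tier-0 SR:
# the standby SR prepends its local AS three times, lengthening its AS path.
LOCAL_AS = 65001    # illustrative AS number

active_update  = {"prefix": "172.16.10.0/24", "as_path": [LOCAL_AS]}
standby_update = {"prefix": "172.16.10.0/24",
                  "as_path": [LOCAL_AS] * 4}   # local AS prepended 3 extra times

# Shorter AS path wins (all else being equal).
best = min((active_update, standby_update), key=lambda u: len(u["as_path"]))
print("preferred advertisement has AS path:", best["as_path"])
```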

Southbound IP addresses on the active and standby SRs are the same, but the operational state of the standby SR's southbound interface is down. Because that interface is down, the DR does not send any traffic to the standby SR.

Figure 4‑28 shows active and standby SRs on Edge nodes “EN1” and “EN2”.

Figure 4‑28: Active and Standby Routing Control with eBGP

Graceful Restart and BFD Interaction with Active/Standby

Active/standby services have an active/active control plane with active/standby data forwarding. In this redundancy model, eBGP is established on both the active and standby tier-0 SRs with their respective TORs. If the Edge node is connected to a system that does not have dual supervisors or the ability to keep forwarding traffic while the control plane is restarting, enabling GR in eBGP does not make sense. There is no value in preserving the forwarding table on either end, and no point sending traffic to the failed or restarting device. When the active tier-0 SR goes down, the route advertised from the standby tier-0 becomes the best route and forwarding continues using the newly active SR. If the TOR switch supports BFD, it is recommended to run BFD with both eBGP neighbors for faster failure detection.

If the Edge node is connected to a dual supervisor system that supports forwarding traffic while the control plane is restarting, then it makes sense to enable GR. This will ensure that the forwarding table is preserved and forwarding will continue through the restarting supervisor or control plane. Enabling BFD with such a system depends on the hardware vendor's BFD implementation. If the BFD session goes down during supervisor failover, then BFD should not be enabled with this system; however, if the BFD implementation is distributed such that the BFD session would not go down in case of a supervisor or control plane failure, then enable BFD as well as GR.

4.6 Other Network Services

4.6.1 Network Address Translation

Users can enable NAT as a network service on NSX-T. This is a centralized service which can be enabled on both tier-0 and tier-1 logical routers. NAT is enforced at the uplink of the LR.

Supported NAT rule types include:

  • Source NAT (SNAT): Source NAT translates the source IP of the outbound packets to a known public IP address so that the app can communicate with the outside world without using its private IP address. It also keeps track of the reply.
  • Destination NAT (DNAT): DNAT allows access to internal private IP addresses from the outside world by translating the destination IP address when inbound communication is initiated. It also takes care of the reply. For both SNAT and DNAT, users can apply NAT rules based on five-tuple match criteria.
  • Reflexive NAT: Reflexive NAT rules are stateless ACLs which must be defined in both directions. These do not keep track of the connection. Reflexive NAT rules can be used in cases where stateful NAT cannot be used due to asymmetric paths (e.g., user needs to enable NAT on active/active ECMP routers).

Table 4‑2 summarizes NAT rules and usage restrictions.

| NAT Rule Type | Type | Specific Usage Guidelines |
|---|---|---|
| Stateful | Source NAT (SNAT), Destination NAT (DNAT) | Can be enabled on both tier-0 and tier-1 logical routers. |
| Stateless | Reflexive NAT | Can be enabled on tier-0 routers; generally used when the tier-0 is in active/active mode. |

Table 4‑2: NAT Usage Guideline

NAT Service Router Placement

As a centralized service, whenever NAT is enabled, a service component or SR must be instantiated on an Edge cluster. In order to configure NAT, specify the Edge cluster where the service should run; it is also possible to pin the NAT service to a specific Edge node pair. If no specific Edge node is identified, the platform will perform auto placement of the service component on an Edge node in the cluster using a weighted round robin algorithm.
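For illustration, creating a NAT rule could look roughly like the sketch below. The endpoint path, payload fields, UUID, addresses, and credentials are assumptions and should be validated against the NSX-T API guide for the release in use.

```python
# Illustrative sketch only: creating a stateful SNAT rule on a tier-1 logical
# router through the NSX-T Manager REST API. All identifiers are placeholders.
import requests

NSX_MGR = "https://nsx-manager.example.com"      # placeholder
TIER1_ID = "<tier-1-logical-router-uuid>"        # placeholder
AUTH = ("admin", "password")                     # placeholder

snat_rule = {
    "action": "SNAT",
    "match_source_network": "172.16.10.0/24",    # internal subnet
    "translated_network": "80.80.80.1",          # public IP owned by the tier-1
    "enabled": True,
}
r = requests.post(
    f"{NSX_MGR}/api/v1/logical-routers/{TIER1_ID}/nat/rules",
    json=snat_rule, auth=AUTH, verify=False)
r.raise_for_status()
```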

4.6.2 DHCP Services

NSX-T provides both DHCP relay and DHCP server functionality. DHCP relay can be enabled at the logical router level and can act as a relay between non-NSX managed environments and DHCP servers. DHCP server functionality can be enabled to service DHCP requests from VMs connected to NSX-managed logical switches. DHCP server functionality is a stateful service and must be bound to an Edge cluster or a specific pair of Edge nodes, as with the NAT functionality. DHCP server functionality is bound directly per logical switch, and it is not mandatory to have a logical router created.

4.6.3 Metadata Proxy Service

With a metadata proxy server, VM instances can retrieve instance-specific metadata from an OpenStack Nova API server. This functionality is specific to OpenStack use-cases only. Metadata proxy service runs as a service on an NSX Edge node. For high availability, configure metadata proxy to run on two or more NSX Edge nodes in an NSX Edge cluster.

4.6.4 Edge Firewall Service

The Edge firewall service can be enabled on tier-0 and tier-1 routers for north-south firewalling. Similar to NAT, it is enforced on the uplink of the tier-0 and tier-1 routers. Table 4‑3 summarizes Edge firewalling usage criteria.

| Edge Firewall | Specific Usage Guidelines |
|---|---|
| Stateful | Can be enabled on both tier-0 and tier-1 logical routers. |
| Stateless | Can be enabled on tier-0 routers; generally used when the tier-0 is in active/active mode. |

Table 4‑3: Edge Firewall Usage Guideline

Since Edge firewalling is a centralized service, it needs to run on an Edge cluster or a set of Edge nodes. This service is described in more detail in the NSX-T Security section.

4.7 Topology Consideration

This section covers a few of the many topologies that customers can build with NSX-T. NSX-T logical routing components - tier-1 and tier-0 LRs - enable flexible deployment of multi-tiered routing topologies. Topology design also depends on what services are enabled and where those services are provided at the provider or tenant level.

4.7.1 Supported Topologies

Figure 4‑29 shows three topologies with tier-0 providing ECMP for N-S traffic by leveraging multiple Edge nodes. The first topology is single-tiered, where tier-0 connects directly to logical switches and provides E-W routing between subnets as well as N-S routing to the physical infrastructure. The second and third topologies show the multi-tiered approach for a single tenant and multiple tenants, respectively. Tier-0 is in an active/active HA model and provides multiple active paths for L3 forwarding using ECMP. The tier-0 LR can also be used to provide stateless services like reflexive NAT or stateless firewall in all three topologies.

Figure 4‑29: Various ECMP Topologies

As discussed in the two-tier routing section, centralized services can be enabled at the tenant level (i.e., tier-1 LR) or provider level (i.e., tier-0 LR). Figure 4‑30 shows two topologies with centralized services enabled. The first topology shows the tier-0 LR providing centralized services like stateful NAT; since stateful services are configured, the tier-0 LR is in active/standby HA mode. The second topology shows NAT configured on a tenant tier-1 LR in active/standby HA mode, while the tier-0 LR, with no stateful services, runs in active/active HA mode and leverages ECMP for L3 forwarding of north-south traffic.

Figure 4‑30: Stateful and Stateless (ECMP) Services Topology Choices at Each Tier

4.7.2 Unsupported Topologies

While the deployment of logical routing components enables customers to deploy flexible multi-tiered routing topologies, Figure 4‑31 presents topologies that are not supported. A tier-1 LR cannot be connected to the physical router directly, as shown in the left topology. The middle topology shows that a tenant tier-1 LR cannot be connected directly to another tenant tier-1 LR. If the tenants need to communicate, route exchange between the two tenants' tier-1 routers must be facilitated by the tier-0 LR. The rightmost topology highlights that a tier-1 LR cannot be connected to two different upstream tier-0 LRs.

Figure 4‑31: Unsupported Topologies

5 NSX-T Security

In addition to providing network virtualization, NSX-T also serves as an advanced security platform, providing a rich set of features to streamline the deployment of security solutions. This chapter focuses on NSX-T security capabilities, architecture, components, and implementation. Key concepts for examination include:

  • NSX-T distributed firewall provides stateful protection of the workload at the vNIC level. DFW enforcement occurs in the hypervisor kernel, helping deliver micro-segmentation.
  • Uniform security policy model for on premises and cloud deployment, supporting multi-hypervisor (i.e., ESXi and KVM) and multi-workload, with a level of granularity down to VM/containers attributes.
  • Agnostic compute management, supporting hypervisors managed by different compute-managers while allowing any defined micro-segmentation policy to be applied across hypervisors spanning multiple vCenter environments.
  • NSX-T Edge firewall serves as a centralized stateful firewall service for N-S traffic. The Edge firewall is implemented per logical router and supported at both tier-0 and tier-1. The Edge firewall is independent of the NSX-T DFW from a policy configuration and enforcement perspective.
  • Dynamic grouping of objects into logical constructs called NSGroups based on various criteria including tag, virtual machine name, subnet, and logical switch.
  • The scope of policy enforcement can be selective, with application or workload-level granularity.
  • Distributed Network Encryption (DNE) provides confidentiality and integrity of the data flowing through the network.
  • IP discovery mechanism dynamically identifies workload addressing.
  • SpoofGuard blocks IP spoofing at vNIC level.
  • Switch Security provides storm control and security against unauthorized traffic.

5.1 NSX-T Security Use Cases

The NSX-T security platform is designed to address the firewall challenges faced by IT admins. The NSX-T firewall is delivered as part of a distributed platform that offers ubiquitous enforcement, scalability, line rate performance, multi-hypervisor support, and API-driven orchestration. These fundamental pillars of the NSX-T firewall allow it to address many different use cases for production deployment.

One of the leading use cases NSX-T supports is micro-segmentation. Micro-segmentation enables an organization to logically divide its data center into distinct security segments down to the individual workload level, then define distinct security controls for and deliver services to each unique segment. A central benefit of micro-segmentation is its ability to deny attackers the opportunity to pivot laterally within the internal network, even after the perimeter has been breached.

VMware NSX-T supports micro-segmentation as it allows for a centrally controlled, operationally distributed firewall to be attached directly to workloads within an organization’s network. The distribution of the firewall for the application of security policy to protect individual workloads is highly efficient; rules can be applied that are specific to the requirements of each workload. Of additional value is that NSX’s capabilities are not limited to homogeneous vSphere environments. It supports the heterogeneity of platforms and infrastructure that is common in organizations today.

Figure 5‑1: Example of Micro-segmentation with NSX

Micro-segmentation provided by NSX-T supports a zero-trust architecture for IT security. It establishes a security perimeter around each VM or container workload with a dynamically-defined policy. Conventional security models assume that everything on the inside of an organization's network can be trusted; zero trust assumes the opposite – trust nothing and verify everything. This addresses the increased sophistication of network attacks and insider threats that frequently exploit the conventional perimeter-controlled approach. For each system in an organization's network, trust of the underlying network is removed. A perimeter is defined per system within the network to limit the possibility of lateral (i.e., east-west) movement of an attacker.

Implementation of a zero trust model of IT security with traditional network security solutions can be costly, complex, and come with a high management burden. Moreover, the lack of visibility into an organization's internal networks can slow down implementation of a zero trust architecture and leave gaps that may only be discovered after they have been exploited. Additionally, conventional internal perimeters may have granularity only down to a VLAN or subnet – as is common with many traditional DMZs – rather than down to the individual system.

5.2 NSX-T DFW Architecture and Components

In the NSX-T DFW architecture, the management plane, control plane, and data plane work together to enable a centralized policy configuration model with distributed firewalling. This section will examine the role of each plane and its associated components, detailing how they interact with each other to provide a scalable, topology-agnostic distributed firewall solution.

Figure 5‑2: NSX-T DFW Architecture and Components

5.2.1 Management Plane

The NSX-T management plane is implemented through NSX-T Manager. Access to the NSX-T Manager is available through a GUI or REST API framework. When a firewall policy rule is configured, NSX-T Manager validates the configuration and locally stores a persistent copy. NSX-T Manager then pushes user-published policies to the central control plane cluster (CCP), which in turn pushes them to the data plane. A typical DFW policy configuration consists of one or more sections with a set of rules using objects like IPSets, NSGroups, logical switches, and application level gateways (ALGs). For monitoring and troubleshooting, NSX-T Manager interacts with a host-based management plane agent (MPA) to retrieve DFW status along with rule and flow statistics. NSX-T Manager also collects an inventory of all hosted virtualized workloads on NSX-T transport nodes. This inventory is dynamically collected and updated from all NSX-T transport nodes.

5.2.2 Control Plane

The NSX-T control plane consists of two components - the central control plane (CCP) and the logical control plane (LCP). The CCP consists of the NSX-T Controller clusters, while the LCP includes the user space module on all of the NSX-T transport nodes. This module interacts with the CCP to exchange configuration and state information.

From a DFW policy configuration perspective, NSX-T Controllers will receive policy rules pushed by NSX-T Manager. If the policy contains objects including logical switches or NSGroups, it converts them into IP addresses using an object-to-IP mapping table. This table is maintained by the controller and updated using an IP discovery mechanism. Once the policy is converted into a set of rules based on IP addresses, the CCP pushes the rules to the LCP on all NSX-T transport nodes.

CCP controllers utilize a hierarchy system to distribute the load of CCP-to-LCP communication. The responsibility for transport node notification is distributed across the controllers in the CCP cluster based on an internal hashing mechanism. For example, with 30 transport nodes and three controllers, each controller will be responsible for roughly ten transport nodes.
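A toy model of this sharding is sketched below. The actual hashing mechanism is internal to NSX-T; a simple stable hash modulo the number of controllers is used here only to show the roughly even split described above.

```python
# Illustrative sharding of 30 transport nodes across a three-node CCP cluster.
import hashlib
from collections import Counter

controllers = ["ccp-1", "ccp-2", "ccp-3"]
transport_nodes = [f"tn-{i:02d}" for i in range(30)]

def owner(node_id):
    # Stable hash of the node identifier picks its owning controller.
    digest = hashlib.md5(node_id.encode()).hexdigest()
    return controllers[int(digest, 16) % len(controllers)]

print(Counter(owner(tn) for tn in transport_nodes))  # roughly 10 nodes each
```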

5.2.3 Data Plane

The distributed data plane is comprised of the NSX-T transport nodes, with firewall enforcement done within the hypervisor kernel. Each transport node, at any given time, connects to only one CCP controller, based on mastership for that node. Once the local control plane (LCP) has received the policy configuration from the CCP, it pushes the firewall policy and rules to the data plane filters (in kernel) for each vNIC. With the “Applied To” field in the rule or section, which defines the scope of enforcement, the LCP ensures that only relevant DFW rules are programmed on the relevant vNICs instead of every rule everywhere, which would be a suboptimal use of hypervisor resources. Additional details on the data plane components for both ESXi and KVM hosts are explained in the following sections.

5.3 NSX-T Data Plane Implementation - ESXi vs. KVM Hosts

NSX-T provides network virtualization and security services in a heterogeneous hypervisor environment, managing ESXi and KVM hosts as part of the same NSX-T cluster. The DFW is functionally identical in both environments; however, there are architectural and implementation differences depending on the hypervisor specifics.

Management and control plane components are identical for both ESXi and KVM hosts. For the data plane, they use a different implementation for packet handling. NSX-T uses N-VDS on ESXi hosts, which is derived from vCenter VDS, along with the VMware Internetworking Service Insertion Platform (vSIP) kernel module for firewalling. For KVM, the N-VDS leverages Open vSwitch (OVS) and its utilities. The following sections highlight data plane implementation details and differences between these two options.

5.3.1 ESXi Hosts- Data Plane Components

NSX-T uses the N-VDS on ESXi hosts for connecting virtual workloads, managing it with the NSX-T Manager application. The NSX-T DFW kernel space implementation for ESXi is the same as the implementation in NSX for vSphere – it uses the VMware Internetworking Service Insertion Platform (vSIP) kernel module and kernel IO chain filters. NSX-T does not require vCenter to be present. Figure 5‑3 provides details on the data plane components for the ESXi host.

 

 

Figure 5‑3: NSX-T DFW Data Plane Components on an ESXi Host

5.3.2 KVM Hosts- Data Plane Components

NSX-T uses OVS and its utilities on KVM to provide DFW functionality, so the LCP agent implementation differs from that on an ESXi host. For KVM, there is an additional component called the NSX agent running alongside the LCP, with both running as user space agents. When the LCP receives a DFW policy from the CCP, it sends it to the NSX agent. The NSX agent processes and converts the policy messages into a format appropriate for the OVS data path, then programs the policy rules onto the OVS data path using OpenFlow messages. For stateful DFW rules, NSX-T uses the Linux conntrack utilities to keep track of the state of permitted flow connections allowed by a stateful firewall rule. For DFW policy rule logging, NSX-T uses the ovs-fwd module.

The MPA interacts with NSX-T Manager to export status, rules, and flow statistics. The MPA module gets the rule and flow statistics from the data path tables using the stats exporter module.

Figure 5‑4: NSX-T DFW Data Plane Components on KVM

5.3.3 NSX-T DFW Policy Lookup and Packet Flow

In the data path, the DFW maintains two tables: a rule table and a connection tracker table. The LCP populates the rule table with the configured policy rules, while the connection tracker table is updated dynamically to cache flows permitted by the rule table. NSX-T DFW allows a policy to be stateful or stateless with section-level granularity in the DFW rule table. The connection tracker table is populated only for stateful policy rules; it contains no information on stateless policies. This applies to both ESXi and KVM environments.

NSX-T DFW rules are enforced as follows:

  • Rules are processed in top-to-bottom order.
  • Each packet is checked against the top rule in the rule table before moving down the subsequent rules in the table.
  • The first rule in the table that matches the traffic parameters is enforced. The search is then terminated, so no subsequent rules will be examined or enforced.

Because of this behavior, it is always recommended to put the most granular policies at the top of the rule table. This will ensure more specific policies are enforced first. The DFW default policy rule, located at the bottom of the rule table, is a catchall rule; packets not matching any other rule will be enforced by the default rule - which is set to “allow” by default. This ensures that VM-to-VM communication is not broken during staging or migration phases. It is a best practice to then change this default rule to a “block” action and enforce access control through a whitelisting model (i.e., only traffic defined in the firewall policy is allowed onto the network). Figure 5‑5 diagrams the policy rule lookup and packet flow.

Figure 5‑5: NSX-T DFW Policy Lookup - First Packet

For an IP packet identified as “pkt1” that matches rule number 2, the order of operations is as follows:

1. A lookup is performed in the connection tracker table to determine if an entry for the flow already exists.

2. As flow 3 is not present in the connection tracker table, a lookup is performed in the rule table to identify which rule is applicable to flow 3. The first rule that matches the flow will be enforced.

3. Rule 2 matches for flow 3. The action is set to “allow”.

4. Because the action is set to “allow” for flow 3, a new entry will be created inside the connection tracker table. The packet is then transmitted out of the DFW.

 

Figure 5‑6: NSX-T DFW Policy Lookup - Subsequent Packets

As shown in Figure 5‑6, subsequent packets are processed in this order:

5. A lookup is performed in the connection tracker table to check if an entry for the flow already exists.

6. An entry for flow 3 exists in the connection tracker table. The packet is transmitted out of the DFW.
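The lookup behavior described above can be summarized in a short, purely illustrative sketch: first-match evaluation against the rule table, with permitted flows cached in a connection tracker so that subsequent packets bypass the rule walk. The rules and flow below are made up and do not correspond to the figures.

```python
# Minimal model of DFW policy lookup: first-match rule evaluation, with
# permitted flows cached in a connection tracker table.
rule_table = [
    {"id": 1, "src": "any", "dst": "172.16.30.11", "svc": "ssh",  "action": "drop"},
    {"id": 2, "src": "any", "dst": "172.16.10.11", "svc": "http", "action": "allow"},
    {"id": 3, "src": "any", "dst": "any",          "svc": "any",  "action": "allow"},  # default rule
]
conn_tracker = {}   # flow key -> id of the rule that permitted it

def match(rule, flow):
    return all(rule[k] in ("any", flow[k]) for k in ("src", "dst", "svc"))

def process(flow):
    key = tuple(sorted(flow.items()))
    if key in conn_tracker:                      # subsequent packets of a flow
        return "allow (cached, rule %d)" % conn_tracker[key]
    for rule in rule_table:                      # first packet: top-to-bottom walk
        if match(rule, flow):
            if rule["action"] == "allow":
                conn_tracker[key] = rule["id"]   # cache only permitted flows
            return "%s (rule %d)" % (rule["action"], rule["id"])

flow = {"src": "10.0.0.5", "dst": "172.16.10.11", "svc": "http"}
print(process(flow))   # rule 2 matched; entry added to the connection tracker
print(process(flow))   # served from the connection tracker
```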

5.4 NSX-T Security Policy - Plan, Design and Implement

Planning, designing, and implementing NSX-T security policy is a three-step process:

  1. Policy Methodology – Decide on the policy approach - application-centric, infrastructure-centric, or network-centric
  2. Policy Rule Model – Select the grouping and management strategy for policy rules using NSX-T DFW policy sections.
  3. Policy Consumption – Implement the policy rules using the grouping constructs and abstraction options provided by NSX-T.

5.4.1 Security Policy Methodology

This section details the considerations behind policy creation strategies to help determine which capabilities of the NSX platform should be utilized as well as how various grouping methodologies and policy strategies can be adopted for a specific design.

The three general methodologies reviewed in Figure 5‑7 can be utilized for grouping application workloads and building security rule sets within the NSX DFW. This section will look at each methodology and highlight appropriate usage.

 

Figure 5‑7: Micro-segmentation Methodologies

5.4.1.1 Application

In an application-centric approach, grouping is based on the application type (e.g., VMs tagged as “Web-Servers”), application environment (e.g., all resources tagged as “Production-Zone”), and application security posture. An advantage of this approach is that the security posture of the application is not tied to network constructs or infrastructure. Security policies can move with the application irrespective of network or infrastructure boundaries, allowing security teams to focus on the policy rather than the architecture. Policies can be templated and reused across instances of the same types of applications and workloads while following the application lifecycle; they are applied when the application is deployed and destroyed when the application is decommissioned. An application-based policy approach will significantly aid in moving towards a self-service IT model.

An application-centric model does not provide significant benefits in an environment that is static, lacks mobility, and has infrastructure functions that are properly demarcated.

5.4.1.2 Infrastructure

Infrastructure-centric grouping is based on infrastructure components such as logical switches or logical ports, identifying where application VMs are connected. Security teams must work closely with the network administrators to understand logical and physical boundaries.

If there are no physical or logical boundaries in the environment, then an infrastructure-centric approach is not suitable.

5.4.1.3 Network

Network-centric is the traditional approach of grouping based on L2 or L3 elements. Grouping can be done based on MAC addresses, IP addresses, or a combination of both. NSX supports this approach of grouping objects. A security team needs to be aware of the networking infrastructure to deploy network-based policies. There is a high probability of security rule sprawl, as grouping based on dynamic attributes is not used. This method of grouping works well for migrating existing rules from an existing firewall.

A network-centric approach is not recommended in dynamic environments where there is a rapid rate of infrastructure change or VM addition/deletion.

5.4.2 Security Rule Model

Policy rule models in a data center are essential to achieving an optimal micro-segmentation strategy. The first criterion in developing a policy model is to align with the natural boundaries in the data center, such as application tiers, SLAs, isolation requirements, and zonal access restrictions. Associating a top-level zone or boundary to a policy helps apply consistent, yet flexible control.

Global changes for a zone can be applied via single policy; however, within the zone there could be a secondary policy with sub-grouping mapping to a specific sub-zone. An example production zone might itself be carved into sub-zones like PCI or HIPAA. There are also zones for each department as well as shared services. Zoning creates relationships between various groups, providing basic segmentation and policy strategies.

A second criterion in developing policy models is identifying reactions to security events and workflows. If a vulnerability is discovered, what are the mitigation strategies? Where is the source of the exposure – internal or external? Is the exposure limited to a specific application or operating system version?

The answers to these questions help shape a policy rule model. Policy models should be flexible enough to address ever-changing deployment scenarios, rather than simply be part of the initial setup. Concepts such as intelligent grouping, tags, and hierarchy provide flexible and agile response capability for steady state protection as well as during instantaneous threat response. The model shown in Figure 5‑8 represents an overview of the different classifications of security rules that can be placed into the NSX DFW rule table. Each of the classifications shown would represent one or more sections within an NSX-T firewall rule table.

Figure 5‑8: Security Rule Model

5.4.3 Security Policy - Consumption Model

NSX-T security policy is consumed through the firewall rule table, using the NSX-T Manager GUI or REST API framework. When defining security policy rules for the firewall table, it is recommended to follow these high-level steps:

  • VM Inventory Collection – Identify and organize a list of all hosted virtualized workloads on NSX-T transport nodes. This is dynamically collected and saved by NSX-T Manager as the nodes – ESXi or KVM – are added as NSX-T transport nodes.
  • Tag Workload – Use VM inventory collection to organize VMs with one or more tags. Each designation consists of scope and tag association of the workload to an application, environment, or tenant. For example, a VM tag could be “Scope = Prod, Tag = web” or “Scope=tenant-1, Tag = app-1”.
  • Group Workloads – Use the NSX-T logical grouping construct with dynamic or static membership criteria based on VM name, tags, logical switch, logical port, IPSets, or other attributes.
  • Define Security Policy – Using the firewall rule table, define the security policy. Have sections to separate and identify emergency, infrastructure, environment, and application-specific policy rules based on the rule model.
  • (Optional) Define Distributed Network Encryption (DNE) Policy – This will provide data confidentiality, integrity, and authenticity for the traffic matching encryption policy rule between the workloads.

The methodology and rule model mentioned earlier would influence how to tag and group the workloads as well as affect policy definition. The following sections offer more details on grouping and firewall rule table construction with an example of grouping objects and defining NSX-T DFW policy.
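As an illustration of the tag-and-group steps above, the sketch below creates an NSGroup with a tag-based dynamic membership criterion through the NSX-T Manager REST API; the endpoint path, resource and field names, tag values, and credentials are assumptions that should be checked against the API guide for the NSX-T release in use.

```python
# Illustrative sketch only: creating an NSGroup whose dynamic membership is
# all VMs tagged with Scope=Prod, Tag=web. All identifiers are placeholders.
import requests

NSX_MGR = "https://nsx-manager.example.com"      # placeholder
AUTH = ("admin", "password")                     # placeholder

ns_group = {
    "display_name": "SG-PROD-WEB",
    "membership_criteria": [{
        "resource_type": "NSGroupTagExpression",  # dynamic, tag-based criterion
        "target_type": "VirtualMachine",
        "scope": "Prod",
        "tag": "web",
    }],
}
r = requests.post(f"{NSX_MGR}/api/v1/ns-groups",
                  json=ns_group, auth=AUTH, verify=False)
r.raise_for_status()
```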

5.4.3.1 Group Creation Strategies

The most basic grouping strategy is the creation of a group around every application hosted in the NSX-T environment. Each 3-tier, 2-tier, or single-tier application should have its own security group to enable faster operationalization of micro-segmentation. When combined with a basic rule restricting inter-application communication to only shared essential services (e.g., DNS, AD, DHCP server), this enforces granular security inside the perimeter. Once this basic micro-segmentation is in place, the writing of per-application rules can begin.

NSX-T provides multiple grouping constructs which help to group workloads. The selection of a specific policy methodology approach – application, infrastructure, or network – will help dictate which grouping construct to use:

  • IPSet: Grouping of IP addresses and subnets; a good match for the network policy methodology (a creation sketch follows this list).
  • MAC sets: Grouping of MAC addresses, mainly used for L2 firewall rules.
  • NSGroups: A flexible grouping construct, used mostly for the application and infrastructure methodologies.
  • NSServices: Enables definition of a new service and grouping of similar services (e.g., all infrastructure services - AD, NTP, DNS/DHCP) into one NSServices group.
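As a sketch of the simplest construct above, the following Python snippet creates an IPSet through the NSX-T Manager REST API covering the web-tier subnet used later in Example 1. The manager address and credentials are placeholders, and the endpoint and payload fields reflect the NSX-T 2.x API; confirm them against the API guide for your release.

```python
import requests

NSX_MGR = "https://nsx-mgr.example.com"   # placeholder NSX-T Manager FQDN
AUTH = ("admin", "REPLACE_ME")            # placeholder credentials

# An IPSet may contain individual addresses, ranges, or subnets.
ipset = {
    "display_name": "IPSet-Web-Tier",
    "ip_addresses": ["172.16.10.0/24"],
}

resp = requests.post(f"{NSX_MGR}/api/v1/ip-sets",
                     auth=AUTH, json=ipset, verify=False)
resp.raise_for_status()
print("Created IPSet", resp.json()["id"])
```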

NSGroups

NSX-T NSGroups are the equivalent of NSX for vSphere security groups. They allow abstraction of workload grouping from the underlying infrastructure topology. This allows a security policy to be written for either a workload or zone (e.g., PCI zone, DMZ, or production environment).

An NSGroup is a logical construct that allows grouping into a common container of static (e.g., IPSet/NSX objects) and dynamic (e.g., VM names/VM tags) elements. This is a generic construct which can be leveraged across a variety of NSX-T features where applicable.

Static criteria provide capability to manually include particular objects into the NSGroup. For dynamic inclusion criteria, boolean logic can be used to create groups between various criteria. An NSGroup constructs a logical grouping of VMs based on static and dynamic criteria. Table 5-1 shows selection criteria based on NSX Objects.

| NSX Object | Description |
| --- | --- |
| IPSet | Selected IPSet container will be used. An IPSet contains individual IP addresses, IP subnets, or ranges of IP addresses. |
| Logical Switch | All VMs/vNICs connected to this logical switch segment will be selected. |
| NSGroup | Nested NSGroup scenario - all VMs/vNICs defined within the NSGroup will be selected. |
| Logical Port | This particular vNIC instance (i.e., logical port) will be selected. |
| MAC Set | Selected MAC set container will be used. MAC sets contain a list of individual MAC addresses. |

Table 5‑1: NSX-T Objects used for NSGroups

Table 5-2 lists the selection criteria based on VM properties. The use of NSGroups gives more flexibility as an environment changes over time. Even if the rules contain only IP addresses, NSX-T provides a grouping object called an IPSet that can group IP addresses; this can then be used in NSGroups. This approach has three major advantages:

  • Rules stay more constant for a given policy model, even as the datacenter environment changes. The addition or deletion of workloads will affect group membership alone, not the rules.
  • Publishing a change of group membership to the underlying hosts is more efficient than publishing a rule change. It is faster to send down to all the affected hosts and cheaper in terms of memory and CPU utilization.
  • As NSX adds more grouping object criteria, the group criteria can be edited to better reflect the datacenter environment.

| VM Property | Description |
| --- | --- |
| VM Name | All VMs whose name contains, equals, or starts with the specified string. |
| Security Tags | All VMs to which the specified NSX security tags are applied. |

Table 5‑2: VM Properties used for NSGroups

 

Using Nesting of Groups

Groups can be nested. An NSGroup may contain multiple NSGroups or a combination of NSGroups and other grouping objects. A security rule applied to the parent NSGroup is automatically applied to the child NSGroups.

In the example shown in Figure 5‑9, three NSGroups have been defined with different inclusion criteria to demonstrate the flexibility and power of the NSGroup grouping construct.

  • Using dynamic inclusion criteria, all VMs whose name starts with “WEB” are included in the NSGroup named “SG-WEB”.
  • Using dynamic inclusion criteria, all VMs whose name contains “APP” and that carry the tag “Scope=PCI” are included in the NSGroup named “SG-PCI-APP”.
  • Using static inclusion criteria, all VMs connected to the logical switch “LS-DB” are included in the NSGroup named “SG-DB”.

Nesting of NSGroups is also possible; all three of the NSGroups in the list above could be children of a parent NSGroup named “SG-APP-1-AllTier”. This organization is also shown in Figure 5‑9.

Figure 5‑9: NSGroup and Nested NSGroup Example
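The example in Figure 5‑9 could be provisioned with a sketch along the following lines, again through the NSX-T Manager REST API. The expression resource types and property names ("NSGroupSimpleExpression", "target_property", and so on) are assumptions based on the NSX-T 2.x NSGroup schema, and the manager address and credentials are placeholders; validate the exact payload against the API guide before use.

```python
import requests

NSX_MGR = "https://nsx-mgr.example.com"   # placeholder NSX-T Manager FQDN
AUTH = ("admin", "REPLACE_ME")            # placeholder credentials

def create_nsgroup(body):
    r = requests.post(f"{NSX_MGR}/api/v1/ns-groups",
                      auth=AUTH, json=body, verify=False)
    r.raise_for_status()
    return r.json()["id"]

# Dynamic inclusion: every VM whose name starts with "WEB".
sg_web = create_nsgroup({
    "display_name": "SG-WEB",
    "membership_criteria": [{
        "resource_type": "NSGroupSimpleExpression",   # assumed expression type
        "target_type": "VirtualMachine",
        "target_property": "name",
        "op": "STARTSWITH",
        "value": "WEB",
    }],
})

# Nesting: a parent NSGroup whose static member is another NSGroup.
create_nsgroup({
    "display_name": "SG-APP-1-AllTier",
    "members": [{
        "resource_type": "NSGroupSimpleExpression",
        "target_type": "NSGroup",
        "target_property": "id",
        "op": "EQUALS",
        "value": sg_web,
    }],
})
```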

Efficient Grouping Considerations

Calculation of groups adds a processing load to the NSX management and control planes. Different grouping mechanisms add different types of loads; static groupings are less expensive to calculate than dynamic groupings. At scale, grouping decisions should take into account how frequently group membership changes for the associated VMs. A large number of group changes applied frequently indicates that the grouping criteria are suboptimal.

5.4.3.2 Define Policy using DFW Rule Table

The NSX-T DFW rule table starts with a default rule to allow any traffic. An administrator can add multiple sections on top of the default section, with rules based on the specific policy model. In the data path, packet lookup is performed from top to bottom; any packet not matching an explicit rule is enforced by the last rule in the table (i.e., the default rule). This final rule is set to the “allow” action by default, but it can be changed to “block” if desired. It is recommended to use the DFW with the default rule set to block and then create explicit rules for allowed traffic (i.e., a whitelist model).

The NSX-T DFW enables policy to be stateful or stateless, with section-level granularity in the DFW rule table. By default, the NSX-T DFW is a stateful firewall; this is the requirement for most deployments. In scenarios where an application generates little network activity, a stateless section may be appropriate to avoid connections being reset when they age out of the DFW stateful connection table due to inactivity.

| Name | Rule ID | Source | Destination | Service | Action | Applied To | Log | Stats | Direction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |

Table 5‑3: Policy Rule Fields

A policy rule within a section is composed of the fields shown in Table 5-3; their meanings are described below.

Rule Number: The precedence/position of the rule from top to bottom (i.e., order of evaluation). It is assigned top to bottom across sections by NSX-T Manager.

Rule Name: User field; supports up to 30 characters.

Rule ID: Unique number given to a rule.

Source and Destination: Source and destination fields of the packet. Possible entries are:

  •  IPv4/IPv6 addresses or subnets
  • NSX-T objects as defined in Table 5-4

All permutations are possible for the source/destination field. IP address/subnet and NSX-T group objects can be used individually or simultaneously.

| Object | Description |
| --- | --- |
| IPSet | Selected IPSet container will be used. An IPSet contains individual IP addresses, IP subnets, or ranges of IP addresses. |
| Logical Switch | All VMs/vNICs connected to this logical switch segment will be selected. |
| NSGroup | All VMs/vNICs defined within the NSGroup will be selected. |
| Logical Port | This particular vNIC (logical port) instance will be selected. |

Table 5‑4: NSGroup Resource Values

 Service: Predefined services, predefined services groups, or raw protocols can be selected. When selecting raw protocols like TCP or UDP, it is possible to define individual port numbers or a range. There are four options for the services field:

  • Predefined NSService – A predefined NSService from the list of available objects.
  • Custom NSService – Define a custom service by clicking the “Create New NSService” option. Custom services can be created based on L4 port sets, application level gateways (ALGs), IP protocols, and other criteria, using the “type of service” option in the configuration menu. When selecting an L4 port set with TCP or UDP, it is possible to define individual destination ports or a range of destination ports (see the sketch after this list). When selecting ALG, select the supported protocols for ALG from the list. ALGs are only supported in stateful mode; if the section is marked as stateless, the ALGs will not be implemented. Additionally, some ALGs may be supported only on ESXi hosts, not KVM. Please review release-specific documentation for supported ALGs and hosts.
  • Predefined NSService Group – A predefined NSService group from the list of available objects.
  • Custom NSService Group – Define a custom NSService group by selecting single or multiple services from within NSX.
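As a sketch of the custom NSService option mentioned above, the following snippet defines an L4 port-set service for TCP/8443 through the NSX-T Manager REST API. The port number is purely illustrative, the manager address and credentials are placeholders, and the "L4PortSetNSService" element schema is an assumption based on the NSX-T 2.x API; confirm the fields in the API guide.

```python
import requests

NSX_MGR = "https://nsx-mgr.example.com"   # placeholder NSX-T Manager FQDN
AUTH = ("admin", "REPLACE_ME")            # placeholder credentials

# Custom L4 service: TCP destination port 8443 (illustrative value).
service = {
    "display_name": "custom-tcp-8443",
    "nsservice_element": {
        "resource_type": "L4PortSetNSService",
        "l4_protocol": "TCP",
        "destination_ports": ["8443"],
    },
}

resp = requests.post(f"{NSX_MGR}/api/v1/ns-services",
                     auth=AUTH, json=service, verify=False)
resp.raise_for_status()
```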

Action: Defines the enforcement method for this policy rule; available options are listed in Table 5-5.

| Action | Description |
| --- | --- |
| Drop | Silently block the traffic. |
| Allow | Allow the traffic. |
| Reject | Send a notification back to the initiator: RST packets for TCP connections; ICMP unreachable with the “network administratively prohibited” code for UDP, ICMP, and other IP connections. |

Table 5‑5: Firewall Rule Table – “Action” Values

Applied To: Defines the scope of rule publishing. The policy rule can be published to all clusters where the DFW is enabled or restricted to a specific object as listed in Table 5-6. “Applied To” is supported at both the DFW rule level and the section level.

| Object | Description |
| --- | --- |
| Logical Switch | Selecting a logical switch will push the rule down to all VMs/vNICs connected on this logical switch segment. |
| NSGroup | Selecting an NSGroup will push the rule down to all VMs/vNICs defined within the NSGroup. |
| vNIC/Logical Port | Selecting a vNIC (i.e., logical port) will push the rule down to this particular vNIC instance. |

Table 5‑6: Firewall Rule Table - "Applied To" Values

Log: Enable or disable packet logging. When enabled, each DFW-enabled host sends dfwpktlogs to the configured syslog server. This information can be used to build alerting and reporting on dropped or allowed packets.

Stats: Provides packet/byte/session statistics associated with that rule entry.

Direction: This field matches the direction of the packet and is only relevant for stateless ACLs. It can match packets exiting the VM, entering the VM, or both directions.

Comments: This field accepts a free-form string and is useful for storing comments.
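Taken together, the fields above map onto a rule body similar to the following Python dictionary. This is only a field-mapping sketch: the group and service IDs are placeholders, and the exact attribute names follow the NSX-T 2.x firewall API and should be checked against the API guide.

```python
# Sketch of one DFW rule as it would be sent to the NSX-T Manager API,
# annotated with the Table 5-3 field each attribute corresponds to.
rule = {
    "display_name": "Web to App",                                   # Name
    "sources": [{"target_type": "NSGroup",                          # Source
                 "target_id": "SG-WEB-UUID-PLACEHOLDER"}],
    "destinations": [{"target_type": "NSGroup",                     # Destination
                      "target_id": "SG-APP-UUID-PLACEHOLDER"}],
    "services": [{"target_type": "NSService",                       # Service
                  "target_id": "ESB-SERVICE-UUID-PLACEHOLDER"}],
    "action": "ALLOW",                                              # Action
    "applied_tos": [{"target_type": "NSGroup",                      # Applied To
                     "target_id": "SG-WEB-UUID-PLACEHOLDER"}],
    "logged": True,                                                 # Log
    "direction": "IN_OUT",                                          # Direction
    "notes": "3-tier app, web to app tier",                         # Comments
}
# Rule ID and Stats are assigned and maintained by the system;
# they are not supplied on creation.
```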

Examples of Policy Rules for a 3-Tier Application

Figure 5‑10 shows a standard 3-tier application topology used to define NSX-T DFW policy. Three web servers are connected to “Web LS”, two application servers are connected to “App LS”, and two DB servers are connected to “DB LS”. A distributed logical router interconnects the three tiers by providing inter-tier routing. NSX-T DFW has been enabled, so each VM has a dedicated instance of DFW attached to its vNIC/logical port.

Figure 5‑10: 3-Tier Application Network Topology

NSX offers multiple ways to define DFW policy rule configuration. A policy rule can be defined in the default section on top of the default rule, or in a separate section created specifically for the application. The recommended approach is to have a separate DFW table section, with rules within it, for each application.

The following use cases present policy rules based on the different methodologies introduced earlier.

Example 1: Static IP addresses/subnets in security policy rule.

This example shows the use of the network methodology to define a policy rule. The firewall policy configuration is shown in Table 5-7.

| Name | Source | Destination | Service | Action | Applied To |
| --- | --- | --- | --- | --- | --- |
| Any to Web | Any | 172.16.10.0/24 | https | Allow | All |
| Web to App | 172.16.10.0/24 | 172.16.20.0/24 | <Enterprise Service Bus> | Allow | All |
| App to DB | 172.16.20.0/24 | 172.16.30.0/24 | SQL | Allow | All |
| Default | Any | Any | Any | Drop | All |

Table 5‑7: Firewall Rule Table - Example 1

The DFW engine is able to enforce network traffic access control based on the provided information. To use this type of construct, exact IP information is required for the policy rule. This construct is quite static and does not fully leverage the dynamic capabilities of modern cloud systems.

Example 2: Using Logical Switch object in Security Policy rule.

This example uses the infrastructure methodology to define a policy rule. Table 5-8 shows the firewall policy configuration.

| Name | Source | Destination | Service | Action | Applied To |
| --- | --- | --- | --- | --- | --- |
| Any to Web | Any | Web LS | https | Allow | WEB LS |
| Web to App | Web LS | App LS | <Enterprise Service Bus> | Allow | WEB LS, APP LS |
| App to DB | App LS | DB LS | SQL | Allow | APP LS, DB LS |
| Default | Any | Any | Any | Drop | All |

Table 5‑8: Firewall Rule Table - Example 2

Reading this policy rule table is easier for all teams in the organization, from security auditors to architects to operations. Any new VM connected to any of these logical switches will automatically receive the corresponding security posture. For instance, a newly installed web server will be seamlessly protected by the first policy rule with no human intervention, while a VM disconnected from a logical switch will no longer have the security policy applied to it. This type of construct fully leverages the dynamic nature of NSX. It monitors VM connectivity at any given point in time; if a VM is no longer connected to a particular logical switch, any associated security policies are removed.

This policy rule table also uses the “Applied To” option to apply the policy only to relevant objects rather than populating the rules everywhere. In this example, the “Any to Web” rule is applied to the vNICs associated with “Web LS”, while the “Web to App” rule is applied to those associated with “Web LS” and “App LS”. Use of “Applied To” is recommended to define the enforcement point for a given rule for better resource usage.

Example 3: Using NSGroup Object in Security Policy Rule.

This final example looks at the application methodology of policy rule definition. NSGroups in this example are identified in Table 5-9 while the firewall policy configuration is shown in Table 5-10.

| NSGroup Name | NSGroup Definition |
| --- | --- |
| SG-WEB | Static inclusion: Web LS |
| SG-APP | Static inclusion: App LS |
| SG-DB | Static inclusion: DB LS |

Table 5‑9: Firewall Rule Table - Example 3 NSGroups

| Name | Source | Destination | Service | Action | Applied To |
| --- | --- | --- | --- | --- | --- |
| Any to Web | Any | SG-WEB | https | Allow | SG-WEB |
| Web to App | SG-WEB | SG-APP | <Enterprise Service Bus> | Allow | SG-WEB, SG-APP |
| App to DB | SG-APP | SG-DB | SQL | Allow | SG-APP, SG-DB |
| Default | Any | Any | Any | Drop | All |

Table 5‑10: Firewall Rule Table - Example 3 Layout

All of the advantages of example two remain in this case, with the use of NSGroups providing additional flexibility. In this instance, “Applied To” is used with NSGroup objects to define the enforcement point. Using both static and dynamic inclusion, it is possible to define a granular strategy for selecting objects for inclusion in a container. Writing DFW policy rules using NSGroups can dramatically reduce the number of rules needed in the enterprise and provides the most comprehensible security policy configuration.
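Expressed against the NSX-T Manager REST API, Example 3 could be provisioned roughly as follows, creating a dedicated stateful section with its rules in one call; only the first and last rows of Table 5-10 are shown for brevity. The NSGroup and service IDs are placeholders for objects created earlier, and the endpoint, query parameter, and field names are based on the NSX-T 2.x firewall API; verify them against the API guide for the release in use.

```python
import requests

NSX_MGR = "https://nsx-mgr.example.com"   # placeholder NSX-T Manager FQDN
AUTH = ("admin", "REPLACE_ME")            # placeholder credentials
SG_WEB = "SG-WEB-UUID-PLACEHOLDER"        # placeholder NSGroup ID
HTTPS_SVC = "HTTPS-SERVICE-UUID-PLACEHOLDER"

def ref(target_type, target_id):
    """Build the resource reference structure used in rule fields."""
    return {"target_type": target_type, "target_id": target_id}

section = {
    "display_name": "3-Tier-App-1",
    "section_type": "LAYER3",
    "stateful": True,
    "rules": [
        {
            # Omitting "sources" means Any, matching the first row of Table 5-10.
            "display_name": "Any to Web",
            "destinations": [ref("NSGroup", SG_WEB)],
            "services": [ref("NSService", HTTPS_SVC)],
            "action": "ALLOW",
            "applied_tos": [ref("NSGroup", SG_WEB)],
        },
        {
            # No applied_tos: enforced everywhere the DFW runs ("All" in Table 5-10).
            "display_name": "Default",
            "action": "DROP",
        },
    ],
}

resp = requests.post(f"{NSX_MGR}/api/v1/firewall/sections?action=create_with_rules",
                     auth=AUTH, json=section, verify=False)
resp.raise_for_status()
```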

Security policy and IP Discovery

The NSX-T DFW and Edge firewall depend on VM-to-IP discovery, which is used to translate objects to IP addresses before rules are pushed to the data path. This is mainly required when policy is defined using grouped objects. The VM-to-IP table is maintained by the NSX-T Controllers and populated by the IP discovery mechanism. IP discovery is used as the central mechanism to ascertain the IP address of a VM. By default this is done using DHCP and ARP snooping, with VMware Tools available as an additional mechanism on ESXi hosts. Discovered VM-to-IP mappings can be overridden by manual input if needed, and multiple IP addresses are possible on a single vNIC. The IP and MAC addresses learned are added to the VM-to-IP table, which is used internally by NSX-T for SpoofGuard, ARP suppression, and firewall object-to-IP translation.
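When a manual override of the discovered VM-to-IP mapping is required, one option is to set an address binding directly on the workload’s logical port, using the same read-modify-write pattern shown earlier for tagging. The "address_bindings" field name is an assumption based on the NSX-T 2.x logical port schema, and the addresses and port UUID are placeholders; consult the API guide before relying on this.

```python
import requests

NSX_MGR = "https://nsx-mgr.example.com"           # placeholder NSX-T Manager FQDN
AUTH = ("admin", "REPLACE_ME")                     # placeholder credentials
PORT_ID = "11111111-2222-3333-4444-555555555555"   # placeholder logical port UUID

port = requests.get(f"{NSX_MGR}/api/v1/logical-ports/{PORT_ID}",
                    auth=AUTH, verify=False).json()

# Manually pin the IP/MAC pair for this vNIC instead of relying on snooping.
port["address_bindings"] = [{
    "ip_address": "172.16.10.11",          # illustrative addresses only
    "mac_address": "00:50:56:aa:bb:cc",
}]

requests.put(f"{NSX_MGR}/api/v1/logical-ports/{PORT_ID}",
             auth=AUTH, json=port, verify=False).raise_for_status()
```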

5.5 Additional Security Features

NSX-T extends the security solution beyond DFW with additional features to enhance data center security posture on top of micro-segmentation. These features include:

  • Distributed Network Encryption (DNE) - An optional security capability that can selectively encrypt E-W traffic at the network layer when an application does not provide a secure channel. Encryption is enforced at the hypervisor kernel level to ensure data confidentiality, integrity, and authenticity between NSX-T nodes; DNE also provides replay checking. Encryption rules are aligned to existing NSX-T DFW firewall rules, simplifying the alignment of encryption policies to application boundaries. Encryption rules match on the same constructs used in DFW rules - source, destination, and services - with actions to encrypt and check integrity, check integrity only, or pass data with no encryption. NSX-T DNE supports granular, rule-based group key management. DNE uses a symmetric key cryptographic technique - AES-GCM with a 128-bit key - with a configurable key rotation policy, and relies on a VMware-provided DNE key manager appliance for key management. DNE uses the ESP format to encrypt the network packet. DNE encapsulation is performed as the last action in packet processing before leaving the transport node, after overlay encapsulation, and as the first action on the receiving side, followed by overlay decapsulation.
  • SpoofGuard - Provides protection against spoofing with MAC+IP+VLAN bindings, enforced at a per-logical-port level. The SpoofGuard feature requires static or dynamic bindings (e.g., DHCP/ARP snooping) of IP+MAC for enforcement (a configuration sketch follows this list).
  • Switch Security - Provides stateless L2 and L3 security to protect logical switch integrity by filtering out malicious attacks (e.g., denial of service using broadcast/multicast storms) and unauthorized traffic entering the logical switch from VMs. This is accomplished by attaching a switch security profile to a logical switch for enforcement. The switch security profile has options to allow or block bridge protocol data units (BPDUs), DHCP server/client traffic, and non-IP traffic. It also allows rate limiting of broadcast and multicast traffic, both transmitted and received.
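A SpoofGuard profile of the kind described above could be sketched as follows and then attached to a logical switch or logical port through its switching profile list. The "white_list_providers" value and the overall schema are assumptions based on the NSX-T 2.x switching profile API, and the manager address and credentials are placeholders; confirm the payload against the API guide.

```python
import requests

NSX_MGR = "https://nsx-mgr.example.com"   # placeholder NSX-T Manager FQDN
AUTH = ("admin", "REPLACE_ME")            # placeholder credentials

# SpoofGuard profile that enforces the per-port address bindings
# (static, or learned through DHCP/ARP snooping).
profile = {
    "resource_type": "SpoofGuardSwitchingProfile",
    "display_name": "spoofguard-port-bindings",
    "white_list_providers": ["LPORT_BINDINGS"],   # assumed enum value
}

resp = requests.post(f"{NSX_MGR}/api/v1/switching-profiles",
                     auth=AUTH, json=profile, verify=False)
resp.raise_for_status()
```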

5.6 NSX-T Security Deployment Options

The NSX-T security solution is agnostic to overlay deployment mode and management agent selection. No changes are required for policy planning, design, and implementation based on the following criteria:

  • Overlay vs. VLAN-Backed Logical Switch - An NSX-T security solution can be deployed in both overlay-backed and VLAN-backed modes; policy is defined and enforced the same way for VMs attached to either type of logical switch. Please refer to the NSX-T Logical Switching chapter for additional details on switch mode options.
  • vCenter Management of ESXi hosts – The NSX-T DFW does not require vCenter management. Use of vCenter to manage ESXi hosts does not change NSX-T policy configuration or implementation. NSX-T only requires ESXi hosts to be part of transport nodes to enable and implement DFW policies. An ESXi transport node itself can be standalone or managed by vCenter for VM/compute management. In addition, an NSX-T cluster can have ESXi nodes spanning multiple vCenters if required by the compute management design.

5.7 Edge FW

The NSX-T Edge firewall provides essential perimeter firewall protection which can be used in addition to a physical perimeter firewall. The Edge firewall service is part of the NSX-T Edge node in both the bare metal and VM form factors. The Edge firewall is useful in developing PCI zones, multi-tenant environments, or DevOps-style connectivity without forcing inter-tenant or inter-zone traffic onto the physical network. The Edge firewall data path uses the DPDK framework supported on the Edge node to provide high throughput.

5.7.1 Consumption

The NSX-T Edge firewall is instantiated per logical router and is supported at both tier-0 and tier-1. The Edge firewall works independently of the NSX-T DFW from a policy configuration and enforcement perspective. A user can consume the Edge firewall using either the GUI or the REST API framework provided by NSX-T Manager. In the NSX-T Manager UI, the Edge firewall can be configured from the “router -> services -> Edge firewall” page. The Edge firewall configuration is similar to DFW firewall policy; it is defined as a set of individual rules within a section. Like the DFW, Edge firewall rules can use logical objects, tagging, and grouping constructs (e.g., NSGroups) to build policies. Similarly, for L4 services in a rule, it is valid to use predefined NSServices, custom NSServices, predefined service groups, custom service groups, or TCP/UDP protocols with ports. The NSX-T Edge firewall also supports multiple application level gateways (ALGs); the user can select an ALG and its supported protocols by using the “other” setting for type of service. The Edge firewall supports only FTP and TFTP as ALGs. ALGs are only supported in stateful mode; if the section is marked as stateless, the ALGs will not be implemented.

5.7.2 Implementation

The Edge firewall is an optional centralized firewall implemented on NSX-T tier-0 router uplinks and tier-1 router links. It is implemented on the tier-0/tier-1 SR component hosted on the NSX-T Edge. The tier-0 Edge firewall supports stateful firewalling only in active/standby HA mode; it can also be enabled in active/active mode, though it will then operate only in stateless mode. The Edge firewall uses a model similar to the DFW for defining policy, and NSX-T grouping constructs (e.g., NSGroups, IPSets) can be used as well. Edge firewall policy rules are defined within a dedicated section of the firewall table for each tier-0 and tier-1 router.

The NSX-T Edge firewall has data path implications when it is enabled together with the NAT service on the same logical router. If a flow matches both a NAT rule and an Edge firewall rule, the NAT lookup result takes precedence and the firewall is not applied to that flow. If a flow matches only a firewall rule, the firewall lookup result is honored for that flow.

5.7.3 Deployment Scenarios

This section provides two examples for possible deployment and data path implementation.

Edge FW as Perimeter FW at Virtual and Physical Boundary

The tier-0 Edge firewall is used as a perimeter firewall between the physical and virtual domains, mainly for N-S traffic between the virtualized environment and the physical world. In this case, the tier-0 SR component which resides on the Edge node enforces the firewall policy before traffic enters or leaves the NSX-T virtual environment. The E-W traffic continues to leverage the distributed routing and firewalling capabilities which NSX-T natively provides in the hypervisor.

Figure 5‑11: Tier-0 Edge Firewall - Virtual to Physical Boundary

Edge FW as Inter-tenant FW

The tier-1 Edge firewall is used as an inter-tenant firewall within an NSX-T virtual domain. It defines policies between different tenants that reside within an NSX environment. This firewall is enforced for traffic leaving the tier-1 router and uses the tier-1 SR component, which resides on the Edge node, to enforce the firewall policy before sending traffic to the tier-0 router for further processing. The intra-tenant traffic continues to leverage the distributed routing and firewalling capabilities native to NSX-T.

Figure 5‑12: Tier-1 Edge Firewall - Inter-tenant

5.8 Recommendation for Security Deployments

This list provides best practices and recommendations for the NSX-T DFW. These can be used as guidelines while deploying an NSX-T security solution.

  • For individual NSX-T software releases, always refer to release notes, compatibility guides, and recommended configuration maximums.
  • Exclude management components, NSX-T Manager, vCenter, and security tools from the DFW policy to avoid lockout. This can be done by adding those VMs to the exclusion list.
  • Choose the policy methodology and rule model to enable optimum groupings and policies for micro-segmentation.
  • Use NSX-T tagging and grouping constructs to group an application or environment to its natural boundaries. This will enable simpler policy management.
  • Consider the flexibility and simplicity of a policy model for day-2 operations. It should address ever-changing deployment scenarios rather than simply be part of the initial setup.
  • Leverage separate DFW sections to group and manage policies based on the chosen rule model. (e.g., emergency, infrastructure, environment, application, or tenant sections.)
  • Use a whitelist model; create explicit rules for allowed traffic and change the DFW default rule from “allow” to “block”.

6 NSX-T Design Considerations

This section examines the technical details of a typical NSX-T-based enterprise data center design. It looks at the physical infrastructure and requirements and discusses the design considerations for specific components of NSX-T. Central concepts include:

  • Connectivity of management and control plane components: NSX-T Manager and NSX-T Controllers.
  • Design for connecting the compute hosts with both ESXi and KVM hypervisors.
  • Design for the NSX-T Edge and Edge clusters.
  • Organization of compute domains and NSX-T resources.
  • Review of sample deployment scenarios.

6.1 Physical Infrastructure of the Data Center

An important characteristic of NSX-T is its agnostic view of physical device configuration, allowing for great flexibility in adopting a variety of underlay fabrics and topologies. Basic physical network requirements include:

  • IP Connectivity – IP connectivity between all components of NSX-T and the compute hosts. This includes management interfaces on hosts as well as Edge nodes - both bare metal and virtual Edge nodes.
  • Jumbo Frame Support – A minimum MTU of 1700 bytes is required to accommodate the full variety of functions and to future-proof the environment for an expanding Geneve header. As the recommended MTU for the N-VDS is 9000, the underlay network should support at least this value, not counting encapsulation overhead (a worked example of the overhead follows this list).
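The following back-of-the-envelope calculation shows where the 1700-byte minimum comes from. The Geneve option length used here is an assumption (options are variable in size), and the figures assume an IPv4 underlay with no inner VLAN tag.

```python
# Approximate Geneve encapsulation overhead per frame (bytes), IPv4 underlay,
# no inner VLAN tag. The option length is an assumption; options are variable.
inner_ethernet = 14     # encapsulated guest Ethernet header
geneve_header  = 8      # fixed Geneve header
geneve_options = 64     # assumed option space; actual usage varies by feature/release
outer_udp      = 8
outer_ipv4     = 20

overhead = inner_ethernet + geneve_header + geneve_options + outer_udp + outer_ipv4
print(overhead)          # 114 bytes with these assumptions

# A standard 1500-byte guest MTU then yields outer packets of roughly:
print(1500 + overhead)   # ~1614 bytes, which the 1700-byte minimum underlay MTU absorbs

# Raising the guest MTU requires the underlay MTU to grow by the same overhead,
# which is why a 9000-byte N-VDS MTU calls for matching jumbo-frame support in the fabric.
```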

Once those requirements are met, it is possible to deploy NSX:

  • In any type of physical topology – core/aggregation/access, leaf-spine, etc.
  • On any switch from any physical switch vendor, including legacy switches.
  • With any underlying technology. IP connectivity can be achieved over an end-to-end layer 2 network as well as across a fully routed environment.

For an optimal design and operation of NSX-T, well known baseline standards are applicable. These standards include:

  • Device availability (e.g., host, TOR, rack level)
  • TOR bandwidth - both host-to-TOR and TOR uplinks
  • Fault and operational domain consistency (e.g., localized peering of Edge node to northbound network, separation of host compute domains etc.)

This design guide uses the example of a routed leaf-spine architecture. This model is a superset of other network topologies and fabric configurations, so its concepts are also applicable to layer 2 and non-leaf-spine topologies.

Figure 6‑1 displays a typical enterprise design using the routed leaf-spine design for its fabric. A layer 3 fabric is beneficial as it is simple to set up with generic routers and it reduces the span of layer 2 domains to a single rack.

Figure 6‑1: Typical Enterprise Design

A layer 2 fabric would also be a valid option, for which there would be no L2/L3 boundary at the TOR switch.

Multiple compute racks are configured to host compute hypervisors (e.g., ESXi, KVM) for the application VMs. Compute racks typically have the same structure and the same rack connectivity, allowing for cookie-cutter deployment. Compute clusters are placed horizontally between racks to protect against rack failures or loss of connectivity.

Several racks are dedicated to the infrastructure. These racks host:

  • Management elements (e.g., vCenter, NSX-T Managers, OpenStack, vRNI, etc.)
  • NSX-T Controllers
  • Bare metal Edge or Edge node VMs
  • Clustered elements are spread between racks to be resilient to rack failures

The different components involved in NSX-T send different kinds of traffic into the network; these are typically categorized using different VLANs. A hypervisor could send management, storage, and virtual machine traffic leveraging three different VLAN tags. Because this particular physical infrastructure terminates layer 3 at the TOR switch, the span of all VLANs is limited to a single rack. The management VLAN on one rack is not in the same broadcast domain as the management VLAN on a different rack, as they lack L2 connectivity. To simplify configuration, however, the same VLAN ID is typically assigned consistently across racks for each category of traffic. Figure 6‑2 details an example of VLAN and IP subnet assignment across racks.

Figure 6‑2: Typical Layer 3 Design with Example of VLAN/Subnet

Upcoming examples will provide more detailed recommendations on the subnet and VLAN assignment based on the NSX-T component specifics. For smaller NSX deployments, these elements may be combined into a reduced number of racks as detailed in the section Multi-Compute Workload Domain Design Consideration.

6.2 NSX-T Infrastructure Component Connectivity

NSX-T Manager and NSX-T Controller are mandatory NSX-T infrastructure components. Their networking requirement is basic IP connectivity with other NSX-T components over the management network. Both NSX-T Manager and NSX-T Controller are typically deployed on a hypervisor attached to a standard VLAN-backed port group; there is no need for colocation in the same subnet or VLAN. There are no host state dependencies or MTU encapsulation requirements, as these components send only management and control plane traffic over the VLAN.

Figure 6‑3 shows a pair of ESXi hypervisors in the management rack hosting the NSX-T Manager along with two of the three NSX-T Controllers.

Figure 6‑3: ESXi Hypervisor in the Management Rack

The ESXi management hypervisors are configured with a VDS/VSS with a management port group mapped to a management VLAN. The management port group is configured with two uplinks using physical NICs “P1” and “P2” attached to different top of rack switches. The uplink teaming policy has no impact on NSX-T Manager or NSX-T Controller operation, so it can be based on existing VSS/VDS policy.

Figure 6‑4 presents the same infrastructure VMs running on KVM hosts.

Figure 6‑4: KVM Hypervisors in the Management Rack

The KVM management hypervisors are configured with a Linux bridge with two uplinks using physical NICs “P1” and “P2”. The traffic is injected into a management VLAN configured in the physical infrastructure. Either active/active or active/standby is acceptable as the uplink teaming strategy for NSX-T Manager and NSX-T Controllers since both provide redundancy; this example uses the simplest connectivity model with an active/standby configuration.

Additional Considerations for NSX-T Manager and Controllers Deployments

The NSX-T Manager is a standalone VM that is critical to NSX-T operation. An availability mechanism is recommended to protect against the failure of its host hypervisor. Note that such a failure scenario would only impact the NSX management plane; existing logical networks would continue to operate.

For a vSphere-based design, it is recommended to leverage vSphere HA functionality to ensure NSX-T Manager availability. Furthermore, NSX-T Manager should be installed on shared storage. vSphere HA requires shared storage so that VMs can be restarted on another host if the original host fails. A similar mechanism is recommended when NSX-T Manager is deployed in a KVM hypervisor environment.

The NSX-T Controller cluster represents a scale-out distributed system where each of the three NSX-T Controller nodes is assigned a set of roles that define the type of tasks that node can implement. For optimal operation, it is critical to understand the availability requirements of Controllers. Three nodes are required for proper operation and deployment of NSX-T; however, the cluster can operate with reduced capacity in the event of a single node failure. To be fully operational, the cluster requires that a majority of NSX-T Controllers (i.e., two out of three) be available. It is recommended to spread the deployment of the NSX-T Controllers across separate hypervisors to ensure that the failure of a single host does not cause the loss of a majority of the cluster. NSX does not natively enforce this design practice. On a vSphere-based management cluster, deploy the NSX-T Controllers in the same vSphere cluster and leverage the native vSphere Distributed Resource Scheduler (DRS) anti-affinity rules to avoid instantiating more than one NSX-T Controller node on the same ESXi server. For more information on how to create a VM-to-VM anti-affinity rule, please refer to the VMware documentation on VM-to-VM and VM-to-host rules.

Additional considerations apply to Controllers with respect to storage availability and IO consistency. A failure of a datastore should not trigger a loss of Controller majority, and IO access must not be oversubscribed to the point that unpredictable latency pushes a Controller into read-only mode due to lack of write access. For both NSX-T Manager and NSX-T Controller VMs, it is recommended to reserve CPU and memory resources according to their respective requirements.

The same hypervisor – either ESXi or KVM – can host the NSX-T Manager and one NSX-T Controller as long as that hypervisor has sufficient resources. For high-availability, the three NSX-T Controllers must be spread across three different hosts as shown in Figure 6‑5.

Figure 6‑5: NSX Manager and Controllers in Hypervisors

Note that management hypervisors do not have an N-VDS since they are not part of the NSX-T data plane. They only have the hypervisor switch – VSS/VDS on ESXi and Linux bridge on KVM.


6.3 Compute Cluster Design (ESXi/KVM)

This section covers both ESXi and KVM compute hypervisors; discussions and recommendations apply to both types unless otherwise clearly specified.

Compute hypervisors host the application VMs in the data center. In a typical enterprise design, they will carry at least two kinds of traffic on different VLANs – management and overlay. Because overlay traffic is involved, the uplinks are subject to the MTU requirements mentioned earlier.

Compute hosts may carry additional types of infrastructure traffic such as storage, vMotion, and high availability. The ESXi hypervisor defines specific VMkernel interfaces, typically connected to separate VLANs, for this infrastructure traffic. Similarly, the KVM hypervisor requires specific interfaces and VLANs. Details on specific hypervisor requirements and capabilities can be found in documentation from their respective vendors.

A specific note for the KVM compute hypervisor: NSX uses a single IP stack for management and overlay traffic on KVM hosts. Because of this, both the management and overlay interfaces share the same routing table and default gateway. This can be an issue if those two kinds of traffic are sent on different VLANs, as the same default gateway cannot exist on two different VLANs. In this case, it is necessary to introduce more specific static routes for the remote overlay networks pointing to a next-hop gateway specific to the overlay traffic (for example, a static route for each remote TEP subnet via the gateway of the overlay VLAN).

6.3.1 Compute Hypervisor Physical NICs

The NSX-T architecture supports multiple N-VDS instances as well as multiple physical NICs per hypervisor. NSX-T does not have any restriction on coexistence with other VMware switches (e.g., VSS, VDS), but a physical NIC can only belong to a single virtual switch. Since the overlay traffic must leverage the N-VDS, the compute hypervisor must allocate at least one pNIC to the N-VDS. There is no requirement to put non-overlay traffic - including management, vMotion, and storage - on the N-VDS; the management traffic could be on any type of virtual switch, so it could remain on a VSS/VDS on ESXi or the Linux bridge on KVM.

In a typical enterprise design, a compute hypervisor has two pNICs. If each pNIC is dedicated to a different virtual switch, there will be no redundancy. This design for compute hypervisors with two pNICs mandates that both uplinks are assigned to the N-VDS. In this instance, the management traffic will be carried on those uplinks along with the overlay traffic. If the hypervisor is leveraging additional interfaces for infrastructure traffic – including vMotion or ESXi HA – those interfaces and their respective VLANs must also utilize the N-VDS and its uplinks.

In an environment where the compute node is designed with four pNICs, there is flexibility to allocate two pNICs to the N-VDS and the other two to non-overlay traffic. Such a design is out of scope for this example.

When installing NSX on an ESXi hypervisor, it is typical to start from the existing virtual switch. After creating the N-VDS, management interfaces and pNICs must be migrated to this N-VDS; a summary of this procedure is shown in Appendix 1. Note that the N-VDS traffic engineering capabilities may not always match those of the original virtual switch, as some capabilities are not applicable to non-ESXi hypervisors.

In NSX-T 2.0, the management interface cannot be located on the N-VDS for a KVM compute hypervisor. For examples involving KVM, this design guide uses a third interface dedicated to management traffic.

This design guide targets typical enterprise deployments with compute hypervisors configured as follows:

  • 2 x 10/40 Gbps NICs
  • All host traffic (e.g., overlay, management, storage, vMotion) shares the common NICs
  • Each type of host traffic has dedicated IP subnets and VLANs

6.3.2 ESXi-Based Compute Hypervisor

The teaming mode offers a choice in the availability and traffic load-sharing design. The N-VDS offers two types of teaming mode for the ESXi hypervisor – failover order and load balance source. There can only be one teaming mode per N-VDS. This section shows only a two-pNIC design; however, the base design principles remain the same for more than two pNICs.

6.3.2.1 Failover Order Teaming Mode

 

Figure 6‑6: ESXi Compute Rack Failover Order Teaming

In Figure 6‑6, a single host switch is used with a 2x10/40 Gbps pNIC design. This host switch manages all traffic – overlay, management, storage, vMotion, etc. Physical NICs “P1” and “P2” are attached to different top of rack switches. The teaming option selected is failover order (active/standby); “Uplink1” is active while “Uplink2” is standby. As shown in the logical switching section, host traffic is carried on the active uplink “Uplink1”, while “Uplink2” is purely a backup in the case of a port or switch failure. This teaming policy provides a deterministic and simple design for traffic management.

The top-of-rack switches are configured with a first hop redundancy protocol (e.g. HSRP, VRRP) providing an active default gateway for all the VLANs on “ToR-Left”. The VMs are attached to logical switches defined on the N-VDS, with the default gateway set to the logical interface of the distributed tier-1 logical router instance.

6.3.2.2 Load Balance Source Teaming Mode 

 

Figure 6‑7: ESXi Compute Rack Source Port Teaming

Similar to the failover order teaming example, Figure 6‑7 shows a 2x10G pNIC design where a single N-VDS is used while maintaining redundancy. As before, this host switch manages all traffic while physical NICs “P1” and “P2” are attached to different top of rack switches. In this design, the teaming option selected is load balance source. With this policy, both uplinks are potentially utilized based on the hash value generated from the source MAC address. Both infrastructure and guest VM traffic benefit from this policy, allowing the use of all available uplinks on the host; this is a typical design choice in existing deployments. A recommended design change compared to the failover order teaming policy concerns the distribution of first hop redundancy protocol (FHRP) gateways. Since all uplinks are in use, the active FHRP gateways can be split across the ToRs to better distribute the different types of traffic, helping reduce traffic across the inter-switch link. As the teaming option does not control which link will be utilized for a given VMkernel interface, there will still be some inter-switch link traffic; splitting the FHRP distribution helps reduce the probability of congestion. For the overlay traffic, both links will be used.

The ToR switches are configured with an FHRP, providing an active default gateway for storage and vMotion traffic on “ToR-Left”, management and overlay traffic on “ToR-Right”. The VMs are attached to logical switches defined on the N-VDS, with the default gateway set to the logical interface of the distributed tier-1 logical router instance.

An alternate teaming mode supporting LAG would require the ESXi hosts to be connected to separate ToRs forming a single logical link. This would require multi-chassis link aggregation on the ToRs and would be vendor specific. This mode is not recommended, as it requires vendor-specific implementations and support coordination, offers limited feature support, and can suffer from troubleshooting complexity.


6.3.3 KVM Compute Hypervisor Design

Figure 6‑8: KVM Compute Rack Failover Teaming

In Figure 6‑8, the KVM compute hypervisor has physical NIC “P1” dedicated for management.

Half of the KVM hypervisors have their “P1” NICs connected to “ToR-Left” and the other half to “ToR-Right”; this offers basic load balancing of management traffic across the ToRs. It does not provide per-host redundancy for KVM management; this is an interim design until management traffic is consolidated on the uplinks of the N-VDS. Redundancy is possible but would require an additional pNIC on the host. Alternatively, the KVM host could be configured with LAG; this would require connecting the uplinks to separate ToRs, mandating support for vendor-specific multi-chassis link aggregation.

Key design points include:

  • The KVM compute hypervisor is configured with “N-VDS 1”. Its VMs are attached to the “NSX-managed” NSX-T bridge. NSX-T Manager then adds the VM “VIF-attachment” to the logical switch.
  •  “N-VDS 1” is configured with two uplinks using physical NICs “P2” and “P3”.
  • P2 is attached to “ToR-Left” and “P3” is attached to “ToR-Right”.
  • All the infrastructure interfaces except management are configured on the N-VDS in their respective VLANs.
  • For ease of operation, “N-VDS 1” leverages an active/standby teaming policy for its uplinks, with “Uplink1” active and “Uplink2” standby.
  • All guest VMs are connected through “P2” to “ToR-Left”.
  • The top of rack switches are configured with an FHRP providing an active default gateway for all the VLANs on “ToR-Left”.
 

6.4 Edge Node and Services Design

Edge nodes provide a pool of capacity for running centralized services in NSX-T. They connect to three different types of IP routed networks for specific purposes:

  • Management – Accessing and controlling the Edge node
  • Overlay -  Creating tunnels with peer transport nodes
  • External - Peering with the physical networking infrastructure to provide connectivity between the NSX-T virtual components and the external network

Edge nodes are available in two form factors – VM and bare metal server. While both form factors offer the same functionality, their physical infrastructure connectivity is quite different. The following sections examine their specific requirements.

Edge nodes are always active in the context of connectivity and control plane. They host tier-0 and tier-1 routing services, installed in either active/active or active/standby mode. Additionally, if a tier-0 or tier-1 router enables stateful services (e.g., NAT, load balancer, firewall) it can only be deployed in active/standby mode. The status of active or standby mode is within the context of data plane forwarding rather than related to the Edge node itself. The Edge node connectivity options discussed below are independent of type of services that a given node runs.

6.4.1 Bare Metal Edge Design

The bare metal Edge node is a dedicated physical server that runs a special version of the NSX-T Edge software. The current implementation of DPDK on the Edge node defines two types of NICs on the server:

  • Fast Path – There are up to four NICs dedicated to the data path leveraging DPDK for high performance. The overlay and external networks run on these NICs.
  • Management – The management network cannot be run on a fast path interface; it requires a dedicated NIC on the bare metal server. This does not need to be a 10Gbps NIC; the use of the on-board 1Gbps NIC is recommended when available.

The bare metal Edge node requires a NIC supporting DPDK. VMware maintains a compatibility list of supported NICs from various vendors.

This design guide covers a common configuration using the minimum number of NICs on the Edge nodes. For the bare metal Edge option, the typical enterprise design will feature two Edge nodes with three interfaces each:

  • 1x1Gbps NIC dedicated to management
  • 2 pNICs for NSX-T data plane traffic, shared between the overlay and external networks. Supported pNIC speeds and vendor dependencies are covered in the compatibility list referenced above.
6.4.1.1 Physical Infrastructure

Figure 6‑9 shows an Edge rack architecture for the simple enterprise design.

Figure 6‑9: Typical Enterprise Bare Metal Edge Node Rack

The bare metal Edges “EN1” and “EN2” are each configured with a management interface attached to “P1” and an “N-VDS 1” with a single uplink comprised of pNICs “P2” and “P3” configured as a LAG. The interfaces of an individual bare metal Edge communicate only with a default gateway or a routing peer on the directly connected top of rack switch. As a result, the uplink-connected VLANs are local to a given ToR and are not extended on the inter-switch link between “ToR-Left” and “ToR-Right”. This design utilizes a straight-through LAG to a single ToR switch or access device, offering the best possible traffic distribution across the two pNICs dedicated to NSX-T data plane traffic. The VLAN IDs used for the management and overlay interfaces can be unique or common between the two ToRs since they have only local significance; the subnets for those interfaces, however, are unique. If a common subnet is used for management and overlay, those VLANs must be carried between the ToRs and routed northbound to cover the case where all uplinks of a given ToR fail. The VLAN ID and subnet used for the external peering connectivity are unique on each ToR, following the common best practice of localized VLANs for routing adjacencies.

Routing over a straight-through LAG is simple and a supported choice. This should not be confused with a typical LAG topology that spans multiple TOR switches. A particular Edge node shares its fate with the TOR switch to which it connects, creating a single point of failure. In the design best practices, multiple Edge nodes are present so that the failure of a single node resulting from a TOR failure is not a high impact event. This is the reason for the recommendation to use multiple TOR switches and multiple Edge nodes with distinct connectivity.

This design leverages an unconventional dual attachment of a bare metal Edge to a single top of rack switch. The rationale is based on the best strategy for even traffic distribution and overlay redundancy.

6.4.1.2 Peering with Physical Infrastructure Routers

In the typical enterprise design, the two bare metal Edge nodes in Figure 6‑9 are assigned to a tier-0 router. This tier-0 router peers with the physical infrastructure using eBGP. Two adjacencies have been created, one with each of the “ToR-Left” and “ToR-Right” routers. Figure 6‑10 represents the logical view of this network.

Figure 6‑10: Typical Enterprise Bare Metal Edge Node Logical View with Overlay/External Traffic

From the perspective of the physical networking devices, the tier-0 router looks like a single logical router; in practice, the adjacency to “Router1” is hosted on Edge node “EN1”, while “EN2” implements the peering to “Router2”.

Those adjacencies are protected with BFD, allowing for quick failover should a router or an uplink fail. See the specific recommendation on graceful restart and BFD interaction based on type of services – ECMP or stateful services – enabled in the Edge node and type of physical routers supported.

6.4.1.3 Design Considerations for Service Components

Figure 6‑11 represents the physical view of the same network, detailing the physical instantiation of components on the Edge nodes and the traffic flows between them.

Figure 6‑11: Typical Enterprise Bare Metal Edge Node Physical View with Management/Overlay/External Traffic

This services deployment leverages the functionality described in the high availability section of the Edge node in the logical routing chapter. Figure 6‑11 depicts various services deployed in a distributed manner across the bare metal Edge nodes. Tier-0, represented by the green node, is deployed in active/active mode, advertising internal routes while receiving external routes from the two physical routers, forming ECMP paths. In this configuration, traffic can enter and leave the NSX-T environment through either Edge node. The logical view also introduces multiple tenant logical routing services, shown as orange and blue tier-1 routers. The orange tier-1 router is running a centralized service (e.g., NAT), so it has an SR component running on the Edge nodes; the active SR component is on “EN1” with the standby on “EN2”. The blue tier-1 router is entirely distributed as it does not have an SR component instantiated on the Edge nodes.

6.4.2 Edge Node VM

An Edge node can run as a virtual machine on an ESXi hypervisor. This form factor is derived from the same software code base as the bare metal Edge. An N-VDS switch is embedded inside the Edge node VM with four fast path NICs and one management vNIC.  The typical enterprise design with two Edge node VMs will leverage 4 vNICs:

  • One vNIC dedicated to management traffic
  • One vNIC dedicated to overlay traffic
  • Two vNICs dedicated to external traffic

There can only be one teaming policy per N-VDS. To develop proper connectivity, it may be necessary to have more than one N-VDS per Edge node. Since an Edge node runs on ESXi, it connects to a VDS, providing flexibility in assigning a variety of teaming policies. As a result, each NIC is mapped to a dedicated port group in the ESXi hypervisor, offering maximum flexibility in the assignment of the different kinds of traffic to the host’s two physical NICs.

6.4.2.1 Physical Connectivity

The physical connectivity of the ESXi hypervisor hosting the VM Edge nodes is similar to that for compute hypervisors: two 10Gbps uplinks, each connected to a different TOR switch. Traffic allocation over the physical uplinks depends on the specific configuration of port groups. Figure 6‑12 provides a summary of this configuration.


Figure 6‑12: 2x10Gbps Host with Edge Node VM

Traffic profiles for both Edge node VMs “EN1” and “EN2” are configured as follows:

  • Management: “vNIC1” is the management interface for the Edge VM. It is connected to port group “Mgt-PG” with a failover order teaming policy specifying “P1” as the active uplink and “P2” as standby.
  • Overlay: “vNIC2” is the overlay interface, connected to port group “Transport-PG” with a failover order teaming policy specifying “P1” as active and “P2” as standby. The TEP for the Edge node is created on an “N-VDS 1” that has “vNIC2” as its unique uplink.
  • External: This configuration leverages the best practice of simplifying peering connectivity. The VLANs used for peering are localized to each TOR switch, eliminating the spanning of VLANs (i.e., no STP looped topology) and creating a one-to-one relationship with the routing adjacency to the Edge node VM. It is important to ensure that traffic destined for a particular TOR switch exits the hypervisor on the appropriate uplink directly connected to that TOR. For this purpose, the design leverages two different port groups:
    • “Ext1-PG” – “P1” in VLAN “External1-VLAN” as its unique active pNIC.
    • “Ext2-PG” – “P2” in VLAN “External2-VLAN” as its unique active pNIC.

    The Edge node VM will have two N-VDS:

    • “N-VDS 2” with “vNIC3” as unique uplink
    • “N-VDS 3” with “vNIC4” as unique uplink.

This configuration ensures that Edge VM traffic sent on “N-VDS 2” can only exit the hypervisor on pNIC “P1” and will be tagged with an “External1-VLAN” tag. Similarly, “N-VDS 3” can only use “P2” and will receive an “External2-VLAN” tag.

6.4.2.2 Peering with Physical Infrastructure Routers

Overlay and management traffic leverage default gateways in their respective subnets that are active on “Router1” on the top of rack switch “ToR-Left”. Those default gateways are implemented leveraging HSRP or VRRP between the top-of-rack switches, where “ToR-Left” is configured as active. This ensures that no overlay or management traffic will cross the inter-switch link under stable conditions while providing the maximum possible bandwidth between the hosts in the same rack.

For external traffic, Figure 6‑13 presents a logical view detailing proper setup.

Figure 6‑13: Typical Enterprise Edge Node VM Logical View with Overlay/External Traffic

From a logical standpoint, four eBGP adjacencies are established between a tier-0 logical router and two physical routers – “Router1” and “Router2” – running on the TOR switches. The tier-0 router SR components responsible for peering with the physical infrastructure are located on the two Edge node VMs “EN1” and “EN2”. Two eBGP adjacencies are set up on each Edge Node VM:

  • Adjacency to “Router1” via a tier-0 interface attached to “N-VDS 2”
  • Adjacency to “Router2” via a tier-0 interface attached to “N-VDS 3”

These adjacencies are protected with BFD, allowing for quick failover should a router or an uplink fail. See specific recommendations on graceful restart and BFD interaction based on type of services (e.g., ECMP or stateful services) enabled in the Edge node and specific physical routers supported.
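For reference, one of these adjacencies could be sketched as an API call against the tier-0 logical router as follows. The router UUID, addresses, and AS number are placeholders, and the endpoint and field names ("enable_bfd" in particular) are assumptions based on the NSX-T 2.x routing API, so confirm them against the API guide; BGP itself must already be enabled on the tier-0 router.

```python
import requests

NSX_MGR = "https://nsx-mgr.example.com"            # placeholder NSX-T Manager FQDN
AUTH = ("admin", "REPLACE_ME")                      # placeholder credentials
TIER0_ID = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"   # placeholder tier-0 router UUID

# eBGP neighbor for the adjacency to "Router1", protected by BFD.
neighbor = {
    "display_name": "EN1-to-Router1",
    "neighbor_address": "192.168.240.1",   # illustrative peering address
    "remote_as": 65001,                    # illustrative physical-network ASN
    "enable_bfd": True,
}

resp = requests.post(
    f"{NSX_MGR}/api/v1/logical-routers/{TIER0_ID}/routing/bgp/neighbors",
    auth=AUTH, json=neighbor, verify=False)
resp.raise_for_status()
```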

6.4.2.3 Component Location and Traffic Path

Both overlay and management traffic use only pNIC “P1” on the hypervisors hosting the Edge node VMs and top of rack switch “ToR-Left”. Figure 6‑14 traces the traffic path.

Figure 6‑14: Typical Enterprise Edge Node VM Physical View with Management/Overlay Traffic

Because tier-0 is not configured with a stateful service (e.g., NAT or Edge firewall), it has been deployed in active/active mode and advertises its internal routes identically to the two physical routers. This allows for ECMP load balancing of the external traffic as represented in Figure 6‑15.

Figure 6‑15: Typical Enterprise Edge Node VM Physical View with External Traffic


6.4.3 Edge Cluster

Edge cluster functionality allows the grouping of up to eight Edge nodes in a single Edge cluster. The grouping of Edge nodes offers the benefits of high availability and scale-out performance for the Edge node services. Additionally, multiple Edge clusters can be deployed within a single NSX-T Manager, allowing for the creation of pools of capacity that can be dedicated to specific services (e.g., NAT at tier-0 vs. NAT at tier-1).

Figure 6‑16: One Edge Cluster with 2 Edge Nodes

The bare metal Edge nodes “EN1” and “EN2” in Figure 6‑16 are in a single Edge cluster. The active/active “Tier0-SR”, shown in green, has uplinks on both “EN1” and “EN2” to provide scale and high availability. The active/standby “Tier1-SR”, shown in orange, is placed automatically on both “EN1” and “EN2”. The cloud network administrator can manually decide which Edge node hosts the active “Tier1-SR”; alternatively, NSX-T Manager can automatically distribute the different active “Tier1-SR” instances across “EN1” and “EN2”. Note that this automatic placement of the active tier-1 SR is not based on Edge node load, but rather on the number of tier-1 SRs already installed.

Within a single Edge cluster, all Edge nodes must be the same type – either bare metal or VM. Edge node VMs of different sizes can be mixed in the same Edge cluster, as can bare metal Edge nodes of different performance levels. For a given deployment of tier-0 or tier-1, the services must be deployed in the same cluster.

A mixture of different sizes/performance levels within the same Edge cluster can have the following effects:

  • With two Edge nodes hosting a tier-0 configured in active/active mode, traffic will be spread evenly. If one Edge node is of lower capacity or performance, half of the traffic may see reduced performance while the other Edge node has excess capacity.
  • For two Edge nodes hosting a tier-0 or tier-1 configured in active/standby mode, only one Edge node is processing the entire traffic load. If this Edge node fails, the second Edge node will become active but may not be able to meet production requirements, leading to slowness or dropped connections.

An additional design consideration applies to Edge clusters with more than two Edge nodes. Figure 6‑17 shows connectivity of Edge nodes where four Edge nodes belong to single Edge cluster. In this diagram, two Edge nodes are connected to “ToR-Left” and the other two are connected to “ToR-Right”.

Figure 6‑17: One Edge Cluster with 4 Edge Nodes

As part of designating Edge node services, both the role (i.e., active or standby) and the location must be explicitly defined, and for a given tier-0 or tier-1 deployment the services must reside in the same Edge cluster. Without this specificity, the two Edge nodes chosen could be “EN1” and “EN2”, which would result in both the active and standby tiers being unreachable in the event of a “ToR-Left” failure. It is highly recommended to deploy two separate Edge clusters when the number of Edge nodes is greater than two, as shown in Figure 6‑18.
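
As an illustration of explicitly choosing the Edge cluster that hosts a tier-1 SR, the sketch below creates a tier-1 logical router bound to a specific Edge cluster through its edge_cluster_id. The payload fields reflect the author's understanding of the NSX-T 2.x management plane API; the manager address, credentials, and cluster UUID are placeholders to be replaced with values from the target environment.

# Sketch: create a tier-1 logical router pinned to a chosen Edge cluster.
# Field names follow the NSX-T 2.x LogicalRouter schema as understood by the author;
# all values are placeholders.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "changeme")
EDGE_CLUSTER_ID = "<edge-cluster-uuid>"    # e.g., the UUID of "Edge Cluster2"

tier1 = {
    "resource_type": "LogicalRouter",
    "display_name": "tier1-tenant-a",
    "router_type": "TIER1",
    "edge_cluster_id": EDGE_CLUSTER_ID,    # explicit placement of the tier-1 SR
}

resp = requests.post(f"{NSX_MGR}/api/v1/logical-routers", json=tier1, auth=AUTH, verify=False)
resp.raise_for_status()
print("Created tier-1 logical router:", resp.json().get("id"))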

 

Figure 6‑18: Two Edge Clusters with 2 Edge Nodes Each

This configuration deploys the tiers in two Edge clusters – tier-0 on “Edge Cluster1” and tier-1 on “Edge Cluster2” – allowing maximum availability under a failure condition.

6.5 Multi-Compute Workload Domain Design Consideration

NSX-T enables an operational model that supports compute domain diversity, allowing multiple vSphere domains to operate alongside a KVM-based environment. NSX-T also supports PaaS compute (e.g., Pivotal Cloud Foundry, Red Hat OpenShift) as well as cloud-based workload domains. This design guide only covers ESXi- and KVM-based compute domains; container-based workloads require extensive treatment of environmental specifics and will be covered in a dedicated design guide. Figure 6‑19 illustrates the capability of NSX-T to support diverse compute workload domains.

Figure 6‑19: Single Architecture for Heterogeneous Compute and Cloud Native Application Framework

1- Subject to future availability and market requirements

Important factors for consideration include how best to design these workload domains as well as how the capabilities and limitations of each component influence the arrangement of NSX-T resources. Designing multi-domain compute requires considerations of the following key factors:

  • Type of Workloads
    • Enterprise applications, QA, DevOps
    • Regulation and compliance
    • Scale and performance
    • Security
  • Compute Domain Capability
    • Underlying compute management and hypervisor capability
    • Inventory of objects and attributes controls
    • Lifecycle management
    • Ecosystem support – applications, storage, and knowledge
    • Networking capability of each hypervisor and domain
  • Availability and Agility
    • Cross-domain mobility (e.g., cross-VC)
    • Hybrid connectivity
  • Scale and Capacity
    • Compute hypervisor scale
    • Application performance requiring services such as NAT or load balancer
    • Bandwidth requirements, either for the compute estate as a whole or per compute domain

NSX-T provides modularity, allowing the design to scale based on requirements. Gathering requirements is an important part of sizing and cluster design and must identify the critical criteria from the above set of factors.

Design considerations for enabling NSX-T vary with environmental specifics: single domains to multiple domains; a few hosts to hundreds; scaling from basic deployments to compute domain maximums. Regardless of deployment size, there are a few baseline characteristics of the NSX-T platform that need to be understood and are applicable to any deployment model.

6.5.1 Common Deployment Consideration with NSX-T Components

Common deployment considerations include:

  • NSX-T management components require only VLANs and IP connectivity; they can co-exist with any hypervisor supported in a given release.
  • NSX-T Controller operation is independent of vSphere. Controllers can run on any supported hypervisor or cluster as long as they have consistent connectivity and latency to the NSX-T domain.
  • For predictable operational consistency, all management, controller, and Edge node VM elements must have their resources reserved. This includes vCenter, NSX-T Manager, NSX-T Controllers, and Edge node VMs.
  • An N-VDS can coexist with another N-VDS or a VDS; however, they cannot share physical interfaces.
  • A given N-VDS can have only one teaming policy.
  • An Edge node VM has an embedded N-VDS which encapsulates overlay traffic for the guest VMs. It does not require the hypervisor to be prepared for the NSX-T overlay network; the only requirements are a VLAN and proper MTU. This allows the flexibility to deploy the Edge node VM in either a dedicated or a shared cluster.
  • For high availability:
    • Three controllers must be on different hypervisors
    • Edge node VMs must be on different hypervisors to avoid single point of failure
  • Understanding of the design guidance for Edge node connectivity and availability, the dependencies of deployed services (ECMP and/or stateful), and the related Edge clustering choices; a connectivity-check sketch follows this list.
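
For operational validation of the considerations above, the state of each transport node can be queried through the NSX-T Manager REST API. The following Python sketch is illustrative only: the /state sub-resource and its response fields reflect the author's understanding of the NSX-T 2.x API, and the manager address and credentials are placeholders.

# Sketch: report the state of every transport node known to the NSX-T Manager.
# The /state sub-resource and its fields are assumptions based on the NSX-T 2.x API;
# the manager address and credentials are placeholders.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "changeme")

nodes = requests.get(f"{NSX_MGR}/api/v1/transport-nodes", auth=AUTH, verify=False).json()
for node in nodes.get("results", []):
    state = requests.get(f"{NSX_MGR}/api/v1/transport-nodes/{node['id']}/state",
                         auth=AUTH, verify=False).json()
    print(f"{node.get('display_name')}: {state.get('state', 'unknown')}")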

Deployment models of NSX-T components depend on the following criteria:

  • Multi-domain compute deployment models
  • Common platform deployment considerations
  • Type of hypervisor used to support management and Edge components
  • Optimization of infrastructure footprint – shared vs. dedicated resources
  • Services scale, performance, and availability

The following subsections cover two arrangements of components applicable to these criteria. The first design model offers collapsed management and Edge resources. The second covers a typical enterprise-scale design model with dedicated management and Edge resources. These design models offer insight into the considerations and value of each approach. They do not preclude the use of other models (e.g., single cluster or dedicated purpose-built) designed to address specific use cases.

6.5.2 Collapsed Management and Edge Resources Design

This design assumes multiple compute clusters or domains serving independent workloads. The first example offers an ESXi-only hypervisor domain, while the second presents a multi-vendor environment with both ESXi and KVM hypervisors. Each type of compute could be in a separate domain with a dedicated NSX-T domain; however, this example presents a single common NSX-T domain.

Both compute domains are managed via a common cluster hosting NSX-T management and Edge resources. Alternatively, a dedicated Edge cluster could be used to independently support the compute domains. The common rationales for collapsing the management and Edge resources are as follows:

  • Edge services are deterministic and CPU-centric, requiring careful resource reservation. Mixing Edge components with management components works well because the management workload is predictable compared to compute workloads.
  • Reduction in the number of hosts required, optimizing the cost footprint.
  • Potential for shared management resources co-existing across the NSX-V and NSX-T domains. Additional considerations, such as excluding NSX-T components from DFW policy and SLA requirements, also apply.

Figure 6‑20: Collapsed Management and Edge Resources Design – ESXi Only

The first deployment model, shown in Figure 6‑20, consists of multiple independent vCenter-managed compute domains. Multiple vCenters can register with the NSX-T Manager; these vCenter instances are not restricted to a common version and can offer capabilities not tied to NSX-T. NSX-T provides consistent logical networking and security enforcement independent of the vCenter compute domain. Connectivity is delivered by NSX-T through the independent switches it manages on each hypervisor, enabling connectivity between workloads in distinct vCenter compute domains.

 

Figure 6‑21: Collapsed Management and Edge Resources Design – ESXi + KVM

The second deployment, pictured in Figure 6‑21, shows independent hypervisor compute domains: the first is ESXi-based, while the others are based on KVM hypervisors. As before, each domain is overseen by NSX-T with common logical networking and security enforcement.

6.5.2.1 Collapsed Management and Edge Cluster

Both of the above designs use the minimum recommended three ESXi servers for the management cluster; however, traditional vSphere best practice is to use four ESXi hosts to allow for host maintenance while maintaining consistent capacity. The following components are shared in the clusters:

  • Management – vCenter and NSX-T Manager with vSphere HA enabled to protect NSX-T Manager from host failure and provide resource reservation (a hedged reservation sketch follows this list).
  • Controller – NSX-T Controllers on separate hosts with an anti-affinity setting and resource reservation.
  • Services – The Edge cluster is shown with four Edge node VMs but does not describe the specific services present. While this design assumes active/standby Edge nodes to support the Edge firewall and NAT services, it does not preclude other combinations of services.
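
Because the design depends on these reservations being in place, the following Python sketch shows one way they might be applied to an Edge node VM with pyVmomi. The vCenter address, credentials, VM name, and reservation values are all placeholders, and the snippet is a sketch under those assumptions rather than a hardened implementation.

# Sketch: apply CPU/memory reservations to an Edge node VM using pyVmomi.
# vCenter address, credentials, VM name, and reservation values are placeholders.
from pyVim.connect import SmartConnectNoSSL, Disconnect
from pyVmomi import vim

si = SmartConnectNoSSL(host="vcenter.example.com",
                       user="administrator@vsphere.local", pwd="changeme")
content = si.RetrieveContent()

# Locate the Edge node VM by name (placeholder name "edge-node-01").
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
edge_vm = next(vm for vm in view.view if vm.name == "edge-node-01")

# Reserve CPU (MHz) and memory (MB) so the Edge VM is never starved of resources.
spec = vim.vm.ConfigSpec()
spec.cpuAllocation = vim.ResourceAllocationInfo(reservation=4000)
spec.memoryAllocation = vim.ResourceAllocationInfo(reservation=8192)
edge_vm.ReconfigVM_Task(spec=spec)

Disconnect(si)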

Where firewall or NAT services are not required, active/active (ECMP) services that support higher bandwidth are typically deployed. A minimum of two Edge node VMs, each on a separate ESXi host, allows bandwidth to scale to 20 Gbps. Further expansion is possible by adding Edge node VMs, scaling up to a total of eight Edge VMs. For further details, refer to the Edge cluster design considerations. For multi-10Gbps traffic requirements or line-rate stateful services, consider adding a dedicated bare metal Edge cluster for specific service workloads. Alternatively, the design can start with distributed firewall micro-segmentation and eventually move to overlay and other Edge services.

Compute node connectivity for ESXi and KVM is discussed in the Compute Cluster Design section. Figure 6‑22 describes the connectivity for shared management and Edge node.

 

Figure 6‑22: ESXi Connectivity for Shared Management and Edge Node

This design assumes ESXi hosts have two physical NICs, attached as follows:

  • Port “P1” is connected to “ToR-Left” and port “P2” to “ToR-Right”.
  • The Edge node VM has three N-VDS switches. Since each N-VDS can have only one teaming policy, each N-VDS uses a teaming policy based on the type of traffic it carries (see the uplink profile sketch after this list). “N-VDS1” carries the overlay traffic in active/standby teaming mode. External eBGP connectivity uses a dedicated N-VDS per uplink so that traffic can be steered onto a specific physical port. The Edge node VM configuration is detailed in the VM Form Factor section.
  • VDS is configured with pNICs “P1” and “P2”. Related portgroup assignments include:
    •  “Mgt PG” has “P1” active and “P2” standby. Associated with this portgroup are the management IP address, management and controller elements, and Edge node management vNIC.
    •  “vMotion PG” has “P1” active and “P2” standby. The ESXi VMkernel vMotion IP address is associated with this portgroup.
    • “Storage PG” has “P1” active and “P2” standby. The ESXi VMkernel storage IP address is associated with this portgroup.
    • “Transport PG” has “P1” active and “P2” standby. The Edge node overlay vNIC “N-VDS 1” is connected to this portgroup.
    • “Ext1 PG” contains “P1” only. The Edge node external vNIC “N-VDS 2” is connected to this portgroup.
    • “Ext2 PG” contains “P2” only. The Edge node external vNIC “N-VDS 3” is connected to this portgroup.
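
To tie this back to the single-teaming-policy rule, the sketch below creates an uplink profile expressing the same “P1 active / P2 standby” failover-order behavior for an N-VDS through the NSX-T Manager API. The endpoint and field names reflect the author's understanding of the NSX-T 2.x UplinkHostSwitchProfile schema; the manager address, credentials, and uplink names are placeholders.

# Sketch: uplink profile with a single failover-order teaming policy, mirroring the
# "active on one pNIC, standby on the other" behavior described above. Values are placeholders.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "changeme")

uplink_profile = {
    "resource_type": "UplinkHostSwitchProfile",
    "display_name": "overlay-uplink-profile",
    "mtu": 1700,                               # sized to carry overlay encapsulation overhead
    "teaming": {
        "policy": "FAILOVER_ORDER",            # a given N-VDS carries exactly one teaming policy
        "active_list":  [{"uplink_name": "uplink-1", "uplink_type": "PNIC"}],
        "standby_list": [{"uplink_name": "uplink-2", "uplink_type": "PNIC"}],
    },
}

resp = requests.post(f"{NSX_MGR}/api/v1/host-switch-profiles",
                     json=uplink_profile, auth=AUTH, verify=False)
resp.raise_for_status()
print("Created uplink profile:", resp.json().get("id"))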

6.5.3 Dedicated Management and Edge Resources Design

This example presents an enterprise-scale design with dedicated compute, management, and Edge clusters. The compute cluster design examines both ESXi-only and KVM-only options, each of which contributes to requirements for the associated management and Edge clusters. The initial discussion focuses on recommendations for separate management, compute, and Edge clusters to cover the following design requirements:

  • Diversity of hypervisor and requirements as discussed under common deployment considerations
  • Multiple vCenters managing distinct sets of virtualized workloads
  • Compute workload characteristics and variability
  • Higher degree of on-demand compute
  • Compliance standards (e.g., PCI, HIPAA, government)
  • Automation flexibility and controls
  • Multiple vCenters managing production, development, and QA segments
  • Migration of workloads across multiple vCenters
  • Multi-10G traffic patterns for both E-W and N-S traffic
  • Multi-tenancy for scale, services, and separation

6.5.3.1 Enterprise ESXi Based Design

The enterprise ESXi hypervisor-based design may consist of multiple vSphere domains and usually includes a dedicated management cluster. The NSX-T components that reside in the management cluster are the NSX-T Managers and Controllers. The requirements for those components are the same as in the collapsed management/Edge design in the previous section but are repeated here to emphasize that the management cluster is ESXi based. Compute node connectivity for ESXi and KVM is discussed in the section Compute Cluster Design. For the management cluster, the design presented in Figure 6‑23 recommends a minimum of three ESXi hosts; standard vSphere best practice suggests using four ESXi hosts to allow for host maintenance while maintaining consistent capacity. The following components are shared in the cluster:

  • Management – vCenter and NSX-T Manager with vSphere HA enabled to protect NSX-T Manager from host failure as well as provide resource reservation.
  • Controller – NSX-T Controllers on separate hosts with an anti-affinity setting and resource reservation.

Figure 6‑23: Dedicated Management and Edge Resources Design – ESXi + Edge VM Cluster

The Edge cluster design takes into consideration workload type, flexibility, and performance requirements, whether based on a simple ECMP design or one including services such as NAT. As discussed in Edge Node and Services Design, the design choices for an Edge cluster permit a bare metal and/or VM form factor. A second design consideration is the operational requirements of services deployed in active/active or active/standby mode.

The bare metal Edge form factor is recommended when a workload requires multi-10Gbps connectivity to and from external networks, usually with active/active ECMP-based services enabled. The availability model for bare metal is described in Edge Cluster and may require more than one Edge cluster depending on the number of nodes required to service the bandwidth demand. Additionally, typical enterprise workloads may require services such as NAT, firewall, or load balancer at high performance levels. In these instances, a bare metal Edge can be considered with tier-0 running in active/standby mode. A multi-tenant design requiring various types of tier-0 services in different combinations is typically better suited to VM Edge nodes, since a given bare metal node can enable only one tier-0 instance. Figure 6‑24 displays multiple Edge clusters – one based on the Edge node VM form factor and the other on bare metal – to help conceptualize the possibility of multiple clusters. The use of each type of cluster depends on the selection of services and performance requirements, while multi-tenancy flexibility provides independent control of resource configuration as long as a given cluster consists of the same type of node.

Figure 6‑24: Dedicated Management and Edge Resources Design – ESXi Only + Mix of Edge Nodes

The VM Edge form factor is recommended for workloads that do not require line-rate performance. It offers the flexibility of scaling, both in terms of on-demand addition of bandwidth and speed of service deployment. This form factor also makes lifecycle management of Edge services practical, since it runs on the ESXi hypervisor, and allows flexible evolution of services and elastic scaling of the number of nodes based on bandwidth needs. A typical deployment starts with four hosts, each hosting Edge VMs, and can scale up to eight nodes. The Edge Node VM section describes physical connectivity with a single Edge node VM per host, which can be expanded to additional Edge node VMs per host. If multiple Edge VMs deployed in a single host are used for active/standby services, the design requires more than one Edge cluster to avoid single-point-of-failure issues.

Some use cases may necessitate multiple Edge clusters comprised of sets of bare metal or VM Edge nodes. This may be useful when tier-1 requires a rich variety of services but has limited bandwidth requirements while tier-0 logical routers require the performance of bare metal. Another example is separation of a service provider environment at tier-0 from a deployment autonomy model at tier-1.

This may be required for a multi-tenant solution in which all tier-0 logical routers are deployed in a bare metal cluster while tier-1 is deployed with Edge node VMs based on the requirement for low-bandwidth services. This would also provide complete control/autonomy for provisioning of Edge tenant services while tier-0 offers static resources (e.g., provider Edge services).

6.5.3.2 Enterprise KVM Based Design

The enterprise KVM hypervisor-based design assumes all the components – management, Edge, and compute – are deployed with KVM as the base hypervisor. It relies on the KVM-based hypervisor to provide its own availability, agility and redundancy; thus it does not cover ESXi-centric capabilities including high availability, resource reservation, or vMotion.

Compute node connectivity for ESXi and KVM is discussed in the section Compute Cluster Design.

For the management cluster, this design recommends a minimum of three KVM servers. The following components are shared in the cluster:

  • Management – vCenter and NSX-T Manager with an appropriate high availability feature enabled to protect the NSX-T Manager from host failure as well as to provide resource reservation.
  • Controller – NSX-T Controllers on separate hosts with an anti-affinity setting and resource reservation.

Figure 6‑25: Dedicated Management and Edge Resources Design – KVM Only

With KVM as the only hypervisor, the bare metal Edge node is the only applicable form factor. The bare metal cluster considerations are the same as discussed in the ESXi design example.

7 Conclusion

This NSX-T design guide presents the foundation for deploying a next generation network virtualization platform. It opens up new possibilities for the coexistence of a variety of workloads spanning multiple vCenters and multiple types of hypervisors – along with cloud native applications – running in the cloud or in containers. This platform also lays the groundwork for future hybrid cloud connectivity. It delivers high performance bare metal Edge node options for line-rate stateful performance. The flexibility of NSX-T also introduces distributed routing services and reduces configuration complexity. The distributed controller architecture allows for processing of distributed firewalls and other services across multiple hypervisors. As customers adopt this new architecture and identify new use cases, this design guide will grow with the knowledge learned through that iterative process. Readers are highly encouraged to provide feedback to help improve this design guide, which is the work of many team members of the Networking and Security Business Unit at VMware.

Appendix 1: Migrating VMkernel Interfaces to the N-VDS

On an ESXi host, a physical port can only belong to a single virtual switch. Running several vSwitches on a host therefore requires multiple physical ports, increasing expense. This multiple-vSwitch scenario is common when deploying NSX-T: typically, during installation, the VMkernel interfaces of the ESXi host are left on the VSS while additional uplinks are allocated to the N-VDS. Where there are only two uplinks, this configuration is suboptimal because each vSwitch has a single physical uplink and thus no redundancy. It is desirable that both uplinks belong to the N-VDS; all VMkernel interfaces (vmkNICs) should therefore be migrated to the N-VDS.

As of NSX-T 2.0, the procedure for migrating vmkNICs to the N-VDS is only available via an API call to the NSX-T Manager. The simplest migration process follows these steps:

  • Prepare the ESXi hypervisor to be a transport node with an uplink profile specifying two uplinks.
  • Initially assign only one of the two physical uplinks to the transport node. Make sure that this uplink is attached to both the overlay transport zone and a VLAN transport zone. The VLAN transport zone must include VLAN-backed logical switches providing the same connectivity as the port groups where the vmkNICs were attached on the VSS.
  • Migrate the vmkNICs one by one from the VSS to the N-VDS using the following API call (a scripted sketch of this call follows the parameter list below):

PUT api/v1/transport-nodes/<tn-id>?if_id=vmkX&esx_mgmt_if_migration_dest=<LSwitch>

Where:

    • <tn-id> is the UUID of the transport node
    • vmkX is the name of the vmkNIC (e.g., vmk1, vmk2)
    • <LSwitch> is the UUID of the appropriate destination VLAN-backed logical switch
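
For readers who prefer to script this step, the following Python sketch wraps the API call documented above. Because the call is an update of the transport node object, the current transport node configuration is retrieved and resubmitted unchanged with the migration query parameters appended; the manager address, credentials, and UUIDs are placeholders to be replaced with values from the target environment.

# Sketch: migrate one vmkNIC from the VSS to a VLAN-backed logical switch on the N-VDS
# using the API call documented above. All identifiers below are placeholders.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "changeme")
TN_ID = "<transport-node-uuid>"
VMK = "vmk1"
LSWITCH_ID = "<vlan-logical-switch-uuid>"

# The PUT is an update of the transport node, so fetch its current configuration first.
tn_config = requests.get(f"{NSX_MGR}/api/v1/transport-nodes/{TN_ID}",
                         auth=AUTH, verify=False).json()

url = (f"{NSX_MGR}/api/v1/transport-nodes/{TN_ID}"
       f"?if_id={VMK}&esx_mgmt_if_migration_dest={LSWITCH_ID}")
resp = requests.put(url, json=tn_config, auth=AUTH, verify=False)
resp.raise_for_status()
print(f"Requested migration of {VMK} to logical switch {LSWITCH_ID}")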

Finally, the remaining physical interface of the VSS can be migrated to the N-VDS by reconfiguring the transport node to include both physical uplinks of the host.

Appendix 1‑1: ESXi Migration from VSS to N-VDS

Appendix 3: Glossary of NSX-T

  • NSX-T Manager (NSX-Mgr)
  • NSX-T Controller (NSX-Ctrl)
  • NSX-T Edge Node [VM | Bare Metal] (Edge[-VM|-BM])
  • N-VDS – NSX Virtual Distributed Switch
  • VDS – vSphere Distributed Switch
  • ESXi
  • KVM
  • Transport Node (TN)
  • Transport Zone (TZ)
  • Transport – the underlay network running the Overlay
  • TEP or TEPs – Tunnel End Points for Overlay
  • Overlay
  • Uplink – used in multiple contexts: uplink of a VDS or uplink of an Edge node
  • Ext – External Connection or uplink to ToR
