VMware NSX Bare Metal Edge Performance

Introduction



This document is designed for virtualization, security, and network architects interested in deploying VMware NSX Bare Metal Edge in multi-cloud environments, Software-Defined Data Center (SDDC) architectures, or Telco networks. VMware pioneered this multi-cloud solution to transform data centers and increase business agility. Every cloud is rooted in virtualization and is defined by three pillars: network virtualization, server virtualization, and storage virtualization. The NSX Edge Node plays an essential role in network virtualization, delivering networking and security services. The throughput that the NSX Edge Node can support is critical for the entire ecosystem and the services running on top of it.

This white paper outlines the performance results that can be achieved with NSX Bare Metal Edge for customers who implement Bare Metal Edge (BME) in their virtual networking infrastructure. The document focuses on testing the L3 routing throughput that the BME can support. It does not use security services or networking services (NAT, Gateway Firewall, or Advanced Security services). The content below describes the design and specific settings that have been used for executing RFC2544 performance tests.

VMware NSX supports two form factors for the Edge Transport Node: the Edge Virtual Machine (VM) or the Bare Metal (BM) Edge. Depending on the business and technical requirements, you can use the VM form factor Edge or the BM Edge. Below we walk through design and architecture considerations for the BM Edge and the settings to use to maximize performance. The document also describes the performance testing process and how the traffic is generated.

Bare Metal Edge

Important Considerations for Selecting Hardware

For Bare Metal Edge, you need to select hardware that will satisfy the requirements for the throughput you need to achieve. To make this selection, you should consider these important components:

  1. Physical NIC
  2. CPU
  3. Memory
  4. PCI slots
  5. Storage

Typically, it makes sense to evaluate the physical NICs and required bandwidth first, then consider CPU and memory, then evaluate the PCI slots necessary to be compatible with the NICs, and finally consider the storage requirements for the host. The following pages discuss each of these considerations in detail.


Physical NIC:

The physical NIC is one of the most important points to consider and therefore an ideal place to begin the selection process. When selecting a physical NIC, the factors you need to take into account are the required bandwidth and the connectivity and resilience requirements for the Bare Metal Edge. These requirements determine how many cards, and how many ports per card, you will need.

First, you must validate whether a specific card is supported by the VMware NSX version that you want to use by checking the official VMware documentation. Make sure that the Vendor ID and the PCI Device ID match those listed in the VMware documentation. The card must be compatible with the PCI generation and the number of lanes of the slots in your server. For example, if you want to use a 100Gbps NIC that has 2x 100Gbps ports, make sure the PCI slot supports 200Gbps per direction (400Gbps aggregated). You must consider the maximum bandwidth supported by the PCIe bus of the server that you intend to use, as well as the maximum aggregated bandwidth you need from a specific card. You also need to validate whether the firmware version is supported on Bare Metal Edge (VMware NSX 4.1):

https://docs.vmware.com/en/VMware-NSX/4.1/installation/GUID-14C3F618-AB8D-427E-AC88-F05D1A04DE40.html
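As a quick validation step, a minimal sketch (assuming shell access to the server and a standard Linux toolset; the interface name ens1f0 is only a placeholder) to read the PCI Vendor ID / Device ID pair and the NIC firmware version so they can be compared against the compatibility list:

# Print Ethernet controllers with their PCI [VendorID:DeviceID] pairs
lspci -nn | grep -i 'ethernet'

# Print driver, driver version, and firmware version for a given interface
ethtool -i ens1f0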

Bandwidth and throughput support of the NIC are critical to VMware NSX Edge. Look for NICs that provide sufficient bandwidth and throughput to handle your applications’ requirements. Consider factors like link aggregation (combining multiple NICs for increased throughput) or NICs with multiple ports for improved performance or distributing the traffic across different cards. Also consider whether you are going to use all the ports from the card and the maximum throughput you can have, as well as whether the PCI slot will satisfy the maximum expected bandwidth.

Another point is resiliency: when building your Bare Metal Edge, you need to consider whether the traffic passing through the BM Edge will still be protected if a NIC card or PCI slot fails. In this case, you can distribute the ports across different cards so that even if an entire card goes down, the traffic passing through the Bare Metal Edge is protected.

VMware NSX supports a maximum of 16 ports per Bare Metal Edge. There are two ways to scale bandwidth for your virtual network infrastructure: scaling up by adding more NICs, or scaling horizontally by adding more Bare Metal Edges to the Edge Cluster and using ECMP.

CPU:

You first need to select a vendor for the CPU, as Bare Metal Edge supports a broad list of processors from Intel and AMD. Regarding CPUs, there are two important characteristics to look at. The first one is the number of cores a CPU has. This is important for NSX Edge performance because it defines how many parallel tasks can be executed. On the Edge, the number of cores directly impacts the throughput you can support.

The second important component is the clock speed of the CPU measured in GHz, or how many cycles per second the CPU executes. It is important to note that clock speed is not the only factor that determines the overall speed and performance of the BME. Other factors, such as the number of cores, cache size, architecture, and efficiency of the CPU, also play a significant role. For example, two CPUs with the same clock speed may have different performance levels if one has more cores or a more advanced architecture.

When you are evaluating CPU options for your Bare Metal Edge, the main factor to consider is the number of cores necessary to satisfy the throughput requirement. In our performance tests we reached a maximum of roughly 2.5 Mpps per core. Theoretically we could reach more Mpps per core; however, the load is distributed across all cores. In a real-world production environment, your traffic will also be distributed across all cores. When you calculate the number of cores necessary, keep in mind that 50% of the total cores of the Bare Metal Edge will be used for Datapath. In our lab scenario we are using 2x Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz with 28 cores each, meaning we have 56 cores in total; only 28 cores will be used for Datapath.
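As a rough illustration of this sizing logic, the sketch below estimates the core count from a target packet rate. The 60 Mpps target and the 2 Mpps per-core rate are placeholder assumptions for the example, not sizing guidance:

#!/bin/bash
# Hypothetical Datapath core sizing (illustrative numbers only)
TARGET_MPPS=60        # expected aggregate packet rate in Mpps (placeholder)
PER_CORE_MPPS=2       # assumed sustainable per-core rate in Mpps (placeholder)

# Cores needed for Datapath, rounded up
DP_CORES=$(( (TARGET_MPPS + PER_CORE_MPPS - 1) / PER_CORE_MPPS ))

# Roughly 50% of the total cores are reserved for Datapath,
# so the server needs about twice as many physical cores
TOTAL_CORES=$(( DP_CORES * 2 ))

echo "Datapath cores needed : ${DP_CORES}"
echo "Total physical cores  : ${TOTAL_CORES}"

With these placeholder numbers the sketch would suggest 30 Datapath cores and roughly 60 physical cores in total, which is close to the 64-core system limit described below.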

As of VMware NSX 4.1, we support up to 64 cores for the entire system. This means that for a server with a single socket, the CPU can have up to 64 cores. For a server with 2 sockets, each socket cannot have more than 32 cores. You can select between a single socket or dual socket architecture, but quad socket is not supported. Please keep in mind that hyper-threading is not supported and must be disabled in the BIOS/UEFI.

Additionally, the performance of certain tasks can be affected by factors other than the CPU, such as memory speed and storage speed.

If you choose to use a dual socket server, you will need to consider the potential bottleneck created by the bus that interconnects the processors. If traffic comes in on a pNIC connected to NUMA0 (Non-Uniform Memory Access, a memory architecture where processors are directly attached to local memory) and needs to be sent out of a pNIC connected to NUMA1, the traffic will need to cross between NUMA nodes, which can create an additional bottleneck. Regarding NUMA nodes, it is important to understand that the layout depends on the CPU architecture, especially its memory bus design. A socket can contain one or more NUMA nodes, each with its cores and memory. A NUMA node contains a set of cores and threads and the memory that is local to that node. A core may have zero or more threads. A socket refers to the physical location where a processor package plugs into a motherboard. In our case, the Bare Metal Edge has 2 CPU sockets and 2 NUMA nodes (NUMA0 and NUMA1). Understanding the NUMA architecture is important when designing a Bare Metal Edge: it determines the NIC-to-NUMA mapping and how traffic will be forwarded.
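A minimal way to inspect this layout on a standard Linux host (the interface name ens1f0 is a placeholder, and numactl may need to be installed), in addition to the NSX CLI checks shown later in this section:

# Number of NUMA nodes and which CPUs belong to each
lscpu | grep -i 'numa'

# Per-node CPU and memory layout (requires the numactl package)
numactl --hardware

# NUMA node a given pNIC is attached to (-1 means not reported by the platform)
cat /sys/class/net/ens1f0/device/numa_node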


Diagram 1

 

The Edge reserves all the cores on NUMA0 for DPDK (Datapath processing). The Data Plane Development Kit (DPDK) consists of libraries that accelerate packet-processing workloads on a wide variety of CPU architectures, and the Bare Metal Edge uses NUMA0 for processing packets. This means that if packets arrive on a NIC that is connected to NUMA1, they will always be forwarded to NUMA0 for packet processing.

You need to consider a couple of design points if you want to use a LAG for overlay traffic. Please keep in mind that it is not recommended to use a LAG for the Uplink interfaces that connect the Edge to the Top of the Rack router; rely on L3 routing protocols over individual uplinks instead. All interfaces of the LAG need to be connected to NUMA0. If you have LAG interfaces connected to NUMA1, there will be no transmit (TX) traffic on the interfaces mapped to NUMA1 (this behavior may change in future versions). To avoid this, keep all LAG members on NUMA0. If you are using a Bare Metal Edge with 2 NUMA nodes, you can still use interfaces connected to NUMA1 for your Uplinks; just consider the cross-NUMA traffic explained above. You do not have the option to pin a specific interface to a specific NUMA node. To validate the mapping between interfaces and NUMA nodes, you can log in with user "admin" and password "default" (if it has not been changed) and run the command:

nsx-edge-1> get dataplane

Devices:
    Device_id : 0x07b0
    Name      : fp-eth0
    Numa_node : -1
    Pci       : 0000:0b:00.00
    Vendor    : 0x15ad

    Device_id : 0x07b0
    Name      : fp-eth1
    Numa_node : -1
    Pci       : 0000:13:00.00
    Vendor    : 0x15ad

Or you can log in with user "root" and password "default" (if it has not been changed) and check with the following commands:

root@nsx-edge-1:~# lspci -D | grep 'Network\|Ethernet'

0000:19:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)

0000:19:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)

 

root@nsx-edge-1:~# lspci -vmm

Slot:   19:00.1

Class:  Ethernet controller

Vendor: Intel Corporation

Device: Ethernet Controller 10G X550T

SVendor:        Dell

SDevice:        Ethernet 10G 4P X550/I350 rNDC

Rev:    01

NUMANode:       0

 

If you opt for a single socket design, you may experience better performance in comparison to a dual socket design. When selecting hardware, check the cross-NUMA interconnect (Ultra Path Interconnect) maximum speed and the supported maximum memory speed. If you use a single socket design, the traffic will always be serviced on NUMA0.

You can check CPU support at https://docs.vmware.com/en/VMware-NSX/4.1/installation/GUID-14C3F618-AB8D-427E-AC88-F05D1A04DE40.html

Check whether your hardware is listed as supported on the operating system certification page:

https://ubuntu.com/certified?category=Server&release=20.04%20LTS&category=Server

To decide how many CPU cores you will need, you must also consider the minimum and maximum supported cores for Bare Metal Edge. Keep in mind that you will need more cores as you use more services on the Bare Metal Edge. You can have a minimum of 8 cores and a maximum of 64 cores. Bare Metal Edge also requires a minimum of 32 GB of memory and 200 GB of storage. You can roughly calculate the number of cores you will need based on the maximum utilization you expect in terms of Gbps or Mpps. Please keep in mind that if you’ve reached the maximum supported cores (32 for Datapath), you can always scale your Edge Cluster horizontally.               

Memory:

The next component to consider is memory. Selecting memory with a higher transfer speed (MT/s, or mega transfers per second) will influence the maximum throughput you can achieve. When you select memory for your Bare Metal Edge, make sure it is compatible with the CPU that you have selected. In our specific case, we have distributed the memory across all available DIMM slots: 32 slots populated with DDR4 64 GB modules running at 2933 MT/s. This allows us to achieve better performance.

Memory speed and size may impact traffic performance. In the official product documentation, you will find information about the minimum required memory of 32GB and recommended memory of 256 GB. In the test we executed, we had 2048 GB.

https://docs.vmware.com/en/VMware-NSX/4.1/installation/GUID-14C3F618-AB8D-427E-AC88-F05D1A04DE40.html
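To confirm that the installed DIMMs are actually running at the expected transfer rate, a small check (assuming root shell access and the dmidecode tool on a Linux host) could look like this:

# Show the size and speed reported for each memory module (run as root)
dmidecode -t memory | grep -iE 'size:|speed:'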

PCI Slots:

When selecting servers, you need to determine the PCI-E generation and whether the card that you want to use requires a PCI-E slot of generation 3, 4, or 5. In Table 1, you will find the theoretical maximum performance levels that you can achieve with different PCI generations and lane counts.

| PCI-E Generation | Lanes | *Theoretical Bandwidth (unidirectional) | **Typical Bandwidth (in practice) |
|------------------|-------|-----------------------------------------|-----------------------------------|
| Gen 1            | x4    | 8 Gbps                                  | 7 Gbps                            |
| Gen 1            | x8    | 16 Gbps                                 | 14.1 Gbps                         |
| Gen 1            | x16   | 32 Gbps                                 | 28.2 Gbps                         |
| Gen 2            | x4    | 16 Gbps                                 | 12.8 Gbps                         |
| Gen 2            | x8    | 32 Gbps                                 | 25.6 Gbps                         |
| Gen 2            | x16   | 64 Gbps                                 | 51.2 Gbps                         |
| Gen 3            | x4    | 32 Gbps                                 | 22.4 Gbps                         |
| Gen 3            | x8    | 64 Gbps                                 | 44.8 Gbps                         |
| Gen 3            | x16   | 128 Gbps                                | 96.8 Gbps                         |
| Gen 4            | x16   | 256 Gbps                                | 209.6 Gbps                        |

Table 1

*Theoretical bandwidth is calculated from the number of lanes and the theoretical maximum speed per lane.

**Typical bandwidth is the tested bandwidth

 

In our specific case, we executed the tests with a Mellanox ConnectX-5 EX (PCI Device ID 0x1019) in a PCI Generation 4 x16 slot, which can provide 26,200 MB/s, or around 210 Gbps, per PCI-E slot. Please keep in mind that this is the bandwidth available to a single NIC card. If you have more than one port on the card and you want to use both ports, you may require a more recent PCI generation. Another point to consider is how many PCI-E slots are available on a specific NUMA node.
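To verify what a given slot and card actually negotiated, a short check (root shell assumed; the PCI address below is taken from the earlier lspci output and is only an example) can compare the advertised capability against the current link status:

# LnkCap = what the device supports, LnkSta = what was actually negotiated
# (Speed maps to the PCIe generation, Width to the number of lanes)
lspci -s 0000:19:00.0 -vv | grep -iE 'lnkcap:|lnksta:'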

Storage:

The only important point to remember regarding storage is that you need to have the minimum disk space for your Bare Metal Edge. When selecting storage, prefer storage that is local on the Bare Metal Edge. You also need to take into consideration that the storage controller must be supported by the Bare Metal Edge Ubuntu software:  https://ubuntu.com/certified?category=Server&release=18.04%20LTS&category=Server

 

Performance Tuning

In this section, we review different ways to change Bare Metal Edge settings to improve performance before going into production. Remember that these changes should be made only when you need to improve performance. You may need to change some settings to achieve the best performance, depending on the traffic and the applications that you are running. It is highly recommended to run a traffic test first to check performance, confirm whether it meets the expected results, and then assess whether you need to tune any of the settings.

When changing any of the values below, you must validate the impact on the performance.

Ring buffer size

The ring buffer is a common building block for storing packets that have just been received on the physical NIC or that are about to be sent. The ring buffer is a data structure used to efficiently manage a fixed-size buffer or queue. It has a circular, ring-like shape, which means that when the buffer reaches its maximum capacity, new data starts overwriting the oldest data in a continuous loop. This circular behavior allows for efficient memory usage and avoids the need to resize or move data when the buffer is full. The diagram below illustrates the path of a packet and the ring buffer functionality.


Diagram 2

When receiving a packet, the Edge node retrieves the next free particle from the ring buffer and copies the packet data into it. When the first particle is filled, the Edge moves to the next free particle in the ring buffer, links it to the first particle, and continues copying data into this second particle. If all the ring buffer particles are full, packet drops may result. For this reason, consider increasing the ring buffer size. VMware NSX has an alarm that will help you identify when you need to change your ring buffer size and will indicate that you may have packet loss. To learn more, read this article:  https://docs.vmware.com/en/VMware-NSX-Event-Catalog/index.html#edge-health-events-13

Here is the procedure to update the ring buffer size:

Validate dataplane ring-size

Edge-node> get dataplane | find ring

Bfd_ring_size      : 512

Lacp_ring_size     : 512

Learning_ring_size : 512

Livetrace_ring_size: 512

Rx_ring_size       : 512  ---> current receive ring size

Slowpath_ring_size : 512

Tx_ring_size       : 512  ---> current transmit ring size

 

To change ring buffer size, you must set one of the following values 512/1024/2048/4096. Here is an example that sets it to 1024 (number of descriptors):

 Edge-node> set dataplane ring-size rx 1024

 Edge-node> set dataplane ring-size tx 1024

Dataplane service must be restarted to update the ring buffer size:

 Edge-node> restart service dataplane

*Please keep in mind that restarting the dataplane service impacts traffic forwarding.

Flow Control

Disabling flow control may improve performance in certain scenarios. Flow control is a mechanism that allows network devices to regulate the flow of data between them, preventing data loss or congestion. However, there are cases where disabling flow control can be beneficial for performance optimization.

  • Buffering and latency reduction: Flow control mechanisms, such as the IEEE 802.3x standard for Ethernet, rely on buffering packets when the receiving device cannot handle the incoming data rate. This buffering introduces additional latency, as packets are held in the buffer until they can be processed. By disabling flow control, you eliminate the need for buffering and reduce latency, which can be critical for time-sensitive applications.
  • Avoiding packet drops: In some cases, flow control can result in dropped packets. When a receiving device becomes overwhelmed and sends a pause frame to the transmitting device, it effectively halts the data transmission until the buffer is cleared. If the buffer fills up faster than it can be emptied, packets may be dropped. By disabling flow control, you prevent the possibility of dropped packets, ensuring continuous data flow.
  • Optimizing for high-speed links: Flow control mechanisms were originally designed for slower network speeds. With advancements in Ethernet technology, especially with the emergence of 10 Gigabit and higher links, flow control may not be necessary or beneficial. The higher bandwidth available in these links can accommodate the data flow without the need for flow control, making it more efficient to disable it.
  • Predictable performance: Disabling flow control can lead to more deterministic behavior in the network. Without flow control, data transmission occurs without interruptions, allowing for more predictable performance characteristics. This predictability is particularly important in real-time applications, such as video streaming or VoIP, where consistent and low-latency data delivery is critical.

It's important to assess the network's needs, monitor performance, and conduct thorough testing before making any changes to flow control settings.

If you have flow control enabled, it may impact traffic performance. Pause frames are part of Ethernet flow control and are used to manage the pacing of data transmission on a network segment. Sometimes a sending node (TOR or Edge) may transmit data faster than another node can accept it. In this case, the overwhelmed network node can send pause frames back to the sender, pausing the transmission of traffic for a brief period. This slows down the rate at which packets are forwarded and impacts performance. By default, flow control should be disabled on the Top of the Rack switch/router and on the NSX BM Edge.
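On the Edge fastpath interfaces the NICs are owned by DPDK, so flow control is normally governed by the Edge dataplane and the peer switch configuration. The lines below are only a generic Linux illustration (the interface name is a placeholder) of how pause-frame settings can be checked and disabled on an ordinary host or test machine:

# Show current pause-frame (flow control) settings for an interface
ethtool -a ens1f0

# Disable flow control autonegotiation and RX/TX pause frames
ethtool -A ens1f0 autoneg off rx off tx off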

Hyper-Threading

Enabling hyper-threading on Bare Metal Edges can lead to performance problems due to Datapath fastpath threads incorrectly sharing a physical core with another process or thread. Hyper-threading is always disabled (through a grub parameter) when you boot the Edge. You can check the hyper-threading setting by logging into the Bare Metal Edge as the ‘root’ user and checking the output of the command # lscpu | grep core. If the value is 1, this means that hyper-threading is disabled:

Edge-node# lscpu | grep core

Thread(s) per core:     1

Flow Cache

Flow Cache is an optimization that helps reduce the CPU cycles spent on known flows. When a new flow starts, the Flow Cache tables are immediately populated. This enables the forwarding decisions for subsequent packets of a flow to be skipped if the flow already exists in the flow table. If packets from the same flow arrive consecutively, the fast path decision for the first packet is stored in memory and applied directly to the rest of the packets in that burst. If packets are from different flows, the decision per flow is saved to a hash table and used to decide the next hop for packets in each of the flows. Flow Cache can reduce CPU cycles by as much as 75%, a substantial improvement. Make sure flow cache is enabled; VMware NSX 4.1.1 will raise an alarm if flow cache is disabled.
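The state of flow cache can be confirmed from the Edge node CLI. The exact command set may vary between NSX versions, so treat the lines below as an assumption and verify them against the CLI reference for your release:

Edge-node> get dataplane flow-cache config

Edge-node> get dataplane flow-cache stats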

Note: If you use the NSX Bare Metal Edge Redeploy API, you need to configure settings for performance tuning again since they will not be applied during redeploy.

NSX Configuration - The Best Performance

Uplink Profile

The uplink profile is a key element for maximizing the performance of the Bare Metal Edge. In our design guide, we share details on how the uplink profile can be used to manage traffic forwarding through specific physical interfaces. The uplink profile helps you achieve the maximum performance from your Bare Metal Edge by specifying which interfaces should be used for connectivity to the Top of the Rack switch and which physical interfaces should be used for sending overlay traffic. In the test case scenarios discussed in this document, you will see the influence of having a single TEP, multiple TEPs, or a LAG.
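Uplink profiles are usually created from the NSX UI, but the same object can be created through the Manager API. The sketch below is only an outline under the assumption of an UplinkHostSwitchProfile with a load-balance-source default teaming for TEPs and named teamings that pin uplink traffic; the hostname, credentials, VLAN, and uplink names are placeholders, and the payload fields should be verified against the NSX API guide for your version:

# Create an uplink profile with source-based load balancing for TEPs
# and named teaming policies that pin traffic to a specific uplink (illustrative payload)
curl -k -u 'admin:PASSWORD' -H 'Content-Type: application/json' \
  -X POST 'https://nsx-manager.example.com/api/v1/host-switch-profiles' \
  -d '{
        "resource_type": "UplinkHostSwitchProfile",
        "display_name": "bm-edge-uplink-profile",
        "transport_vlan": 200,
        "teaming": {
          "policy": "LOADBALANCE_SRCID",
          "active_list": [
            {"uplink_name": "uplink-1", "uplink_type": "PNIC"},
            {"uplink_name": "uplink-2", "uplink_type": "PNIC"}
          ]
        },
        "named_teamings": [
          {"name": "uplink1-only", "policy": "FAILOVER_ORDER",
           "active_list": [{"uplink_name": "uplink-1", "uplink_type": "PNIC"}]},
          {"name": "uplink2-only", "policy": "FAILOVER_ORDER",
           "active_list": [{"uplink_name": "uplink-2", "uplink_type": "PNIC"}]}
        ]
      }'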

1/ Uplink profile using load balancing source for TEPs: With this teaming policy, TEP traffic is balanced across 2 uplinks. The Bare Metal Edge is configured with 2 TEP IPs, and TEP traffic is kept separate from Uplink traffic. The uplinks Uplink1 and Uplink2 (on separate pNICs) are used in named teaming policies to pin the NSX VLAN Segments used for the Uplinks, so that their traffic exits via a specific physical NIC uplink. This allows us to make maximum use of the uplinks. A named teaming policy does not support a standby Uplink.

  

Diagram 3

2/ Uplink profile using LAG: With this teaming policy, we can run the Link Aggregation Control Protocol (LACP) on the Bare Metal Edge, with two or more interfaces balancing the traffic to and from the TEP. The TEP traffic is equally balanced across the LAG interfaces. This teaming policy is suitable for Bare Metal Edge TEP traffic only, NOT for uplink traffic to the Top of the Rack routers; it is not advisable to use LAG/MLAG and balance the control plane traffic over it. We use the named teaming policy for connectivity to the Top of the Rack router so that we can pin the traffic to two separate physical uplinks. This teaming policy is tested in scenario 2.

                   

Diagram 4


Additional details about uplink profiles and edge connectivity are available in the NSX reference design guide.

ECMP Routing

Equal cost multi-path (ECMP) routing increases north-south bandwidth by adding more paths and balancing the traffic across them. The ECMP routing paths are used to load balance traffic and provide fault tolerance for failed paths. The Tier-0 logical router must be in active-active mode for ECMP to be available. A maximum of eight ECMP paths is supported. The ECMP implementation on NSX Edge is based on the 5-tuple of protocol number, source address, destination address, source port, and destination port. The algorithm used to distribute the data among the ECMP paths is not round robin, so some paths might carry more traffic than others. Note that if the protocol is IPv6 and the IPv6 header has more than one extension header, ECMP will be based only on the source and destination addresses.
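To illustrate the idea of hash-based path selection, a toy sketch that maps a 5-tuple onto one of four equal-cost paths (the addresses are placeholders and this is not the hash function NSX actually uses):

# proto,src,dst,sport,dport of a sample flow (placeholder values)
FLOW="17,192.0.2.10,198.51.100.20,49152,443"

# Hash the 5-tuple and pick one of 4 next hops; packets of the same flow
# always hash to the same path, while different flows may spread out
HASH=$(printf '%s' "$FLOW" | cksum | awk '{print $1}')
echo "Flow ${FLOW} -> ECMP path $((HASH % 4))"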

Jumbo MTU for Higher Throughput

Maximum Transmission Unit (MTU) denotes the maximum packet size that can be sent on the physical fabric. When setting this configuration on the ESX hosts and the physical fabric, Geneve header size has to be taken into consideration. Our general recommendation is to allow for at least 200 bytes of buffer for Geneve headers in order to accommodate the option field for use cases such as service insertion. As an example, if the VM’s MTU is set to 1500 bytes, the pNIC and the physical fabric should be set to 1700 or more. Our recommendation for optimal throughput is to set the underlying fabric and ESX host’s pNICs to 9000 and the VM vNIC MTU to 8800. Please keep in mind that if you allow Jumbo MTU on the Bare Metal Edge interfaces, you must make sure Jumbo MTU is allowed everywhere in the network infrastructure.
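A simple way to confirm that the jumbo MTU is honored along the whole path is to send do-not-fragment probes sized for the target MTU. The sketch below assumes a Linux test host and a placeholder destination address (on ESXi hosts, vmkping against the TEP netstack is the usual equivalent):

# 9000-byte MTU minus 20-byte IP header and 8-byte ICMP header = 8972-byte payload
# -M do sets Don't Fragment, so an undersized link fails instead of fragmenting
ping -M do -s 8972 -c 3 192.0.2.1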

Test Case Scenario and Topology

Topology

In this section, we will share more details about the hardware that is used in this test case scenario. We will also share details about the settings for each test case scenario. The performance tests are executed with a Bare Metal Edge using the hardware specification described in Table 2:

| CPU | Memory | NIC | PCI-E |
|-----|--------|-----|-------|
| 2x Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz | 32x DDR-4 64GB 2933 MT/s | Mellanox ConnectX-5 EX (PCI_DEVICE_ID_MELLANOX_CONNECTX5EX), PCI Dev ID: 0x1019, Firmware: 16.31.20.06 | Gen4 |

Table 2

NSX Version: 4.0.1.1.0.20598726

BM Edge version: 4.0.1.1.0.20598726

Ring buffer size is increased to 4096 for TX and RX; hyper-threading is disabled by default; flow control is disabled.

The path of the packets in our topology is as follows: the tester generates 1,280 UDP flows with different IP sources and destinations; the destinations are the VMs connected to the overlay segments. The traffic is forwarded from the tester to the TOR, and the TOR sends it to the Tier-0 gateway, which delivers it to the VMs. The VMs loop the traffic back by sending it to their default gateway with the tester as the destination IP, and the Tier-0 gateway forwards it through its uplinks to the Top of the Rack switch (100G switch) and back to the tester. The tester measures the throughput of generated and received traffic.

Topology:

 

Logical Topology:


Test Case Scenario 1: Single Uplink/Single Downlink

We used a single Bare Metal Edge with two Mellanox ConnectX5 EX cards, each with 2x100Gbps ports. The tests leverage one port of each card for sending traffic: one PNIC for Uplink to the TOR and one PNIC for Downlink for the traffic to the VMs.

In the diagram you will see that in this test scenario we are using one Uplink in VLAN 3511 and one Downlink, which carries multiple Overlay Segments where the VMs are connected. The segments where the VMs are connected are downlink interfaces attached to a Tier-0 Gateway. We are generating 100Gbps from the tester. The traffic is symmetric and consists of 1,280 flows to the 20 VMs, and the VMs loop the traffic back to the tester. This test measures how many packets are sent and received, the throughput that meets the 0.01% frame loss tolerance, and the latency.

The success criteria for this test are having all packet sizes pass the tests with the following conditions:

  • Packet size: 256B; 512B; 1024B; 1500B
  • Frame loss tolerance: 0.01%
  • Maximum Rate limit: 100% of the Physical Interface (100Gbps)
  • Min Rate Limit: 1% of the Physical Interface
  • Time for execution: 90s
  • Selected stream: 1,280 UDP Flows
  • Number of retries: 3

 

Test Scenario 1 topology

Diagram 5

Results:

| Packet (Byte) | Throughput (Gbps) | Aggregated Throughput (fps) | Geneve (Byte) | BME Geneve Egress Traffic (Gbps) | Total BME Traffic, Ingress+Egress (Gbps) |
|---------------|-------------------|-----------------------------|---------------|----------------------------------|------------------------------------------|
| 256           | 79.8              | 36,146,004                  | 314           | 89.4                             | 178.8                                    |
| 512           | 89.5              | 21,014,246                  | 570           | 95.1                             | 190.2                                    |
| 650           | 91.7              | 17,110,541                  | 708           | 96.4                             | 192.8                                    |
| 1024          | 94.7              | 11,340,793                  | 1082          | 97.6                             | 195.2                                    |
| 1500          | 95.7              | 7,874,434                   | 1558          | 98.2                             | 196.2                                    |

Table 3

 

Test Scenario 2: Two Uplinks/Two Downlinks in LAG

We have a Bare Metal Edge with four Mellanox ConnectX5 EX cards, each with 2x100Gbps ports. We used one port of each card for sending and receiving traffic. In total, 4 cards and 4 ports were used: two ports on different cards for Uplinks to the tester and two ports on different cards for Downlinks configured in a LAG for the traffic to the VMs. The two cards used for the LAG are on the same NUMA node (NUMA0). We are generating 200Gbps from the tester to the Uplinks. The server has only 2 PCI x16 slots, one connected to each NUMA node, which means NUMA0 has only 1 PCI slot with 16 lanes and NUMA1 has 1 PCI slot with 16 lanes. Thus, the hardware limits impact the results below: PCI Gen 4 x16 can carry about 209 Gbps, while PCI Gen 4 x8 can carry up to 96 Gbps. We are using one interface from a card in a PCI Gen 4 x16 slot and one interface from a card connected at PCI Gen 4 x8, so for the LAG we can have ~196 Gbps.

The same combination exists for the Uplinks: one interface from a card in a PCI Gen 4 x16 slot and one interface from a card connected at PCI Gen 4 x8. As a result, the Uplinks can support ~196 Gbps. We must also consider that traffic will cross NUMA nodes, as we have 2 ports attached to NUMA0 (in the LAG for Downlink) and 2 ports attached to NUMA1 (for the Uplinks). The traffic is symmetric with 1,280 flows (UDP flows using different combinations of source and destination IP addresses) to the 20 VMs, and the VMs loop the traffic back to the tester. The tester measures the number of packets sent and received and the throughput that meets the 0.01% frame loss tolerance.

The success criteria for this test are to have all packet sizes pass with the following conditions:

  • Packet size: 128B; 256B; 512B; 1024B; 1500B
  • Frame loss tolerance: 0.01%
  • Maximum Rate limit: 100% of the Physical Interface (200Gbps)
  • Min Rate Limit: 1% of the Physical Interface
  • Time for execution: 90s
  • Selected stream: 20x64
  • Number of retries: 3

 

Test Scenario 2 topology

Diagram 6

Results:

| Packet Size Sent by Tester (Bytes) | Throughput Sent by Tester (Gbps) | Aggregated Frame Rate (fps) | Geneve Packet Size (Bytes) | BME Geneve Egress Traffic (Gbps) | Line Rate Utilization (%) | CPU Utilization (28 DP Cores) | Total BME Traffic, Ingress+Egress (Gbps) |
|------|-------|-------|------|-------|-----|----------------------------|-------|
| 256  | 29.8  | 13.5M | 314  | 34.0  | 17% | 90-94%, 1.1-1.4 Mpps/core  | 64.0  |
| 512  | 135.7 | 31.8M | 570  | 145.9 | 73% | 93-94%, 2.0-2.4 Mpps/core  | 291.8 |
| 650  | 161.4 | 29.9M | 708  | 169.3 | 85% | 91-94%, 1.1-1.4 Mpps/core  | 338.6 |
| 1024 | 184.7 | 22.1M | 1082 | 191.4 | 96% | 90-94%, 1.4-1.7 Mpps/core  | 382.8 |
| 1500 | 189.6 | 15.5M | 1558 | 194   | 97% | 87-90%, 0.7-0.8 Mpps/core  | 388.0 |

Table 4

 

Note: Two Uplinks/Two Downlinks dual TEP detail

When discussing Bare Metal Edge and TEPs, we need to consider the design case in which we have 2 physical interfaces for Downlinks and are using 2 TEP IPs. In this case the traffic uses both TEPs; however, it could be unequally balanced across the two TEP IPs. In the case of a heavy flow approaching 99Gbps (on a 100Gbps NIC) that uses one of the TEPs, we are near the line rate (100Gbps), which may cause instability in traffic forwarding. It is therefore necessary to plan for a lower percentage of line rate utilization when using 2 TEPs and 2 NICs. On a Bare Metal Edge, we see better balancing when we use a LAG for the TEP IPs, because the traffic is then equally balanced across all physical NICs. Please keep in mind this is not the case with the Edge VM, because the line rate of the physical NICs is higher than the maximum performance of a single Edge VM.

Diagram 7

Conclusion and Analysis

It is important to carefully consider each hardware component that you will use so that you can maximize the performance of the Bare Metal Edge. This white paper showed results for different test case scenarios and discussed cases utilizing the maximum throughput that can be achieved from the host and the NIC using RFC2544 tests with aggressive parameters. The current tests were conducted with a single BM Edge in the edge cluster. To scale performance, there are two options: scale up by adding more NICs (if CPU capacity is not the bottleneck) or expand the Edge Cluster horizontally.

Horizontal scaling (scaling out) of the Edge cluster is one of the most frequently used approaches when we need more bandwidth or computing power. Up to 10 NSX Edge Nodes are supported in the same Edge cluster, and up to 8 Edges can be used for ECMP. It is easy to add a new Bare Metal Edge with the same hardware using the VMware Active/Active design options. Even with stateful services enabled, VMware NSX supports a Stateful Active/Active topology where traffic is forwarded through all Edge Nodes. When scaling the Edge Cluster, it is important to use a consistent hardware configuration so that all edge nodes in the cluster are homogeneous and deliver the same performance.

If we have sufficient CPU resources, we can easily insert more NIC Cards and scale out the performance of our BM Edge.

As our results show, we can achieve line rate throughput with BM Edge and have ~197 Gbps (aggregated throughput) when using only 2x 100Gbps interfaces.

 
