100G NIC Evaluation

 

100 Gbit/s network links are becoming increasingly common, and the need to run analytics on these links is getting more attention. Because a single CPU core cannot process network data at this rate, intelligent NICs are needed that make it possible to spread the analysis across multiple cores. We tested one of these solutions to evaluate the performance needed to analyze 100 Gbit/s traffic.

 

There are many analysis approaches that use information from L2-4, and in some cases even deep packet analysis up to L7. One example of an analysis tool at L4 is ARGUS, which creates NetFlow-like flow data from data streams; tstat takes a similar approach. These tools define a flow by the 5-tuple of source and destination IP address, source and destination port, and the protocol in use.
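The 5-tuple flow definition can be illustrated with a minimal sketch. This is not how ARGUS or tstat are implemented; the names `FlowKey` and `aggregate` and the sample packets are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The 5-tuple that identifies one flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP


def aggregate(packets):
    """Sum packet and byte counts per flow.

    `packets` is an iterable of (FlowKey, length_in_bytes) pairs.
    """
    table = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for key, length in packets:
        entry = table[key]
        entry["packets"] += 1
        entry["bytes"] += length
    return dict(table)


# Hypothetical sample traffic: two packets of one TCP flow, one of another.
sample = [
    (FlowKey("10.0.0.1", "10.0.0.2", 40000, 80, 6), 750),
    (FlowKey("10.0.0.1", "10.0.0.2", 40000, 80, 6), 750),
    (FlowKey("10.0.0.3", "10.0.0.2", 40000, 80, 6), 64),
]
flows = aggregate(sample)
print(len(flows), "flows")
```

Every distinct 5-tuple adds one entry to the table, so the state a tool has to keep grows with the number of concurrent flows.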

Modern multicore servers, with their large number of cores and ample memory, can provide enough compute for analysis at these speeds. The problem is that even the most efficient analysis tools cannot process 100G of traffic on a single CPU core, so some form of multiprocessing is required. To spread the load over multiple cores in parallel, and thereby use the full compute available on the server, the incoming stream must be load-balanced across multiple cores and the processes running on them.

One approach to this problem is a specialized NIC design that uses FPGAs to pre-process the input stream at line rate, independent of the server's resources. These designs expose the input stream as several virtual NICs that processes can attach to. Multiple hashing algorithms make sure that the packets of a flow are always mapped to the same process, so TCP reassembly remains possible. The NICs can also filter out traffic that is not needed for the use case at hand, or drop portions of the payload if that data is irrelevant.
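The idea of hash-based flow distribution can be sketched in software. This is only an illustration of the principle, not the algorithm the NIC runs in its FPGA; the function names and the use of CRC32 are assumptions. The endpoints are sorted before hashing so that both directions of a connection land on the same stream, which is what keeps TCP reassembly possible.

```python
import zlib


def symmetric_flow_hash(src_ip, dst_ip, src_port, dst_port, protocol):
    """Hash a 5-tuple so that both directions of a flow give the same value."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    lo, hi = sorted([a, b])  # order the endpoints so direction does not matter
    key = f"{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}|{protocol}".encode()
    return zlib.crc32(key)


def pick_stream(flow_hash, num_streams):
    """Map a flow hash to one of `num_streams` virtual NICs / worker processes."""
    return flow_hash % num_streams


# Hypothetical endpoints: a request and its reply must reach the same worker.
forward = symmetric_flow_hash("10.0.0.1", "10.0.0.2", 40000, 80, 6)
reverse = symmetric_flow_hash("10.0.0.2", "10.0.0.1", 80, 40000, 6)
print(pick_stream(forward, 8) == pick_stream(reverse, 8))  # True
```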

Napatech is a company that provides solutions that satisfy these needs. Their accelerator cards, with up to 2x100G ports, use FPGAs to pre-process the input stream and expose the data as multiple virtual NICs that the analysis processes can bind to. The cards are able to process 100G streams without packet loss even at minimum packet size, and transfer this data via PCIe Gen3 into server memory.

Our test environment consisted of a Dell server with two NUMA nodes (CPUs), each containing eight physical cores. The server has 128 GB of RAM installed and supports PCIe Gen3 to get the data from the NIC to memory. We installed a one-port Napatech card in the server. Additionally, we used an Ixia traffic generator to provide test traffic at 100G, which allowed us to vary the number of simulated flows and the packet sizes used in the test stream.

In our tests, we wanted to determine what performance is needed to analyze a 100G data stream using ARGUS on a standard Intel server, and which factors influence the performance needed. The server we used provided 16 cores @ 3.2 GHz across two CPUs (NUMA nodes) and 128 GB of RAM.

We decided to look at the following parameters that we deemed most interesting:

1. Scaling of the number of cores:

  1. We tested packets of average internet size with 256 flows against 1-8 cores, to see how processing scales with the compute available.
  2. We also used the data gathered to assess how well the load balancing works.

2. Packet size used in the streams:

  1. We tested minimum-size packets (64 bytes), 750-byte packets (close to the average internet packet size), and maximum-size packets (1500 bytes), all in streams at 100 Gbit/s.

3. Number of flows simulated:

  1. We simulated different numbers of endpoints talking to each other, scaling the number of flows through 256, 65536, and 1048576. We again used 750-byte packets and 8 cores for this measurement.

In all tests we sent traffic from the Ixia to the server NIC at the full line rate of 100 Gbit/s.

Here is a sample of the results we obtained: 

 

1. Scaling the number of cores:

We decided to measure the scaling of total packets per second (pps), using an average internet packet size of 750 bytes and a medium number of endpoints (16384 flows). We generated a 100 Gbit/s traffic stream for 10 minutes, and measured per-core pps, total pps, and the maximum pps achieved. At this setting, the traffic generator produces roughly 16M packets per second.
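The figure of roughly 16M packets per second follows from simple line-rate arithmetic, sketched below. The 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, and inter-frame gap) are standard Ethernet values, but whether the packet sizes quoted in this article include the frame check sequence is an assumption, so the numbers are approximate.

```python
LINE_RATE_BPS = 100e9  # 100 Gbit/s

# Assumption: 7-byte preamble + 1-byte start-of-frame delimiter
# + 12-byte inter-frame gap are added on the wire for every frame.
WIRE_OVERHEAD_BYTES = 20


def packets_per_second(frame_bytes, line_rate_bps=LINE_RATE_BPS,
                       overhead=WIRE_OVERHEAD_BYTES):
    """Maximum packet rate at line rate for a fixed frame size."""
    return line_rate_bps / ((frame_bytes + overhead) * 8)


for size in (64, 750, 1500):
    print(f"{size:>5} bytes: {packets_per_second(size) / 1e6:6.2f} Mpps")
```

Under these assumptions, 750-byte packets give about 16.2 Mpps, 1500-byte packets about 8.2 Mpps, and 64-byte packets about 148.8 Mpps, which is why minimum-size packets are by far the hardest case.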

We saw that we are able to process all packets reliably without any drops with 8 cores, which gives us an estimate of 2.5M pps per process/core as reliably achievable. The maximum pps reached during measurements with 7 or fewer cores (which dropped packets during the run) gives us an estimate of a short-term upper limit of around 3M pps. We can see that with 6 cores we are able to process nearly all packets generated. We still experience drops because some of the processes are overloaded for a short time by load-balancing fluctuations.

 

 

We can also see that packets per second scale nearly linearly with the number of cores. Each core adds around 2.4M pps, up to a total of 16M pps with eight cores. 16M pps is the full data stream at 100G with our settings.

 

 

Looking at packets processed per core, we see that the load balancing works reasonably well.
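One simple way to put a number on load-balancing quality is the ratio of the busiest core's packet count to the mean. The sketch below uses hypothetical per-core counts, not our measured values; the metric itself is an illustrative choice rather than the one used to produce our results.

```python
from statistics import mean


def imbalance(per_core_packets):
    """Return max/mean of the per-core packet counts (1.0 = perfectly even)."""
    return max(per_core_packets) / mean(per_core_packets)


# Hypothetical per-core counts for illustration only; not measured values.
counts = [2_010_000, 1_990_000, 2_050_000, 1_950_000,
          2_000_000, 2_020_000, 1_980_000, 2_000_000]
print(f"imbalance factor: {imbalance(counts):.3f}")
```

Because drops start as soon as the busiest process is overloaded, the maximum matters more than the average: an imbalance factor of 1.025 means the busiest core needs about 2.5% headroom over the mean load.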

 

 

 

  

 

2. Impact of the number of concurrent flow entries

 

From the perspective of ARGUS, one TCP connection is translated into one flow. Because we use static TCP source and destination ports, one flow is created between each pair of endpoints.

We looked at how the number of flows influences performance. We did this with 16 cores attached, and with medium and large packet sizes. With medium-size packets we can process 32768 flows without any drops; with 65536 flows we start dropping packets. Moving to large packets (1500 bytes), we are able to process twice as many flows without dropping packets (131072 flows), and we see drops when doubling this number again (262144 flows). As expected, the larger the flow tables in ARGUS, the more compute is needed to sort them and add information to them.

 

3. Running two threads per core

 

The last factor of interest is hyperthreading, and whether it could be beneficial. To analyze its influence, we ran the same traffic first with 16 processes, one per core, and then with 32 processes, two threads on each core. The traffic pattern consisted of minimum-size packets and a medium number of flows, and the total compute was not sufficient to process the data without drops.

The result is that doubling the number of threads in this scenario did not provide any performance gain. It seems that one process per core is able to fully load the core, so there is no benefit from using hyperthreading. This result is in line with several other publications and with recommendations from NIC vendors.

 

Number of processes    pps/core  Mbps/core  pps total  Mbps total
16                      4462051       2855   71392816       45680
32 (hyperthreading)     2218373       1419   70987936       45408
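The figures in the table above can be cross-checked with a few lines of arithmetic. The sketch below copies the values from the table, confirms that each total equals the per-core value times the number of processes, and computes the relative change in total throughput; the dictionary layout and key names are illustrative.

```python
# Values copied from the table above.
rows = {
    16: {"pps_per_core": 4462051, "mbps_per_core": 2855,
         "pps_total": 71392816, "mbps_total": 45680},
    32: {"pps_per_core": 2218373, "mbps_per_core": 1419,
         "pps_total": 70987936, "mbps_total": 45408},
}

# Each total should equal the per-core value times the number of processes.
for procs, r in rows.items():
    assert procs * r["pps_per_core"] == r["pps_total"]
    assert procs * r["mbps_per_core"] == r["mbps_total"]

# Relative change in total pps when going from 16 to 32 processes.
change = (rows[32]["pps_total"] - rows[16]["pps_total"]) / rows[16]["pps_total"]
print(f"total pps change with 32 processes: {change:+.2%}")
```

The total changes by well under one percent, while the per-process rate halves, which is consistent with 16 processes already saturating the 16 physical cores.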

 

Conclusion:

The result of this test is that the pps processed per core is fairly stable, and that with 8 cores we are able to process a 100 Gbit/s stream of 750-byte packets. Dropping to the smallest possible packet size would need more than an order of magnitude more compute power, roughly 80-100 cores, to process 100 Gbit/s without drops.

 

The limiting factor in our tests was the amount of compute available. This means that a sensor for processing high-bandwidth flows should have the highest core count and core frequency available.

A good next step would be to test a sensor like this against real-world traffic, to see what bandwidth and number of flows can be expected in real-world scenarios.

It would also be beneficial to see how other analysis applications behave, and whether we can reach similar performance values with them.

 

This evaluation was supported by IU’s whitebox switch project (supported by NSF EAGER grant #1535522) and the Netsage project (NSF award #1540933). Our special thanks go to Napatech, which provided two 100G NICs for this evaluation.