In my last post, I talked about the architecture of the Hyper-V Virtual Switch (VMSWITCH), which powers some of the largest data centers in the world, including but not limited to Windows Azure. In this post I will talk about how it meets the networking performance requirements of the demanding workloads that run in these data centers.
VMSWITCH provides an extremely high-performance packet processing pipeline by using techniques such as a lock-free data path, pre-allocated memory buffers, and batch packet processing. In addition, it leverages the packet processing offloads provided by the underlying physical NIC hardware. These offloads do some of the packet processing in NIC hardware, thereby reducing overall CPU usage and providing high-performance networking. If you are unfamiliar with NIC offloads, you may want to first read about them here and here.
VMSWITCH supports various NIC offloads such as checksum, large send (LSO), and IPsec task offload. In the rest of this post, I will talk about the high-level architecture for supporting these offloads.
There are multiple types of NIC objects in VMSWITCH, to support legacy operating systems as well as high-performance enlightened operating systems. The legacy NIC is used only for scenarios such as PXE boot and older operating systems. It cannot support NIC offloads because it emulates hardware that did not support such offloads. The other three types of NIC support NIC offloads, as shown in the picture below.
VMSWITCH advertises offload support to the network stack for the host and VM virtual NICs. For Windows, this is done by advertising offload support using NDIS primitives. For a VM virtual NIC, the appropriate offload capability information is sent via VMBUS messages; inside the VM, the virtual NIC driver converts it to NDIS objects and advertises it to the VM network stack. For Linux, the enlightened virtual NIC driver in the VM (referred to as NetVSC) converts the offload capability information into a Linux-specific format and advertises it to the Linux network stack.
VMSWITCH is a consumer of the offload capabilities of the physical NIC. When a vSwitch is created and attached to a physical NIC, VMSWITCH queries the offload capabilities of the NIC and stores this information in the respective NIC object.
This way, VMSWITCH builds an offload capability map of each NIC that is connected to a particular vSwitch. Some offloads, mainly checksum and large send offload, are always advertised by VMSWITCH to the host and VM virtual NICs. Others, such as IPsec task offload, are advertised only if the physical NIC supports them. This is described below in detail.
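To make the advertising rule concrete, here is a minimal sketch in Python. It is not VMSWITCH code: the names `Offload`, `NicObject`, and `advertised_caps` are hypothetical, and the model assumes a single physical NIC per vSwitch.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Offload(Flag):
    """Hypothetical capability bits, for illustration only."""
    NONE = 0
    SEND_CHECKSUM = auto()
    LARGE_SEND = auto()
    RECV_CHECKSUM = auto()
    IPSEC_TASK = auto()


# Offloads the post describes as always advertised, because the switch
# can fall back to software when hardware cannot help.
ALWAYS_ADVERTISED = (
    Offload.SEND_CHECKSUM | Offload.LARGE_SEND | Offload.RECV_CHECKSUM
)


@dataclass
class NicObject:
    """Stand-in for a per-NIC object holding the queried capabilities."""
    name: str
    is_physical: bool
    hw_caps: Offload = Offload.NONE


def advertised_caps(physical_nic):
    """Capabilities a host or VM virtual NIC would see on this vSwitch."""
    caps = ALWAYS_ADVERTISED
    # IPsec task offload is passed through only when the hardware has it.
    if physical_nic is not None and Offload.IPSEC_TASK in physical_nic.hw_caps:
        caps |= Offload.IPSEC_TASK
    return caps
```

In this model, a virtual NIC on a vSwitch whose physical NIC lacks IPsec support still sees checksum and large send offload, but never sees IPsec task offload.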
Send Checksum and Large Send Offload Handling
VMSWITCH always advertises these offloads as supported to the host and VM virtual NICs. One reason is that before NDIS 6.0, there was no way to dynamically advertise support for them. The other reason is that physical NICs in Windows are required to support them, so it was guaranteed that the underlying physical NIC would always have them. VMSWITCH handles them in such a way that if the offload is not available for any reason, it emulates the offload in software. This happens, for example, when a user has disabled the offload on the physical NIC (by using the NetAdapter PowerShell cmdlets or the advanced configuration of the physical NIC in Windows), or when the packet is going from one VM to another VM.
When the network stack in the host or a VM sends a packet, it attaches metadata to the packet indicating whether checksum calculation or segmentation (for large send offload) is needed. This information is sent to VMSWITCH via either NDIS or VMBUS, depending on whether the sender is a host or a VM virtual NIC. VMSWITCH first processes the packet as it normally does, i.e. it applies the ingress, forwarding, and egress policy steps. Once it has calculated the destination vPort list for the packet, it checks whether the destination NIC is capable of carrying out the requested offload. Only packets going to the physical network via a physical NIC can leverage hardware offload. So if the packet is not going to a physical NIC, or is going to a physical NIC that is not capable of this offload, VMSWITCH carries out the offload in software and then sends the packet to the NIC.
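The per-destination decision described above can be sketched as follows. This is an illustrative Python model, not the actual implementation: `DestNic`, `SendRequest`, and `plan_offloads` are hypothetical names, and the real switch works on NDIS packet metadata rather than Python objects.

```python
from dataclasses import dataclass


@dataclass
class DestNic:
    """Hypothetical view of a destination NIC and its enabled offloads."""
    name: str
    is_physical: bool
    hw_checksum: bool = False
    hw_lso: bool = False


@dataclass
class SendRequest:
    """Offload requests carried in the packet metadata (simplified)."""
    needs_checksum: bool = False
    needs_lso: bool = False


def plan_offloads(request, dest):
    """Decide, per requested offload, whether hardware or software does it."""
    plan = {}
    if request.needs_checksum:
        # Only a physical NIC with the offload enabled can do it in hardware.
        use_hw = dest.is_physical and dest.hw_checksum
        plan["checksum"] = "hardware" if use_hw else "software"
    if request.needs_lso:
        use_hw = dest.is_physical and dest.hw_lso
        plan["lso"] = "hardware" if use_hw else "software"
    return plan
```

Note that the decision is made per destination: the same packet request can end up on the hardware path for one destination NIC and on the software path for another.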
Since the primary scenario for I/O is the physical network, this works well: hardware NIC offload is leveraged for packets going out to the physical network, and software offload is performed for packets going to the host or to VMs. Performing the offload in software in VMSWITCH is no worse than doing it in the host or VM network stack, since the CPU would have had to do this work anyway. In fact, it is a little better to do it in VMSWITCH, because in the case of LSO it allows the VM to send frames larger than the MTU to VMSWITCH, resulting in fewer packet transfers between the VM and VMSWITCH.
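A rough sense of the savings comes from a simplified model of software segmentation. The sketch below is only illustrative: `segment_payload` is a hypothetical helper that splits a TCP payload into MSS-sized chunks and ignores header replication and sequence number updates, and the 64,000-byte send is an assumed example size.

```python
def segment_payload(payload: bytes, mss: int):
    """Split one large send into MSS-sized chunks (headers not modeled)."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


# Assumed example: a 64,000-byte send with a 1460-byte MSS, which is what
# a 1500-byte MTU leaves after 20-byte IPv4 and 20-byte TCP headers.
large_send = bytes(64_000)
segments = segment_payload(large_send, 1460)
```

With these numbers the VM hands VMSWITCH one large packet instead of 44 MTU-sized ones (43 full segments plus a final 1220-byte segment), and the segmentation happens only once the destination is known.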
Receive Checksum Offload
VMSWITCH always advertises this offload as supported to the host and VM virtual NICs. It allows a physical NIC to verify the IP, TCP, and UDP checksums in a packet and indicate (via packet metadata) whether they are valid. VMSWITCH leverages this data for packets coming from the physical NIC and passes it on to the host or VM virtual NIC. Also note that if VMSWITCH performs a send checksum request in software, as in the case where a packet from a VM is going to another VM or to the host virtual NIC (or vice versa), it marks the receive checksum bits as verified. This avoids unnecessary checksum validation in the host or VM network stack and reduces overall CPU usage.
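The local-delivery shortcut can be sketched as below. This is a simplified Python model under stated assumptions: `Packet` and `deliver_locally` are hypothetical, the checksum is kept as a separate field instead of being written into real IP/TCP headers, and `internet_checksum` is the standard ones' complement Internet checksum rather than anything specific to VMSWITCH.

```python
from dataclasses import dataclass
from typing import Optional


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as used by IP, TCP, and UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


@dataclass
class Packet:
    """Hypothetical packet with simplified offload metadata."""
    payload: bytes
    checksum: Optional[int] = None
    send_checksum_requested: bool = False
    recv_checksum_verified: bool = False


def deliver_locally(pkt: Packet) -> Packet:
    """VM-to-VM or VM-to-host delivery: finish the checksum in software."""
    if pkt.send_checksum_requested:
        pkt.checksum = internet_checksum(pkt.payload)
        pkt.send_checksum_requested = False
        # The switch just computed the checksum, so the receiving stack
        # does not need to validate it again.
        pkt.recv_checksum_verified = True
    return pkt
```

The receiving stack in this model would skip validation whenever `recv_checksum_verified` is set, which is the CPU saving the paragraph above refers to.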
IPsec Task Offload
IPsec task offload is more complicated, mainly because performing its computation in the host can result in the host partition consuming a lot of CPU on behalf of the VM. This would allow a VM to cause heavy CPU consumption in the host that is not attributed to the VM, which amounts to a kind of denial-of-service (DoS) attack on the host. IPsec task offload support in Windows allows the offload capability to be dynamic, and it also allows per-destination security associations (SAs). This lets VMSWITCH allow offload of IPsec encryption only to destinations that are reachable via a physical NIC that has IPsec task offload support.
IPsec task offload support is dynamically calculated and advertised to the virtual NICs based on the capability of the underlying physical NIC. In addition, when a VM requests a new security association for offload, the request is validated and allowed only if the destination is reachable via the physical NIC. If the destination changes for any reason, e.g. a remote VM that was accessible via the physical NIC live migrates to the same host as the sending VM, VMSWITCH uploads the security associations from the sending VM, so that the sending VM starts doing IPsec encryption itself instead of relying on offload.
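The validation and upload behavior can be modeled roughly as follows. This is a sketch with hypothetical names (`Switch`, `request_sa_offload`, `destination_moved_local`); it tracks only which (VM, destination) pairs have an offloaded SA and does not model real NDIS IPsec offload structures.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Hypothetical model of SA offload bookkeeping in a vSwitch."""
    pnic_supports_ipsec: bool
    # destination address -> True if it is reached through the physical NIC
    via_physical: dict = field(default_factory=dict)
    # (vm name, destination) pairs that currently have an offloaded SA
    offloaded_sas: set = field(default_factory=set)

    def request_sa_offload(self, vm: str, dest: str) -> bool:
        """Allow the offload only for destinations behind a capable NIC."""
        ok = self.pnic_supports_ipsec and self.via_physical.get(dest, False)
        if ok:
            self.offloaded_sas.add((vm, dest))
        return ok

    def destination_moved_local(self, dest: str) -> set:
        """Destination is now on this host, e.g. after a live migration.

        Returns the SAs uploaded back to their sending VMs, which then
        perform IPsec encryption themselves.
        """
        self.via_physical[dest] = False
        uploaded = {sa for sa in self.offloaded_sas if sa[1] == dest}
        self.offloaded_sas -= uploaded
        return uploaded
```

Rejecting SAs for local destinations up front is what keeps the host from doing IPsec work in software on a VM's behalf.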
In this post, I gave a high-level description of how offloads are used in VMSWITCH to provide high-performance network device virtualization. In future posts, I will talk about other performance techniques such as VMQ and the mostly lock-free and allocation-free data path.
This posting is provided “AS IS” with no warranties and confers no rights.