Feb 11, 2012
 

I have fielded a number of questions on VMware’s multicast support and figure it is time I wrote a short blog on it. There is a good white paper on the VMware site, Multicast Performance on vSphere 5.0, that covers the performance changes made to enhance multicast support in vSphere 5.

The recurring question I get is how multicast is handled in vSphere. The short answer is the vSwitch does not play a role in the IGMP join and leave messages that the VMs send in order to start and stop receiving multicast groups respectively.

The vSwitch, and this applies to both the Standard Switch (VSS) and the vSphere Distributed Switch (vDS), has inherent knowledge of the configuration of the VM’s virtual NIC (vNIC). Typically, when a guest OS is interested in receiving traffic from a certain multicast group, the network stack in the guest OS pushes the corresponding multicast MAC address down to the vNIC. The vSwitch gets this multicast MAC address directly from the vNIC emulation and tracks it in its forwarding table. When the guest OS sends out IGMP Join/Leave messages, the vSwitch does not interpret them; it forwards them to the physical switch (pSwitch), which makes the usual decision on whether to accept the join based on its configuration. This is possible because the pSwitches use IGMP snooping to keep track, per physical port, of which multicast groups to send out of that port.
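To make the "multicast MAC address" concrete: for IPv4, the standard mapping places the low 23 bits of the group address behind the fixed prefix 01:00:5e. The sketch below illustrates that mapping only; the function name is mine, and it is not a description of how the guest network stack or the vNIC emulation is implemented.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Only the low 23 bits of the group address are carried into the MAC.
    low23 = int(addr) & 0x7FFFFF
    mac = (0x01005E << 24) | low23
    # Format the 48-bit value as six colon-separated hex bytes.
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))


print(multicast_mac("239.1.2.3"))
print(multicast_mac("224.129.2.3"))
```

Because 5 bits of the group address are dropped, 32 different groups share each MAC address (the two groups above map to the same one), so a MAC-based lookup can occasionally deliver a group the guest did not ask for; the guest's IP stack discards it.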

When multicast traffic comes into the vSwitch, the vSwitch forwards copies of the traffic to subscribed VMs. Forwarding for multicast traffic is done the same way as for unicast and is based on the destination MAC address. Since the vSwitch tracks which vNIC is interested in which multicast groups, it delivers packets only to the right set of VMs. In this way the vSwitch does not deliver packets to all VMs, only to the vNICs that match the forwarding table lookup.

If a VM leaves a multicast group, it sends an IGMP leave message, which is forwarded to the pSwitch, and then removes the multicast MAC address from its vNIC to stop receiving the stream. The vSwitch then removes the vNIC from the forwarding table for that multicast group. If the VM in question was the last one on the ESXi server that had requested the multicast group, the pSwitch also removes the group from the list of multicast groups to send out of the physical port.
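The join, forward and leave behaviour described so far can be modelled as a table keyed by multicast MAC address. This is a toy model for illustration only; the class and method names are hypothetical and do not correspond to any vSphere API.

```python
from collections import defaultdict


class VSwitchMulticastTable:
    """Toy model: multicast MAC address -> set of vNIC ports that programmed it."""

    def __init__(self):
        self._ports_by_mac = defaultdict(set)

    def vnic_add_filter(self, port, mac):
        # The guest pushed a multicast MAC down to its vNIC.
        self._ports_by_mac[mac.lower()].add(port)

    def vnic_remove_filter(self, port, mac):
        # The guest removed the multicast MAC from its vNIC.
        mac = mac.lower()
        ports = self._ports_by_mac.get(mac)
        if ports is None:
            return
        ports.discard(port)
        if not ports:
            del self._ports_by_mac[mac]

    def lookup(self, dst_mac):
        # Forwarding decision: only ports that match the destination MAC.
        return set(self._ports_by_mac.get(dst_mac.lower(), set()))


table = VSwitchMulticastTable()
table.vnic_add_filter("vm1-vnic0", "01:00:5e:01:02:03")
table.vnic_add_filter("vm2-vnic0", "01:00:5e:01:02:03")
print(table.lookup("01:00:5e:01:02:03"))  # both ports receive a copy
table.vnic_remove_filter("vm1-vnic0", "01:00:5e:01:02:03")
print(table.lookup("01:00:5e:01:02:03"))  # only vm2-vnic0 remains
```

Note that nothing in this model parses IGMP: the table is driven entirely by what the vNICs program, which matches the point that the vSwitch forwards IGMP messages without interpreting them.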

What if the VM is vMotioned?

When a VM is vMotioned, its vNIC configuration goes with it. The destination host sees this vNIC configuration and updates its forwarding tables to forward the necessary multicast traffic it receives to the VM. To prevent any transient multicast packet loss after a vMotion, the vSwitch also injects an IGMP query into the VM, using its unicast MAC address, so that multicast receiver presence is known to the pSwitches immediately. This avoids the VM missing multicast traffic while it waits for the next IGMP query to come from an IGMP querier on the network.

The IGMP querier is usually a router on the network and is required in order for IGMP snooping to work on pSwitches. The pSwitches use this information in their multicast forwarding tables and without it would not be able to do IGMP snooping. The routers send out IGMP queries to address 224.0.0.1, the all-systems multicast group, and the VMs that have subscribed to a multicast group respond with a membership report listing the groups they are participating in. The pSwitch snoops this information and updates its multicast forwarding tables to start forwarding the multicast groups for the VM.
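For readers who have not looked at an IGMP query on the wire, here is a minimal sketch that builds the 8-byte IGMPv2 general query payload (type 0x11, group 0.0.0.0). I am assuming IGMPv2 purely for illustration; which IGMP version the querier or the vSwitch actually uses depends on the environment, and the function names are mine.

```python
import socket
import struct

ALL_SYSTEMS_GROUP = "224.0.0.1"  # destination IP address of a general query
IGMP_MEMBERSHIP_QUERY = 0x11


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum used by IGMP."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_igmpv2_general_query(max_resp_tenths: int = 100) -> bytes:
    """Build the IGMPv2 general query payload (no IP or Ethernet header)."""
    group = socket.inet_aton("0.0.0.0")  # a general query asks about all groups
    unchecked = struct.pack("!BBH4s", IGMP_MEMBERSHIP_QUERY, max_resp_tenths, 0, group)
    checksum = internet_checksum(unchecked)
    return struct.pack("!BBH4s", IGMP_MEMBERSHIP_QUERY, max_resp_tenths, checksum, group)


query = build_igmpv2_general_query()
print(query.hex())
```

A receiver that gets this query answers with a membership report per group, and those reports are what the pSwitch snoops.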

How about Physical NIC Teaming?

Physical NIC teaming is supported, but how it works depends on the load balancing scheme used.

If the physical NICs are all active and the teaming is virtual source port ID or MAC hash based, then the VM’s IGMP join messages will go out of the configured physical NIC and the corresponding pSwitch will update its multicast forwarding tables to send out the multicast group to the VM on the associated physical port.

In the case where one of the physical NICs starts out in standby mode and VMs are failed over to it, the vSwitch will, as in the vMotion case above, inject IGMP queries into the VMs affected by the failover so that multicast receiver presence is known to the pSwitch immediately and packet forwarding can continue.

In the case of link aggregation that uses IP hash for load balancing, the pSwitch treats the pNICs as one channel and will fail the multicast traffic over between the pNICs as they are all subscribed to the same groups. The pNIC used to send the multicast traffic to the vSwitch depends on the pSwitch load-balancing scheme. Keep in mind that to use link aggregation with multiple pSwitches, the switches need to be stacked so that they look like a single switch to the ESXi servers.

In Closing

Multicast traffic is not one of those things that is talked about a lot in general virtualization implementations. There is great support for it now in vSphere with performance constantly being improved.

The main driving force behind multicast in virtualized environments is probably financial institutions that rely heavily on it for streaming things like market data and video. With the use of 10Gb NICs and performance improvements in multicast handling, there is now very little stopping the virtualization of even the most demanding multicast-based applications.

Mar 23, 2011
 

Introduction:

Networking in VMware Cloud Director (vCD) is the most complex and least understood component of its architecture.  Among the different networking options was the introduction of vCD Network Isolation (vCDNI).  In this blog I am going to attempt to give you a better understanding of vCDNI and how it works technically.

Background:

vCDNI is an overlay networking scheme that builds on top of an existing Ethernet network to provide isolated networking for Virtual Machines.  An overlay network is a virtual network of devices or connections that rides on top of an existing networking infrastructure.  An example would be a data connection that rides on top of a voice connection in a telephone network.

The scheme used for vCDNI is a MAC-in-MAC encapsulation that allows for many isolated networks, up to 16 million in vCD 1.0, to be created and isolated on top of one Ethernet network.

If you are familiar with Lab Manager’s networking, vCDNI is based on the same technology used in Host Spanning Private Networks found in LabManager 4. See LabManager4_Whats_New.pdf for details on this.  VM traffic encapsulation is done in the fast data path by a vCDNI dvfilter operating in the vmkernel.  No action is required inside the VM for this.  Lab Manager required a service VM to be present in the control path, whereas vCD does not require such a VM.

Technical Breakdown:

vCDNI networks are implemented as portgroups in vSphere with each network designated a unique network ID, also known as a Fence ID.  In vCenter the portgroup would have a name like:

dvs.VC1055813345DVS2CM1-V99-F12-Emca Internet

The breakdown of the portgroup name is:

VC1055813345      vCenter ID
DVS2              Distributed switch number
CM1               vCD Instance
V99               Transport VLAN
F12               Network (Fence) ID
Emca Internet     vCD Network Name
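If you need to pull these fields out of portgroup names in a script, a regular expression built from the breakdown above is enough. This is a sketch based solely on the single example name shown here; the pattern, the type name and the function name are my own, and names produced by other vCD versions may not follow the same layout.

```python
import re
from typing import NamedTuple


class VcdniPortgroup(NamedTuple):
    vcenter_id: str
    dvs: str
    vcd_instance: str
    transport_vlan: int
    fence_id: int
    network_name: str


# Pattern derived from the one example in this post; treat it as an assumption.
_PATTERN = re.compile(r"^dvs\.(VC\d+)(DVS\d+)(CM\d+)-V(\d+)-F(\d+)-(.+)$")


def parse_vcdni_portgroup(name: str) -> VcdniPortgroup:
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised vCDNI portgroup name: {name!r}")
    vcenter_id, dvs, instance, vlan, fence, network = match.groups()
    return VcdniPortgroup(vcenter_id, dvs, instance, int(vlan), int(fence), network)


print(parse_vcdni_portgroup("dvs.VC1055813345DVS2CM1-V99-F12-Emca Internet"))
```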

vCD assigns a unique Network ID to each vCDNI network and uses a different vSphere ephemeral portgroup for each network. With an ephemeral portgroup, a virtual port is created and assigned to a Virtual Machine when it is powered on, and is deleted when it is powered off. An ephemeral portgroup has no limit on the number of ports that can be part of it other than the vCenter supported limit.  As such, the practical upper bound for concurrent vCDNI networks is the maximum number of ephemeral portgroups supported in vCenter, currently 1016 in vSphere 4.1.

The isolation inherent in vCDNI is achieved by re-encapsulating the original MAC frames from Virtual Machines in a vCDNI frame to create a new MAC-in-MAC frame. Part of the new vCDNI header is a network ID that identifies which isolation network the packet belongs to.  The final frame looks like this:

The vCDNI MAC header carries the assigned virtual MAC address of the destination ESX(i) server and identifying information for the source ESX(i) server, that is, the hosts running the destination and source Virtual Machines respectively.

The vCDNI data header contains vCDNI protocol specific data such as sequence and version numbers and vCDNI Network ID.

The following example depicts the vCDNI packet flow:

  • Virtual Machine sends packet out of its virtual NIC
  • dvfilter adds vCDNI MAC and Data headers to create the overlay vCDNI network
  • DVS adds the transport VLAN tags to the packet, if needed, and passes it on to the physical NIC
  • Destination ESX server Physical NIC gets the packet
  • DVS strips off the transport VLAN tags and passes it up to the dvfilter
  • dvfilter strips the vCDNI protocol data and passes the packet on to the destination VM
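The encapsulate and decapsulate steps in the flow above can be sketched in a few lines. The exact vCDNI wire format is not given in this post, so the field layout below is hypothetical: an outer Ethernet header (14 bytes) followed by a 10-byte data header holding version, flags, sequence number and Network ID, chosen only so that the total matches the 24-byte overhead. The EtherType is an IEEE local-experimental placeholder, not the real vCDNI value.

```python
import struct

VCDNI_OVERHEAD = 24

# HYPOTHETICAL layout, not the real vCDNI format:
# dst host MAC, src host MAC, EtherType, version, flags, sequence, Network (Fence) ID
_HEADER = struct.Struct("!6s6sHBBII")
assert _HEADER.size == VCDNI_OVERHEAD

PLACEHOLDER_ETHERTYPE = 0x88B5  # IEEE local experimental EtherType, used as a stand-in


def encapsulate(inner_frame, dst_host_mac, src_host_mac, fence_id, sequence=0):
    """What the dvfilter does on transmit: prepend the overlay header."""
    header = _HEADER.pack(dst_host_mac, src_host_mac, PLACEHOLDER_ETHERTYPE,
                          1, 0, sequence, fence_id)
    return header + inner_frame


def decapsulate(outer_frame):
    """What the dvfilter does on receive: strip the header, return (fence_id, frame)."""
    if len(outer_frame) < _HEADER.size:
        raise ValueError("frame shorter than the overlay header")
    _dst, _src, ethertype, _version, _flags, _seq, fence_id = _HEADER.unpack_from(outer_frame)
    if ethertype != PLACEHOLDER_ETHERTYPE:
        raise ValueError("not an overlay frame")
    return fence_id, outer_frame[_HEADER.size:]


vm_frame = bytes(64)  # stand-in for the original frame sent by the VM
wire_frame = encapsulate(vm_frame, b"\x02\x00\x00\x00\x00\x0a", b"\x02\x00\x00\x00\x00\x0b", fence_id=12)
print(len(wire_frame) - len(vm_frame))
print(decapsulate(wire_frame)[0])
```

The point to take away is that the VM's frame is carried unchanged as payload, and the Network ID in the added header is what tells the receiving host which isolated network the frame belongs to.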

 

The added vCDNI header increases the maximum size of the frame by 24 bytes.  To ensure that there is no fragmentation due to oversized frames, the Ethernet Maximum Transmission Unit (MTU) needs to be increased by 24 bytes from its default of 1500 bytes to 1524 bytes. The MTU increase needs to be implemented end to end. This MTU change needs to be done in three places:

  • At the vCD level, by changing the Network Pool MTU

  • At the vCenter level, by modifying the advanced properties of the DVS hosting the Network pool

  • At the physical switches, depending on the switch implementation, by increasing the MTU of the switch or ports used by the DVS hosting the network pool

a.  On a Cisco Catalyst switch

Catalyst_Switch(config)# system mtu 1524
Catalyst_Switch(config)# exit
Catalyst_Switch# reload

b.  On a Cisco Native IOS switch

NativeIOS_Switch(config)# interface gigabitEthernet 1/1
NativeIOS_Switch(config-if)# mtu 1524
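Since the MTU has to be raised in all three places, a small check over the values you have configured can catch the one that was missed. This is only arithmetic on numbers you supply by hand; the function names and the dictionary keys are illustrative and nothing here queries vCD, vCenter or a switch.

```python
from typing import Dict, List

DEFAULT_ETHERNET_MTU = 1500
VCDNI_OVERHEAD = 24


def required_mtu(vm_mtu: int = DEFAULT_ETHERNET_MTU) -> int:
    """MTU the transport path needs so encapsulated frames are not fragmented."""
    return vm_mtu + VCDNI_OVERHEAD


def undersized_hops(configured: Dict[str, int], vm_mtu: int = DEFAULT_ETHERNET_MTU) -> List[str]:
    """Return the names of the places whose MTU is still too small."""
    needed = required_mtu(vm_mtu)
    return [name for name, mtu in configured.items() if mtu < needed]


path = {"vCD network pool": 1524, "DVS": 1524, "physical switch": 1500}
print(required_mtu())
print(undersized_hops(path))
```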

Implementation Consideration:

vCDNI is the optimal option in any environment where you need to create Virtual Machine networks without consuming VLANs in the process.  I would recommend that vCDNI network pools back all Organization and vApp networks. (See my previous blog on the definitions).  This would allow for the definition of thousands of networks without the actual consumption of VLANs, other than for the transport network.

By implementing the vCD dynamic portgroups on an “Organization” DVS and the static portgroups on a “Provider” DVS you will be able to isolate the dynamic portgroups to just the “Organization” DVS and make the MTU changes only on the infrastructure supporting the “Organization” DVS.  Doing this will also allow you to maximize the number of ports you can create in one vCenter as the “Organization” DVS will use ephemeral portgroups and the “Provider” DVS will use dynamic, preferably, or static portgroups.

Security Consideration:

The vCDNI transport network needs to be an isolated set of switches or a dedicated non-routed VLAN.  For practical reasons a dedicated non-routed VLAN will scale much better in a large vCD deployment.

The argument behind the isolated switches or non-routed VLAN recommendation is that vCDNI does not encrypt its packets and just overlays the different isolated networks on top of the transport network.  If a physical machine, or Virtual Machine for that matter, has access to the transport network, it would be trivial to disassemble the packets, as well as inject packets into any of the isolated networks.  This is not unique to vCDNI though, as Ethernet packets are not encrypted either, making it trivial to spoof Ethernet packets if you have access to the transport network.  The same applies to VLANs by the way, as access to the VLAN transport network (Native VLAN of the trunk) will allow you to inject packets into any VLAN allowed on that trunk.

If you are not using vCDNI backed pools, I would strongly recommend that you isolate all vCD networks by using unique VLANs for each network to ensure tenant isolation.

It is very important to understand that one of the biggest risks to the virtual environment is misconfiguration, not the technology. Thus you need strong audit controls to ensure that you avoid misconfiguration, either accidental or malicious.

Summary:

The picture below shows logically where in the traffic flow the dvfilter is inserted

To get per host vCDNI statistics use:

/opt/vmware/vslad/fence-util

This utility can be used to display:

  1. Configured networks and their MTUs
  2. Active ports and their port IDs
  3. Switch state including inside and outside MAC addresses
  4. Port statistics on a port ID basis

For help with the command, type it with no options.

Acknowledgements:

This blog would not have been possible without the technical assistance of Anupam Dalal and the feedback of Michael Haines.