Jun 27 2012
 

Got notification today that my session has been approved for VMworld in both San Francisco and Barcelona.  The session details are:

Session ID: INF-VSP1196
Session Title: vCloud Director Networking Deep Dive
Track: Infrastructure

If you are going to be at either venue, try to attend my session to get your vCloud Networking questions answered.

As I am about to start putting the material for this together, feel free to leave comments with things you would like to see covered in the session, and I will try to include them.  I am hoping to make this as worthwhile as I can for everyone who attends.

I would like to publicly thank all of you who took the time to vote for my session during the public voting.  I am sure you are the ones that made the difference. See you at VMworld.

Feb 11 2012
 

I have fielded a number of questions on VMware’s multicast support and figure it is time I did a short blog on it. There is a good white paper on the topic on the VMware site called Multicast Performance on vSphere 5.0 that deals with performance changes that have been made to enhance multicast support in vSphere 5.

The recurring question I get is how multicast is handled in vSphere. The short answer is the vSwitch does not play a role in the IGMP join and leave messages that the VMs send in order to start and stop receiving multicast groups respectively.

The vSwitch, both the Standard Switch (VSS) and the vSphere Distributed Switch (vDS), has inherent knowledge of the configuration of the VM’s Virtual NIC (vNIC). Typically, when a Guest OS is interested in receiving traffic from a certain multicast group, the network stack in the Guest OS pushes down the corresponding multicast MAC address to the vNIC. The vSwitch gets this multicast MAC address from the vNIC emulation directly and tracks it in the forwarding table. When the Guest OS sends out IGMP Join/Leave messages, the vSwitch does not interpret them and forwards them to the physical switch (pSwitch), which makes the usual decision on accepting the join, or not, based on its configuration. This is possible on the pSwitches because they use IGMP snooping and keep track, on a per physical port basis, of which multicast groups to send out of each port.
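
To make the "corresponding multicast MAC address" concrete, here is a minimal sketch of the standard IPv4 mapping: the low 23 bits of the group address are placed under the 01:00:5e prefix. The function name is my own and is only for illustration.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Keep the low 23 bits of the group and place them under 01:00:5e.
    low23 = int(addr) & 0x7FFFFF
    mac = (0x01005E << 24) | low23
    # Format the 48-bit value as six colon-separated hex octets.
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(multicast_mac("239.255.0.1"))
```

Because only 23 bits survive the mapping, different groups (224.0.0.1 and 224.128.0.1, for example) share one MAC address, so a MAC-based forwarding table can occasionally deliver a group the guest did not ask for and the guest's IP stack filters it.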

When multicast traffic comes into the vSwitch, the vSwitch forwards copies of the traffic to subscribed VMs. Forwarding for multicast traffic is done the same way as unicast, and is based on destination MAC address. Since the vSwitch tracks which vNIC is interested in which multicast groups, it delivers packets only to the right set of VMs. In this way the vSwitch does not deliver packets to all VMs but only to vNICs that match the forwarding table lookup.

If a VM leaves a multicast group, it will send an IGMP leave message, which will be forwarded to the pSwitch, and then remove the multicast MAC address from its vNIC to stop receiving the stream. The vSwitch will then remove the vNIC from the forwarding table for that multicast group. If the VM in question was the last one on the ESXi Server that had requested the multicast group, the pSwitch will also remove the group from the list of multicast groups to send out of the physical port.
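
The join, deliver and leave behaviour can be sketched as a toy forwarding table keyed by multicast MAC address. This is only an illustration of the idea; the class and method names are hypothetical and do not reflect how the vSwitch is implemented.

```python
from collections import defaultdict


class VSwitchSketch:
    """Toy model: multicast MAC -> set of vNICs that programmed that MAC."""

    def __init__(self):
        self.table = defaultdict(set)

    def program_mac(self, vnic: str, mac: str) -> None:
        # The guest pushed a multicast MAC down to its vNIC.
        self.table[mac].add(vnic)

    def remove_mac(self, vnic: str, mac: str) -> None:
        # The guest removed the MAC from its vNIC after leaving the group.
        self.table[mac].discard(vnic)
        if not self.table[mac]:
            del self.table[mac]

    def deliver(self, dst_mac: str) -> list:
        # Only vNICs that match the lookup receive a copy of the frame.
        return sorted(self.table.get(dst_mac, ()))


sw = VSwitchSketch()
sw.program_mac("vm1-vnic0", "01:00:5e:7f:00:01")
sw.program_mac("vm2-vnic0", "01:00:5e:7f:00:01")
sw.remove_mac("vm1-vnic0", "01:00:5e:7f:00:01")
print(sw.deliver("01:00:5e:7f:00:01"))
```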

What if the VM is vMotioned?

When a VM is vMotioned, its vNIC configuration goes with it. The destination host sees this vNIC configuration and updates its forwarding tables to forward the necessary multicast traffic it receives to the VM. To prevent any transient multicast packet loss after a vMotion, the vSwitch also injects an IGMP query into the VM, using its unicast MAC address, so that multicast receiver presence is known to the pSwitches immediately. This avoids the VM missing multicast traffic by having to wait for the next IGMP query to come from an IGMP querier on the network.

The IGMP querier is usually a router on the network and is required in order for IGMP snooping to work on pSwitches. The pSwitches use this information in their multicast forwarding tables and without it would not be able to do IGMP snooping. The routers send out IGMP queries to address 224.0.0.1, the all-systems multicast group, and the VMs that have subscribed to a multicast group respond with a membership report listing the groups they are participating in. The pSwitch snoops this information and updates its multicast forwarding tables to start forwarding the multicast groups for the VM.
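
For reference, this is roughly what a general query looks like on the wire. The sketch assumes IGMPv2 (the post does not say which version is in play) and builds only the 8-byte IGMP message, not the IP header that carries it to 224.0.0.1. The helper names are my own.

```python
import struct


def inet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum used by IGMP (and IP, UDP, ICMP)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def igmpv2_general_query(max_resp_tenths: int = 100) -> bytes:
    """Build an IGMPv2 general query; max response time is in 1/10 s units."""
    # Type 0x11 is a membership query; group 0.0.0.0 makes it a general query.
    unsummed = struct.pack("!BBH4s", 0x11, max_resp_tenths, 0, bytes(4))
    checksum = inet_checksum(unsummed)
    return struct.pack("!BBH4s", 0x11, max_resp_tenths, checksum, bytes(4))


query = igmpv2_general_query()
print(query.hex())
```

A receiver that recomputes the checksum over the whole message, checksum field included, should get zero for an intact query.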

How about Physical NIC Teaming?

Physical NIC teaming is supported but how it works is dependent on the type of load balancing scheme used.

If the physical NICs are all active and the teaming is virtual source port ID or MAC hash based, then the VM’s IGMP join messages will go out of the configured physical NIC and the corresponding pSwitch will update its multicast forwarding tables to send out the multicast group to the VM on the associated physical port.

Where one of the physical NICs starts out in standby mode and VMs are failed over to it, the vSwitch will, as in the vMotion case above, inject IGMP queries into the VMs affected by the failover so that multicast receiver presence is known to the pSwitch immediately to allow packet forwarding.

In the case of link aggregation that uses IP hash for load balancing, the pSwitch treats the pNICs as one channel and can fail the multicast traffic over between the pNICs, as they are all subscribed to the same groups. The pNIC used to send the multicast traffic to the vSwitch will depend on the pSwitch load-balancing scheme. Keep in mind that to use link aggregation with multiple pSwitches, the switches need to be stacked in order to look like a single switch to the ESXi servers.
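
To illustrate why the teaming policy matters, here is a deliberately simplified sketch of how an uplink might be chosen. Both functions are assumptions for illustration, a modulo on the virtual port ID and an XOR of source and destination IP, and are not the actual ESXi algorithms. The uplink names are examples.

```python
import ipaddress


def uplink_by_port_id(virtual_port_id: int, uplinks: list) -> str:
    """Port-ID style: a vNIC is pinned to one uplink regardless of destination."""
    return uplinks[virtual_port_id % len(uplinks)]


def uplink_by_ip_hash(src_ip: str, dst_ip: str, uplinks: list) -> str:
    """IP-hash style: the uplink depends on the source/destination pair."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return uplinks[(src ^ dst) % len(uplinks)]


uplinks = ["vmnic0", "vmnic1"]
# With port-ID teaming every IGMP join from this vNIC leaves on the same pNIC.
print(uplink_by_port_id(5, uplinks))
# With IP hash the choice can differ per multicast group.
print(uplink_by_ip_hash("10.0.0.1", "239.1.1.1", uplinks))
print(uplink_by_ip_hash("10.0.0.1", "239.1.1.2", uplinks))
```

With a pinned uplink, the pSwitch port that snoops the join is also the port the stream should come back on. With IP hash, the pSwitch has to treat the pNICs as one channel for this to work.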

In Closing

Multicast traffic is not one of those things that is talked about a lot in general virtualization implementations. There is great support for it now in vSphere with performance constantly being improved.

The main driving force behind multicast in virtualized environments is probably financial institutions that rely heavily on it for streaming of things like market data and video. With the use of 10Gb NICs and performance improvements in multicast handling, there is now very little stopping the virtualization of even the most demanding of multicast-based applications.

Nov 03 2011
 

There has been a lot of chatter in the blogosphere about the advent of Virtual eXtensible Local Area Network (VXLAN), all the vendors that contributed to the standard, and those that are planning on supporting the proposed IETF draft standard.  In the next couple of articles I will attempt to describe how VXLAN is supposed to work as well as give you an idea of when you should consider implementing it, and how to implement it in your VMware Infrastructure (VI).

VXLAN Basics:

The basic use case for VXLAN is to connect two or more layer three (L3) networks and make them look like they share the same layer two (L2) domain. This would allow for virtual machines to live in two disparate networks yet still operate as if they were attached to the same L2.  See section 3 of the VXLAN IETF draft as it addresses the networking problems that VXLAN is attempting to solve a lot better than I ever could.

To operate, VXLAN needs a few components in place:

  • Multicast support, IGMP and PIM
  • VXLAN Network Identifier (VNI)
  • VXLAN Gateway
  • VXLAN Tunnel End Point (VTEP)
  • VXLAN Segment/VXLAN Overlay Network

VXLAN is an L2 overlay over an L3 network. Each overlay network is known as a VXLAN Segment and is identified by a unique 24-bit segment ID called a VXLAN Network Identifier (VNI).  Only virtual machines on the same VNI are allowed to communicate with each other.  Virtual machines are identified uniquely by the combination of their MAC address and VNI.  As such it is possible to have duplicate MAC addresses in different VXLAN Segments without issue, but not in the same VXLAN Segment.
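
A small sketch of the identity rule: a 24-bit VNI gives 2^24 possible segments, and a VM is keyed by the pair (VNI, MAC). The registry class is hypothetical and the MAC addresses are example values.

```python
MAX_VNI = 2**24  # a 24-bit VNI allows 16,777,216 segments


class SegmentRegistry:
    """Tracks VMs by (VNI, MAC), the pair that identifies a VM uniquely."""

    def __init__(self):
        self._vms = {}  # (vni, mac) -> VM name

    def attach(self, vni: int, mac: str, vm_name: str) -> None:
        if not 0 <= vni < MAX_VNI:
            raise ValueError("VNI must fit in 24 bits")
        key = (vni, mac.lower())
        if key in self._vms:
            raise ValueError(f"duplicate MAC {mac} in VNI {vni}")
        self._vms[key] = vm_name


registry = SegmentRegistry()
registry.attach(864, "00:50:56:00:00:01", "vm1")
# Same MAC in a different segment is fine.
registry.attach(865, "00:50:56:00:00:01", "vm2")
print(MAX_VNI)
```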

 

Figure 1: VXLAN Packet Header

The original L2 packet that the virtual machine sends out is encapsulated in a VXLAN header that includes the VNI associated with the VXLAN Segment that the virtual machine belongs to.  The resulting packet is then wrapped in a UDP->IP->Ethernet packet for final delivery on the transport network.  Due to this encapsulation you can think of VXLAN as a tunneling scheme with the ESX hosts making up the VXLAN Tunnel End Points (VTEP).  The VTEPs are responsible for encapsulating the virtual machine traffic in a VXLAN header as well as stripping it off and presenting the destination virtual machine with the original L2 packet.

The encapsulation is comprised of the following modifications from standard UDP, IP and Ethernet frames:

Ethernet Header:

Destination Address – This is set to the MAC address of the destination VTEP if it is local, or to that of the next hop device, usually a router, when the destination VTEP is on a different L3 network.

VLAN – This is optional in a VXLAN implementation and will be designated by an ethertype of 0x8100 and have an associated VLAN ID tag.

Ethertype – This is set to 0x0800 as the payload packet is an IPv4 packet.  The initial VXLAN draft does not include an IPv6 implementation, but it is planned for the next draft.

IP Header:

Protocol – Set to 0x11 to indicate that the frame contains a UDP packet

Source IP – IP address of originating VTEP

Destination IP – IP address of target VTEP.  If this is not known, as in the case of a target virtual machine that the VTEP has not targeted before, a discovery process needs to be done by the originating VTEP.  This is done in a few steps:

    1. Destination IP is replaced with the IP multicast group corresponding to the VNI of the originating virtual machine
    2. All VTEPs that have subscribed to the IP multicast group receive the frame and decapsulate it, learning the mapping of source virtual machine MAC address to host VTEP
    3. The host VTEP of the destination virtual machine will then send the virtual machine’s response to the originating VTEP, using that VTEP’s IP address as the destination, as it learned this from the original multicast frame
    4. The Source VTEP adds the new mapping of VTEP to virtual machine MAC address to its tables for future packets
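
The discovery steps above can be sketched as a learning table: unknown destinations go to the segment's multicast group, and every decapsulated frame teaches the receiver where the sender lives. The class, the VTEP addresses and the multicast group below are hypothetical values chosen for illustration.

```python
class VtepTable:
    """Toy VTEP mapping table: (VNI, VM MAC) -> IP of the VTEP hosting that VM."""

    def __init__(self, vni_to_group: dict):
        self.vni_to_group = vni_to_group  # VNI -> IP multicast group
        self.mac_to_vtep = {}

    def outer_destination(self, vni: int, dst_mac: str) -> str:
        # Known MAC: unicast to its VTEP. Unknown: fall back to the VNI's group.
        return self.mac_to_vtep.get((vni, dst_mac), self.vni_to_group[vni])

    def learn(self, vni: int, inner_src_mac: str, outer_src_ip: str) -> None:
        # Called on decapsulation: remember which VTEP the sender sits behind.
        self.mac_to_vtep[(vni, inner_src_mac)] = outer_src_ip


groups = {864: "239.1.1.64"}
vtep1, vtep2 = VtepTable(groups), VtepTable(groups)
vm1, vm2 = "00:50:56:00:00:01", "00:50:56:00:00:02"

print(vtep1.outer_destination(864, vm2))  # unknown, so the multicast group
vtep2.learn(864, vm1, "10.0.0.1")         # step 2: receivers learn VM1's VTEP
vtep1.learn(864, vm2, "10.0.1.1")         # step 4: source learns VM2's VTEP
print(vtep1.outer_destination(864, vm2))  # now a unicast VTEP address
```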

UDP Header:

Source Port – Set by transmitting VTEP

VXLAN Port – IANA assigned VXLAN Port.  This has not been assigned yet

UDP Checksum – This should be set to 0x0000. If the checksum is not set to 0x0000 by the source VTEP, then the receiving VTEP should verify the checksum and, if it is not correct, the frame must be dropped and not decapsulated.

VXLAN Header:

VXLAN Flags – Reserved bits set to zero except bit 3, the I bit, which is set to 1 for a valid VNI

VNI – 24-bit field that is the VXLAN Network Identifier

Reserved – A set of fields, 24 bits and 8 bits, that are reserved and set to zero
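
As a minimal sketch of the 8-byte VXLAN header described above, the fields pack into two 32-bit words: flags plus 24 reserved bits, then the 24-bit VNI plus 8 reserved bits. The function names are my own.

```python
import struct

VXLAN_FLAG_I = 0x08  # bit 3 of the flags byte: the VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits).
    # Word 2: VNI (24 bits) + reserved (8 bits).
    return struct.pack("!II", VXLAN_FLAG_I << 24, vni << 8)


def unpack_vxlan_header(header: bytes) -> int:
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_I:
        raise ValueError("I flag not set; VNI is not valid")
    return word2 >> 8


header = pack_vxlan_header(864)
print(header.hex(), unpack_vxlan_header(header))
```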

Putting it Together:

 

Figure 2: VM to VM communication

When VM1 wants to send a packet to VM2, it needs the MAC address of VM2. This is the process that is followed:

  • VM1 sends an ARP packet requesting the MAC address associated with 192.168.0.101
  • This ARP is encapsulated by VTEP1 into a multicast packet to the multicast group associated with VNI 864
  • All VTEPs see the multicast packet and add the association of VTEP1 and VM1 to their VXLAN tables
  • VTEP2 receives the multicast packet, decapsulates it, and sends the original broadcast on portgroups associated with VNI 864
  • VM2 sees the ARP packet and responds with its MAC address
  • VTEP2 encapsulates the response as a unicast IP packet and sends it back to VTEP1 using IP routing
  • VTEP1 decapsulates the packet and passes it on to VM1

At this point VM1 knows the MAC address of VM2 and can send directed packets to it as shown in Figure 2: VM to VM communication:

  1. VM1 sends the IP packet to VM2 from IP address 192.168.0.100 to 192.168.0.101
  2. VTEP1 takes the packet and encapsulates it by adding the following headers:
    • VXLAN header with VNI=864
    • Standard UDP header with the UDP checksum set to 0x0000 and the destination port being the VXLAN IANA designated port.  Cisco N1KV is currently using port ID 8472.
    • Standard IP header with the Destination being VTEP2’s IP address and Protocol 0x11 for the UDP packet used for delivery
    • Standard MAC header with the MAC address of the next hop.  In this case it is the router Interface with MAC address 00:10:11:FE:D8:D2 which will use IP routing to send it to the destination
  3. VTEP2 receives the packet as it has its MAC address as the destination.  The packet is decapsulated and found to be a VXLAN packet due to the UDP destination port.  At this point the VTEP will look up the associated portgroups for VNI 864 found in the VXLAN header.  It will then verify that the target, VM2 in this case, is allowed to receive frames for VNI 864 due to its portgroup membership and pass the packet on if the verification passes.
  4. VM2 receives the packet and deals with it like any other IP packet.
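
The send and receive sides of the steps above can be sketched as follows. For brevity only the UDP and VXLAN headers are built; the outer IP and Ethernet headers are left out. Port 8472 is the value the post mentions for Cisco N1KV, and the MAC addresses, source port and membership table are hypothetical.

```python
import struct

VXLAN_PORT = 8472  # port mentioned above for Cisco N1KV; assumed for this sketch


def encapsulate(inner_frame: bytes, vni: int, src_port: int) -> bytes:
    """Step 2: prepend VXLAN and UDP headers (checksum left at 0x0000)."""
    vxlan = struct.pack("!II", 0x08 << 24, vni << 8)
    length = 8 + len(vxlan) + len(inner_frame)
    udp = struct.pack("!HHHH", src_port, VXLAN_PORT, length, 0x0000)
    return udp + vxlan + inner_frame


def receive(udp_datagram: bytes, members_by_vni: dict):
    """Step 3: return (vni, inner_frame) if the target may receive it, else None."""
    _src, dst_port, length, _csum = struct.unpack("!HHHH", udp_datagram[:8])
    if dst_port != VXLAN_PORT:
        return None  # not a VXLAN packet
    word1, word2 = struct.unpack("!II", udp_datagram[8:16])
    if not (word1 >> 24) & 0x08:
        return None  # I flag clear, VNI not valid
    vni = word2 >> 8
    inner = udp_datagram[16:length]
    dst_mac = inner[:6]
    # Verify the target VM belongs to a portgroup associated with this VNI.
    if dst_mac not in members_by_vni.get(vni, set()):
        return None
    return vni, inner


vm1_mac = bytes.fromhex("005056000001")
vm2_mac = bytes.fromhex("005056000002")
frame = vm2_mac + vm1_mac + b"\x08\x00" + b"payload"
datagram = encapsulate(frame, vni=864, src_port=49152)
print(receive(datagram, {864: {vm2_mac}}) is not None)
```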

The return path for packets from VM2 to VM1 would follow the same IP route through the router on the way back.

 

Mar 23 2011
 

Introduction:

Networking in VMware Cloud Director (vCD) is the most complex and least understood component of its architecture.  Among the different networking options was the introduction of vCD Network Isolation (vCDNI).  In this blog I am going to attempt to give you a better understanding of vCDNI and how it works technically.

Background:

vCDNI is an overlay networking scheme that builds on top of an existing Ethernet network to provide isolated networking for Virtual Machines.  An overlay network is a virtual network of devices or connections that rides on top of an existing networking infrastructure.  An example would be a data connection that rides on top of a voice connection in a telephone network.

The scheme used for vCDNI is a MAC-in-MAC encapsulation that allows for many isolated networks, up to 16 million in vCD 1.0, to be created and isolated on top of one Ethernet network.

If you are familiar with Lab Manager’s networking, vCDNI is based on the same technology used in Host Spanning Private Networks found in Lab Manager 4. See LabManager4_Whats_New.pdf for details on this.  VM traffic encapsulation is done in the fast data path by a vCDNI dvfilter operating in the vmkernel.  There is no VM operation required for this.  Lab Manager required a service VM to be present in the control path whereas vCD does not require such a VM.

Technical Breakdown:

vCDNI networks are implemented as portgroups in vSphere with each network designated a unique network ID, also known as a Fence ID.  In vCenter the portgroup would have a name like:

dvs.VC1055813345DVS2CM1-V99-F12-Emca Internet

The breakdown of the portgroup name is:

VC1055813345      vCenter ID
DVS2              Distributed switch number
CM1               vCD Instance
V99               Transport VLAN
F12               Network (Fence) ID
Emca Internet     vCD Network Name
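
If you need to pull these fields out in a script, a regular expression along these lines should do it. The pattern is inferred from the single example name above, so treat it as an assumption; names in your environment may not match it exactly. The field names are my own.

```python
import re

# Pattern inferred from the one example name in this post.
PORTGROUP_RE = re.compile(
    r"^dvs\.VC(?P<vcenter_id>\d+)DVS(?P<dvs_number>\d+)CM(?P<vcd_instance>\d+)"
    r"-V(?P<transport_vlan>\d+)-F(?P<fence_id>\d+)-(?P<network_name>.+)$"
)


def parse_portgroup(name: str) -> dict:
    match = PORTGROUP_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised vCDNI portgroup name: {name!r}")
    fields = match.groupdict()
    # Everything except the network name is numeric.
    for key in ("vcenter_id", "dvs_number", "vcd_instance", "transport_vlan", "fence_id"):
        fields[key] = int(fields[key])
    return fields


print(parse_portgroup("dvs.VC1055813345DVS2CM1-V99-F12-Emca Internet"))
```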

vCD assigns a unique Network ID to each vCDNI network and uses a different vSphere ephemeral portgroup for each network. With an ephemeral portgroup, a virtual port will be created and assigned to a Virtual Machine when it is powered on, and will be deleted when it is powered off. An ephemeral portgroup has no limit on the number of ports that can be a part of it other than the vCenter supported limit.  As such the practical upper bound for concurrent vCDNI networks is the maximum number of supported ephemeral portgroups in vCenter, currently 1016 in vSphere 4.1.

The isolation inherent in vCDNI is achieved by re-encapsulating the original MAC frames from Virtual Machines in a vCDNI frame to create a new MAC-in-MAC frame. Part of the new vCDNI header is a network ID that identifies which isolation network the packet belongs to.  The final frame looks like this:

The vCDNI MAC header has the assigned virtual MAC addresses of the destination ESX(i) server and identifying information of the source ESX(i) server corresponding to the destination and source Virtual Machines respectively.

The vCDNI data header contains vCDNI protocol specific data such as sequence and version numbers and vCDNI Network ID.

The following example depicts the vCDNI packet flow:

  • Virtual Machine sends a packet out of its virtual NIC
  • dvfilter adds vCDNI MAC and Data headers to create the overlay vCDNI network
  • DVS adds the transport VLAN tags to the packet, if needed, and passes it on to the physical NIC
  • Destination ESX server Physical NIC gets the packet
  • DVS strips off the transport VLAN tags and passes it up to the dvfilter
  • dvfilter strips the vCDNI protocol data and passes the packet on to the destination VM

 

The added vCDNI header increases the maximum size of the frame by 24 bytes.  To ensure that there is no fragmentation due to oversized frames, the Ethernet Maximum Transmission Unit (MTU) needs to be increased by 24 bytes from its default of 1500 bytes to 1524 bytes. The MTU increase needs to be implemented end to end. This MTU change needs to be done in three places:

  • At the vCD level, by changing the Network Pool MTU

  • At the vCenter level, by modifying the advanced properties of the DVS hosting the Network pool

  • At the physical switches, depending on the switch implementation, by increasing the MTU of the switch or ports used by the DVS hosting the network pool

a.  On a Cisco Catalyst switch

Catalyst_Switch(config)# system mtu 1524
Catalyst_Switch(config)# exit
Catalyst_Switch# reload

b.  On a Cisco Native IOS switch

NativeIOS_Switch(config)# interface gigabitEthernet 1/1
NativeIOS_Switch(config-if)# mtu 1524
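
Since the MTU has to be raised end to end, a quick sanity check of the arithmetic can help when auditing a path. This is only a sketch; the hop names and MTU values in the example are made up.

```python
VCDNI_OVERHEAD = 24  # bytes added by the vCDNI header
DEFAULT_MTU = 1500   # standard Ethernet MTU


def required_mtu(payload_mtu: int = DEFAULT_MTU) -> int:
    """MTU the transport path needs so encapsulated frames are not fragmented."""
    return payload_mtu + VCDNI_OVERHEAD


def undersized_hops(path_mtus: dict, payload_mtu: int = DEFAULT_MTU) -> list:
    """Return the names of hops whose MTU is too small for vCDNI frames."""
    needed = required_mtu(payload_mtu)
    return [name for name, mtu in path_mtus.items() if mtu < needed]


path = {"vCD network pool": 1524, "DVS": 1524, "physical switch": 1500}
print(required_mtu(), undersized_hops(path))
```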

Implementation Consideration:

vCDNI is the optimal option in any environment where you need to create Virtual Machine networks without consuming VLANs in the process.  I would recommend that vCDNI network pools back all Organization and vApp networks. (See my previous blog on the definitions).  This would allow for the definition of thousands of networks without the actual consumption of VLANs, other than for the transport network.

By implementing the vCD dynamic portgroups on an “Organization” DVS and the static portgroups on a “Provider” DVS you will be able to isolate the dynamic portgroups to just the “Organization” DVS and make the MTU changes only on the infrastructure supporting the “Organization” DVS.  Doing this will also allow you to maximize the number of ports you can create in one vCenter as the “Organization” DVS will use ephemeral portgroups and the “Provider” DVS will use dynamic, preferably, or static portgroups.

Security Consideration:

The vCDNI transport network needs to be an isolated set of switches or a dedicated non-routed VLAN.  For practical reasons a dedicated non-routed VLAN will scale much better in a large vCD deployment.

The argument behind the isolated switches or non-routed VLAN recommendation is that vCDNI does not encrypt its packets and just overlays the different isolated networks on top of the transport network.  If a physical machine, or Virtual Machine for that matter, has access to the transport network, it would be trivial to disassemble the packets, as well as inject packets into any of the isolated networks.  This is not unique to vCDNI though, as Ethernet packets are not encrypted either, making it trivial to spoof Ethernet packets if you have access to the transport network.  The same applies to VLANs by the way, as access to the VLAN transport network (Native VLAN of the trunk) will allow you to inject packets into any VLAN allowed on that trunk.

If you are not using vCDNI backed pools, I would strongly recommend that you isolate all vCD networks by using unique VLANs for each network to ensure tenant isolation.

It is very important to understand that one of the biggest risks to the virtual environment is misconfiguration, not the technology. Thus you need strong audit controls to ensure that you avoid misconfiguration, either accidental or malicious.

Summary:

The picture below shows logically where in the traffic flow the dvfilter is inserted

To get per host vCDNI statistics use:

/opt/vmware/vslad/fence-util

This utility can be used to display:

  1. Configured networks and their MTUs
  2. Active ports and their port IDs
  3. Switch state including inside and outside MAC addresses
  4. Port statistics on a port ID basis

For help with the command, type it with no options.

Acknowledgements:

This blog would not have been possible without the technical assistance of Anupam Dalal and the feedback of Michael Haines.