Tuesday, May 25, 2010

Hyper-V: .vhd, Direct iSCSI, and SCSI Passthrough

Many of my peers have debated the three basic storage connectivity options for Hyper-V for months. After much debate, I decided to jot down some ideas that directly address the concerns around SCSI passthrough vs. iSCSI in-guest initiator access vs. VHD. I approach the issues from two vantage points, then make some broad generalizations, draw conclusions, and offer my sage wisdom ;-)
  1. Device management
  2. Capacity limitations
  3. Recommendations

Device management:
  • SCSI passthrough devices are drives presented to the parent partition and then assigned to a specific child VM; the child VM then “owns” the disk resource. The drawback of this architecture is the “protection” of the device: because not ALL SCSI instructions are passed into the child, array-based management techniques cannot be used. EMC Replication Manager, for example, cannot snap/clone the LUN because it cannot effectively communicate with the child VM. On the other hand, array-based replication technologies CAN still be used. For example, the SCSI passthrough device can be failed over to a surviving Hyper-V node, either locally for high availability or remotely for disaster recovery. Both RecoverPoint and MirrorView support cluster-enabled automated failover.
  • ...and now, the rest of the story: Both Fibre Channel and iSCSI arrays can present storage devices to a Hyper-V parent; however, differences in total bandwidth ultimately divide the two technologies. iSCSI depends on two techniques for increasing bandwidth past the 1Gbps (roughly 60MB/s of real-world throughput) connection speed of a single pathway: (1) iSCSI Multiple Connections per Session (MCS) and (2) NIC teaming. Most iSCSI targets (arrays) are limited to four iSCSI pathways per controller, so even when MCS or NIC teaming is used, the maximum bandwidth the parent can bring to its child VMs is about 240MB/s. That is a non-trivial amount, but it is a four-NIC total for the entire HV node, not per HV child! On the other hand (not the LeftHand...), Fibre Channel arrays and HBAs are equipped with dual 8Gbps interfaces, and each interface can produce a whopping 720MB/s of sustained bandwidth when copying large-block IO. In fact, an 8Gbps interface can carry over 660MB/s with 64KB IOs and only slightly less as IO sizes drop to 8KB and below. When using Hyper-V with EMC CLARiiON arrays, EMC PowerPath software provides advanced pathway management and “fuses” the two 8Gbps links together, bringing more than 1,400MB/s to the parent and child VMs (see the bandwidth sketch after this list). In addition, because FC uses a purpose-built lossless network, there is never competition for the network, switch backplane, or CPU.
  • An iSCSI in-guest initiator presents the “data” volume to the child VM through the parent’s networking, out to an external storage device (CLARiiON, Windows Storage Server, NAS device, etc.). iSCSI in-guest device mapping is Hyper-V’s “expected” pathway for presenting data volumes to virtual machines, and it offers the richest feature set from a storage perspective: array-based clones and snaps can be taken with ease, for example. With iSCSI devices there are no management limitations for Replication Manager; snaps and clones can be managed by the RM server and the array. Devices can be copied and/or mounted to backup VMs, presented to Test/Dev VMs, and replicated to DR sites for remote backup.
  • ...and now, the rest of the story: an iSCSI in-guest initiator must use the parent’s CPU to packetize/depacketize the data in the IP stream (or use the dedicated resources of a physical TCP offload engine (TOE) NIC in the HV host). This additional overhead usually goes unnoticed, except during high-IO operations such as backups, restores, data loads, and data dumps. Keep in mind that jumbo frames must be passed end to end, from the storage array, through the network layer, into each guest. Furthermore, each guest/child must use four or more virtual NICs to obtain iSCSI bandwidth near the 240MB/s target. The CPU cycles an in-guest initiator consumes are often 3-10% of the child’s CPU usage, and the more child VMs there are, the more parent CPU will be devoted to packetizing data (a rough estimate follows after this list).
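
To put the bandwidth comparison above in concrete terms, here is a minimal Python sketch that simply aggregates the per-path throughput figures quoted in the bullets (roughly 60MB/s per 1Gbps iSCSI path and roughly 720MB/s per 8Gbps FC port). The per-path numbers are the working assumptions from the discussion, not measurements.

    # Aggregate-bandwidth comparison for a single Hyper-V parent, using the
    # per-path throughput assumptions quoted above (not measured values).

    def aggregate_bandwidth(per_path_mb_s, paths):
        """Best-case aggregate throughput when every path is driven in parallel."""
        return per_path_mb_s * paths

    # iSCSI: ~60 MB/s per 1Gbps path, four paths per controller via MCS or NIC teaming.
    iscsi_total = aggregate_bandwidth(60, 4)     # 240 MB/s for the entire HV node

    # Fibre Channel: ~720 MB/s sustained per 8Gbps port, two ports fused by PowerPath.
    fc_total = aggregate_bandwidth(720, 2)       # 1440 MB/s to the parent and its children

    print("iSCSI (4 x 1Gbps): ~%d MB/s" % iscsi_total)
    print("FC    (2 x 8Gbps): ~%d MB/s" % fc_total)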
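
The parent-CPU cost of in-guest iSCSI also scales with the number of children. A small sketch of that scaling, assuming the 3-10% per-child figure quoted above (the helper name and the VM counts are purely illustrative):

    # Rough estimate of the parent CPU consumed by in-guest iSCSI packetization,
    # assuming each child spends 3-10% of one CPU on the IP stack (figure quoted above).

    def packetization_overhead_cores(child_vms, per_child_pct):
        """Approximate number of full parent cores spent packetizing iSCSI traffic."""
        return child_vms * per_child_pct / 100.0

    for vms in (5, 10, 20):
        low = packetization_overhead_cores(vms, 3)
        high = packetization_overhead_cores(vms, 10)
        print("%2d child VMs: ~%.1f to %.1f parent cores" % (vms, low, high))
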
Capacity limitations:
  • VHDs have a well-known limit of 2TB. iSCSI and SCSI passthrough devices are not limited to 2TB and can be formatted to 16TB or more, depending on the file system chosen. Beyond Hyper-V’s three basic VM connectivity types, there is the concept of the Cluster Shared Volume (CSV). Multiple CSVs can be deployed, but their primary role in Hyper-V is to store virtual machines, not child VM data. CSVs can be formatted with GPT and allowed to grow to 16TB.
  • ...and now, the rest of the story: of course, in-guest iSCSI and SCSI passthrough are mutually exclusive with CSVs. VHDs can sit on a CSV, but a CSV cannot present “block storage” to a child, so using a CSV implies that nothing on it will be larger than 2TB. Furthermore, above 2TB, recoverability becomes more important than the size of the volume. Recovering a >2TB device at 240MB/s, for example, can take as little as 2.9 hours and usually as much as 8.3 hours, depending greatly on the number of threads the restoration process can run; >2TB restorations can take more than 24 hours if threading cannot be maximized (see the back-of-the-envelope calculation below). To address capacity issues in file-serving environments, a Boston-based company called Sanbolic has released a file system alternative to Microsoft’s CSV called Melio 2010. Melio is purpose-built for clustered storage presented to Hyper-V servers that serve files; it provides multi-node locking, QoS, and enterprise reporting. http://www.sanbolic.com/Hyper-V.htm Melio is amazing technology, but it honestly does nothing to “fix” the 2TB limit of VHDs.
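
To make the recovery-time point concrete, here is a back-of-the-envelope restore-time calculation in Python. The effective throughput, and therefore the 2.9 to 8.3 hour spread mentioned above, depends heavily on how well the restore can be multi-threaded; the rates used below are illustrative assumptions, not benchmarks.

    # Back-of-the-envelope restore time for a large data volume.
    # Effective throughput depends on how well the restore is threaded;
    # the rates used here are illustrative assumptions.

    def restore_hours(capacity_tb, throughput_mb_s):
        """Hours needed to restore capacity_tb terabytes at a sustained MB/s rate."""
        megabytes = capacity_tb * 1024 * 1024
        return megabytes / throughput_mb_s / 3600.0

    # A 2TB volume at the full 240 MB/s iSCSI aggregate: roughly 2.4 hours.
    print("%.1f hours" % restore_hours(2, 240))

    # The same volume through a single, poorly threaded ~70 MB/s stream: roughly 8.3 hours.
    print("%.1f hours" % restore_hours(2, 70))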

Conclusion/Recommendations
  • iSCSI in-guest initiators should be used where cloning and snapping of data volumes is paramount to the operations of the VM under consideration. SQL Server and SharePoint are two primary examples.
  • FC-connected SCSI passthrough devices should be used when high-bandwidth applications are being considered.
  • Discrete array-based LUNs should always be presented for all valuable application data. Array-based LUNs allow cluster failover of discrete VMs with their data as well as array-based replication options.
  • CSVs should be used for "general purpose" storage of Virtual Machine boot drives and configuration files.
  • Sanbolic Melio FS 2010 should be considered for highly versatile clustered shared storage.