Friday, December 16, 2005

Graphics and I/O virtualization

A colleague of mine just pointed out a neat screenshot from Jacob Hansen, whom I had the pleasure of meeting at SOSP. Rich virtualization of I/O in general, and graphics in particular, remains a pretty wide open area. While Workstation 5.5 includes some basic support for passing the host's video hardware through to the guest, the user experience is still recognizably VMware-esque; i.e., the guest runs in a big opaque rectangle inside a GUI, or in full-screen mode, but the 'in between' possibilities are largely unexplored.

There's also an instructive discussion taking place over on Xen-devel. One of the interesting things about graphics cards is that they present a relatively natural place to exploit paravirtualization techniques; since all OS'es have well-defined graphics driver APIs, and graphics performance has a huge impact on the user experience, the risk/reward of "true virtualization" (e.g., deciding to precisely emulate a Matrox Flibbertigibbet 2000) vs. pseudo-virtualization (emulating some made-up device that makes getting bits out of the guest and onto the host display hardware easy and fast) strongly favors the latter. The same is true, albeit to a lesser extent, with storage controllers and networking adapters.

The downside of making up your own hardware is that guest OS installers can't find a driver for your pseudo-device. Since running an installer is often the first experience a user has with VMware, we want to get that guest installed, on the web, and generally looking smooth as quickly as possible. For that reason, we have a "morphing" network card. The network card initially looks like an AMD Lance, and can be driven by the stock Lance drivers available with everything after Windows for Workgroups or so. However, once you install VMware Tools in the guest, we install a driver that colludes with the Lance to use a more streamlined, paravirt-style I/O path.
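
To make the morphing idea concrete, here is a toy Python model. It is a sketch under loud assumptions, not a description of VMware's implementation: the class name `MorphingNic`, the `FAST_PATH_MAGIC` handshake value, and the cost model (one trap into the VMM per frame on the emulated path, one per batch on the fast path) are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical handshake value. The real mechanism is not described here.
FAST_PATH_MAGIC = 0x564D


@dataclass
class MorphingNic:
    """Toy virtual NIC that starts out looking like a stock device."""

    mode: str = "lance"  # what the guest currently sees
    exits: int = 0       # simulated traps into the VMM
    wire: List[bytes] = field(default_factory=list)  # frames "sent"

    def write_register(self, value: int) -> None:
        """Model a register write; only a tools-style driver writes the magic."""
        self.exits += 1
        if value == FAST_PATH_MAGIC:
            self.mode = "paravirt"

    def send(self, frames: List[bytes]) -> None:
        """Transmit frames, counting traps according to the current mode."""
        if self.mode == "lance":
            # Emulated path (assumed cost model): one trap per frame.
            for frame in frames:
                self.exits += 1
                self.wire.append(frame)
        else:
            # Paravirt path (assumed cost model): one trap per batch.
            self.exits += 1
            self.wire.extend(frames)
```

A stock driver never writes the magic value, so the device keeps behaving like a plain Lance and the installer is happy. A driver installed later calls `write_register(FAST_PATH_MAGIC)` once, after which each `send` costs a single simulated trap regardless of batch size.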

DTrace for Linux! Sort of!

My homey/colleague at Sun Adam Leventhal has posted a barn-stormer of an article about the multiplicative power of DTrace and Solaris' new Linux binary compatibility layer (somewhat unfortunately termed "BrandZ"). Adam's post follows the usual script for DTrace demos:
  1. Take some innocent user-level application that mostly performs ok.
  2. Demonstrate a head-scratching anomaly in this mostly-ok performance.
  3. Make the developers look like total idiots.

This last item is a natural consequence of being able to see more deeply into the dynamic behavior of a software system than its creators. Similarly, working on a virtual machine monitor one sees a good deal of unintentional behavior with serious performance consequences. This is not because the people working on top, or glibc, or the Linux kernel for that matter, are idiots. Far from it. (Mostly.) They're just working with inferior tools, like biologists before the invention of the microscope. Creating quality software is one of the most serious untackled technical challenges out there; to the extent that DTrace makes this more possible, it's a real reason to use Solaris.

I've got to hand it to Sun; circa 2000, I thought OS'es were a solved problem. Yeah, we beat our chests over Solaris or Linux or Windows or AIX or whatever, but they're all basically the same junk: processes, users, threads, networking, multiprocessors, filesystems, virtual memory, linkers, etc. While all that stuff is blindingly hard to get right, implementing it is, at some level, a simple matter of programming. So, props to the boffins at Sun for actively fighting commodification: ZFS and DTrace are real reasons to favor Solaris over other OS'es! For the first time since Mac OS X shipped, there are legitimate, technical reasons for an OS to claim fundamental superiority. Now BrandZ helps overcome excuses for not using Solaris. Of course, Solaris is free, so how they turn this into cash flow is another question.

(And no, smart-alecks, I don't own any SUNW stock or derivatives. Err, I guess I probably do through some index fund somewhere. But you get the point; I'm not randomly text-messaging strangers that "SUNW is gng to $25+++++!!!!!111".)

Lest I get too overheated here, my favorite Linux application won't run under BrandZ, presently. The kernel-level portion of VMware hasn't been ported to Solaris. However, there's at least conceptual hope. We ship the source to vmmon along with the Linux hosted products; we have to, because there's no such thing as a Linux kernel ABI. Back in the Workstation 2.0 days, some intrepid FreeBSD folks took it upon themselves to port vmmon to FreeBSD, and were able to use FreeBSD's Linux compatibility layer to run the VMware binary. Perhaps some similarly motivated OpenSolaris folks will get around to doing the same?

Friday, December 02, 2005

Standing on the shoulders of giants.

Most folks realize that IBM, DEC, and several other old-school computer manufacturers were thoroughly exploring virtualization around the time my generation was thoroughly exploring our mothers' wombs. Working at VMware for the last six years, I've been constantly aware that much of the ground we're covering was well-worn decades ago. Still, at times, the fidelity of the echoes through the generations amazes me.

I've written a lot here about VT, Intel's recently shipped CPU virtualization hardware. I'm pretty intimately familiar with VT's gory guts, as well as those of AMD's Pacifica, after spending a good chunk of my career at VMware extending our VMM to support them. (Sorry to disappoint, AMD/Intel fanboys: the two specifications are pretty much exactly the same, with different instructions and in-memory layouts, etc.)

I was also faintly aware that IBM had done some work to accelerate virtualization "back in the day." But, I was utterly shocked at the familiarity of this paper (Osisek, Jackson, Gum, in IBM Systems Journal, March, 1991). It describes interpretive execution, which is IBM's name for the S/390 virtualization acceleration hardware. What's fascinating is that "interpretive execution" so closely resembles Pacifica, and in turn VT, that you can mechanically translate among them.

What VT calls "non-root mode", and Pacifica calls "guest mode", IBM calls "interpretive execution" (which, by the way, joins a long list of nuttily technical-sounding, yet completely non-descriptive terms that I associate with IBM; it's right up there with "translation lookaside buffer"). VT's "vmlaunch" instruction is Pacifica's "vmrun", which in turn is S/390's Germanic-flavored "sie"; Intel's "VMCS" is AMD's "VMCB", which is IBM's "state description" (another hilarious IBM-ism).
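
The correspondence is regular enough to write down as a lookup table. Here is a small Python sketch containing only the three rows named above; the `ROWS` table layout, the `vt`/`pacifica`/`s390` keys, and the `translate` helper are hypothetical names chosen for illustration.

```python
# Each row holds one concept under its three vendor-specific names.
ROWS = [
    {"vt": "non-root mode", "pacifica": "guest mode", "s390": "interpretive execution"},
    {"vt": "vmlaunch", "pacifica": "vmrun", "s390": "sie"},
    {"vt": "VMCS", "pacifica": "VMCB", "s390": "state description"},
]


def translate(term: str, src: str, dst: str) -> str:
    """Translate a term from one vendor's vocabulary into another's."""
    for row in ROWS:
        # Case-insensitive match so "VMLAUNCH" and "vmlaunch" both work.
        if row[src].lower() == term.lower():
            return row[dst]
    raise KeyError(f"{term!r} is not a known {src} term")


print(translate("vmlaunch", "vt", "s390"))  # sie
```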

The paper also provides something of a crystal ball, describing some interesting extensions that haven't made their way into the x86 vendors' hardware just yet: hardware support for a second level of address translation to support the paged MMU within the guest (which is described, albeit briefly, in the Pacifica spec); hardware SMP guest support (though this might be IBM-specific; it seems to be oriented towards implementing a semi-magical "tlb shootdown" instruction that has no analog on the x86); and I/O acceleration (though again, who knows how applicable this will be to the modern world; the described facilities seem oriented entirely towards pass-through of physical devices, and then only for a single, blessed guest on a given host, which, as Xen demonstrates, can already be implemented today). Everything old is new again...