Sunday, October 30, 2005

Gems of SOSP

Well, it was certainly a stimulating conference. There was a wealth of interesting and well-presented research. Here are some of the papers that interested me most. Insert disclaimer here: I'm just some guy who writes code for a living, what do I know, I'm sure I missed the point of your amazing paper, etc.

Vigilante (Costa et al., Microsoft Research) is a system for automatically detecting internet worms and stopping their spread. While network security is not a field in which I can even attempt to sound intelligent, the paper and talk were very convincing; "OK," I thought at the talk's conclusion, "go ahead and implement this and fix the internet." See an intriguing blog post from someone who might know what they're talking about...

IntroVirt (Joshi, King, Dunlap, and Chen, University of Michigan) is another clever application of virtualization from Peter Chen's group. They've modified user-mode Linux so that user-level processes within the UML guest can be instrumented with arbitrary code injected by the administrator of the VM. They present this system in a very narrow light: they use it to detect intrusions in known-vulnerable but as-yet-unpatched software. It seems to me that the ability to run arbitrary code on arbitrary events in the guest is more generally powerful than this; I suspect the '*Virt*' crowd at Michigan has realized this, too.

Honeyfarm (Vrable et al., UCSD) is another intriguing application of virtualization. The described system uses a virtual machine "fork" primitive to create the external appearance of a very large network of vulnerable "honeypot" machines. Like the UNIX system call, this fork primitive returns twice, once in the calling virtual machine, and once in a newborn virtual machine that is in most respects a copy of its parent. Exploiting copy-on-write techniques and the fact that very few IP addresses are usually active at any given time allows them to achieve a very high virtual-to-physical resource ratio.
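For anyone who hasn't played with it, here is the UNIX behavior that the VM fork is modeled on. This is a minimal sketch of plain POSIX fork(), not of the Honeyfarm primitive; the helper name fork_and_collect and the exit-code plumbing are made up for illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * fork() returns twice: 0 in the child, the child's pid in the parent.
 * The child exits with child_code; the parent waits for it and returns
 * that code (or -1 on any failure).
 */
int fork_and_collect(int child_code)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;              /* fork failed: only one return */
    if (pid == 0)
        _exit(child_code);      /* second return: the child copy */

    /* First return: still in the parent. */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Everything up to the fork() call is shared history; only the return value tells the two copies apart, which is the property the honeypot VMs rely on.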

Rx (Qin, Tucek, Sundaresan, Zhou, UIUC) is a slightly quirky take on recovering from software failures. They observe that many software failures are easy to observe after the fact (e.g., they cause a SEGV), and that they are frequently caused by a small set of programmer errors (buffer overflows, timing assumptions, uninitialized variables, etc.). Rx does a process-level checkpoint of a server at connection establishment time, and, in the case of failure, reverts to the checkpoint, randomly perturbing the execution environment in the hopes of perturbing away the failure. (A proxy intermediates between client and server to provide the illusion of seamlessness.) According to the paper, this works much better than intuition suggests is possible. It's sort of like failure-oblivious computing, but without the, err, obliviousness.

Tuesday, October 25, 2005

House -- The Haskell OS

I'm a systems guy. In my professional life, as per this blog's title, I necessarily am writing software at a pretty grubby level of abstraction, typically in C and assembly. So, it often surprises folks that I carry a torch for functional languages in general, and Haskell in particular. It all dates back to my wasted youth at OGI. I was porting Linux to the i960 on behalf of an active networking project that, to the best of my knowledge, never really got off the ground. My cube was adjacent to the warrens of OGI's fanatical Haskell lovers. While I was too preoccupied with my work at the time, the sheer wild-eyed passion these folks felt about Haskell made a strong impression on me.

Then, they got one of my friends. Michael did a substantial programming project in Haskell, and came away convinced that this was how software ought to be. While I've never created real software in Haskell, I've taken great pleasure in, e.g., fooling around with Haskore. I've often idly wondered whether it made sense to contemplate doing systems programming in Haskell; well, those crazy folks across the way at OGI have beaten me to it. The House poster at SOSP was one of the more popular exhibits, and I for one spent a good twenty minutes or so sputtering half-formed questions such as, "so, like, you can use like, eval in your network drivers!", after which my head exploded.

Update: well, I got the boot floppy image going inside a VM, albeit briefly before the guest hung. So there are some bugs to work out; big freaking deal. Page-table manipulation code as gorgeous as this looks like a cold beer on a hot summer's day to me; and having the OS at large interact with it through this interface makes my heart sing. These guys badly need an e1000 or vmnet ethernet driver; any takers?

Monday, October 24, 2005

Minix3 Available as VM Image

Andy Tanenbaum's keynote at SOSP was more of the same "Software is garbage! We should all be ashamed!" hand-wringing that's become fashionable of late. A good chunk of his speech was concerned with plugging the new version of Minix. You might be wondering what's changed with Minix. I am too. T'baum's speech was an unreconstructed rehash of the arguments on behalf of microkernels straight out of 1993: the system will survive driver crashes, because the drivers are just user-level processes!

One of the formats in which Minix 3 can be downloaded is as a VMware VM ready to run in the VMware Player. I downloaded it to my laptop, and was absolutely gobsmacked by how fast this thing boots. Linux and Windows have conditioned me to think of a reboot as serious computation. It is also neat to see how understandable and clean the system's source is, and how rapidly it is possible to build the entire system from scratch. If you're an OS enthusiast, you owe it to yourself to fool around with this.

Sunday, October 23, 2005

Pioneer and Virtualization

The very first paper in SOSP '05 is a fine example of why a deep understanding of virtualization is now a necessity for doing systems work. A number of folks from CMU (Seshadri, Luk, Shi, Perrig, and Khosla) and one from IBM (van Doorn) describe an enterprising system they name "Pioneer." Pioneer aims to do what AMD's SVM and Intel's LT presume is impossible: reliably measure and attest modern, mostly unmodified kernels. It's a really tough problem, and their approach is novel. However, on reading their paper, I'm convinced VMware's virtual machine monitor is an existence proof of the system's vulnerability. The paper considers this possibility, and rejects it, based on some unfounded assumptions about modern VMMs.

Pioneer's central code-tampering prevention technique is external performance measurement: an external entity challenges the checksum code to attest to the state of the kernel. The external entity not only issues the challenge, but, using very nitty-gritty knowledge about the speed and microimplementation of the challenged host, issues the challenge with a time limit. Executing for longer than the time limit constitutes evidence of tampering, leading the external entity to consider the machine compromised. A corollary of this design is that the checksum function must be micro-architecturally optimal for the hardware on which it is run. If it leaves any spare execution resources unutilized, an attacker might use those parallel execution resources to paper over the holes that he has punched.
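Boiled down, the verifier's side of this bargain is a two-part test: right answer, and right answer fast enough. The sketch below is my own paraphrase of that decision, not code from the paper; the names (judge_response, the verdict enum) and the microsecond units are assumptions.

```c
#include <stdint.h>

enum verdict { TRUSTED, BAD_CHECKSUM, TOO_SLOW };

/*
 * Verifier-side decision for one challenge. All names are hypothetical.
 *   expected:   checksum the verifier computed over its own kernel copy
 *   reported:   checksum returned by the challenged host
 *   elapsed_us: response time the verifier measured
 *   limit_us:   time budget derived from the host's known CPU speed
 */
enum verdict judge_response(uint32_t expected, uint32_t reported,
                            uint64_t elapsed_us, uint64_t limit_us)
{
    if (elapsed_us > limit_us)
        return TOO_SLOW;        /* extra time counts as evidence of tampering */
    if (reported != expected)
        return BAD_CHECKSUM;
    return TRUSTED;
}
```

The whole scheme stands or falls on how tightly limit_us can be set, which is why the checksum has to saturate the hardware.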

The paper goes into much greater detail, but here are some of their responses, paraphrased, to obvious objections:

1. How do you know that the checksum code is checksumming the data it ought to be checksumming? The data is checksummed in a pseudo-random order, and the data pointers themselves are inputs to the checksum. If the kernel is attacked, and the attacker wishes to continue answering challenges, it will have to store the "old code" somewhere to be checked. Manipulating the data pointers within the checksumming code will cause a decrease in performance, which can be detected by the external measuring agent.

Sounds good enough, but I disagree with the assumption that the "old data" and "new data" need to live at different virtual addresses. Why not create a different address space in which to keep the "old kernel" for measurement purposes, while keeping around a primary execution address space containing a compromised kernel? An attacker taking this strategy would intercept (using some memory-invisible technique, like the hardware breakpoint facility) the measurement code to change the pagetable pointer to the innocent-seeming address space on entry and exit. Since the address space switch is outside of the main loop of the checksum code, the odds that the external verifier will notice the performance difference are slim, as page table assignments are lost in the noise when compared to a network I/O. I will be curious to hear the authors' thoughts on this apparent hole in their system. The objection seems so simple and straightforward that I suspect there's something I'm just not getting.
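To make the scheme in that first response concrete, here is a toy checksum that visits memory in a pseudo-random order and folds each data pointer into the running sum. It is emphatically not Pioneer's function: the LCG constants, the mixing step, and the name toy_checksum are my own stand-ins, chosen only to show the shape of the idea.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy checksum: reads bytes of mem (len must be a power of two) in an
 * order driven by an LCG seeded from the verifier's nonce, and mixes
 * the address of each byte it reads into the sum.
 */
uint32_t toy_checksum(const uint8_t *mem, size_t len, uint32_t nonce,
                      size_t iterations)
{
    uint32_t sum = nonce;
    uint32_t x = nonce;
    size_t i;

    for (i = 0; i < iterations; i++) {
        const uint8_t *p;

        x = x * 1664525u + 1013904223u;     /* LCG step */
        p = mem + ((x >> 8) & (len - 1));   /* pseudo-random location */
        sum ^= *p;                          /* the data is an input... */
        sum += (uint32_t)(uintptr_t)p;      /* ...and so is its address */
        sum = (sum << 7) | (sum >> 25);     /* rotate to spread the bits */
    }
    return sum;
}
```

Note that the addresses being mixed in are virtual addresses, which is exactly why a second address space holding the pristine kernel at the same virtual addresses looks to me like a way around it.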

2. What about VMMs? Here's where things get more interesting. Why doesn't running the guest kernel within a hostile VMM constitute a fool-proof attack method? The Pioneer folks consider this prospect in their paper:

"Since we assume a legacy computer system where the CPU does not have support for virtualization, the VM must be created using a software-based virtual machine monitor (VMM) such as VMware*. .... If the adversary tries to cheat by using a software VMM, then each read of the flags register will trap into the VMM or execute dynamically generated code, thereby increasing the adversary's checksum computation time."

...And since the system is very sensitive to changes in checksum execution time, the thinking goes, this attack will fail. They're assuming that writes to eflags.IF will "trap out" or "execute dynamically generated code", and that these will, without further argument, be slower than the native CLI/STI instructions. While this seems intuitively plausible, it's wrong, and I think it's wrong for interesting reasons.

The CLI and STI instructions on modern processors are quite slow, because they have very complex effects on the following instruction stream. For instance, there are OS'es whose idle loop only enables interrupts for a single instruction! All this interrupt stuff in general is exactly the sort of thing that gives today's deep and wide architectures fits, and STI and CLI times of more than 100 cycles are not unheard of on popular x86 implementations.

VMware uses multiple techniques for making forward progress in guest execution. When executing kernels, VMware often runs guest code via a purpose-built binary translator. It's an unusual binary translator, because, rather than translating between two unrelated architectures, it translates supervisor-mode x86 programs to user-mode x86 programs. This means that a whole lot of the dynamic runtime of our translator is spent in memcpy; after all, the vast majority of kernel instructions are the innocent loads, stores, branches, and ALU ops that don't require heavy intervention on the part of a VMM.

The Pioneer folks are right to guess that CLI and STI are not among those innocent instructions: we translate both of them into small user-level program fragments. These translations basically load the software flags register, perform a single ALU operation on it, clearing or setting the interrupt flag, and then store it back to memory. The interesting part of all this is that, since the translated code produced by VMware's VMM consists entirely of the bread-and-butter instructions that Intel and AMD go to great lengths to make fast, CLI and STI operations are much, much faster when executing in a VMware VM than on native hardware.

No. Really. Seriously.
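To give a feel for why, here is roughly what such a translation amounts to, written as C rather than as the x86 the translator emits. This is a sketch, not VMware's code: the struct vcpu layout and function names are invented, and it assumes that anything else a real monitor must do around these instructions (noticing newly deliverable interrupts, say) happens elsewhere.

```c
#include <stdint.h>

#define EFLAGS_IF (1u << 9)   /* interrupt-enable flag: bit 9 of EFLAGS */

/* Hypothetical per-VCPU state kept in ordinary memory by the monitor. */
struct vcpu {
    uint32_t eflags;          /* the "software flags register" */
};

/* What a translated guest CLI boils down to: load, AND, store. */
void translated_cli(struct vcpu *v)
{
    v->eflags &= ~EFLAGS_IF;
}

/* What a translated guest STI boils down to: load, OR, store. */
void translated_sti(struct vcpu *v)
{
    v->eflags |= EFLAGS_IF;
}
```

Nothing in there touches the real interrupt machinery of the processor, so none of the pipeline headaches described above apply.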

Don't believe me? Try it yourself. Here's a kernel module that executes a simple STI benchmark:


    int i;
    unsigned long oldjif = jiffies;

    /* Wait for the start of a new timer epoch. */
    while (oldjif == jiffies)
        ;

    /* Count how many STIs fit into one jiffy (1/HZ seconds). */
    for (oldjif = jiffies, i = 0; jiffies == oldjif; i++) {
        __asm__ __volatile__ ("sti" : :);
    }
    printk(KERN_ALERT "bogosti: %d hz %d\n", i, HZ);
    return 0;


On my 1.8GHz Pentium M laptop, running a Fedora Core 4 guest under the VMware Workstation 5.5 release candidate, our VMM executes about 200000 iterations in a single millisecond. That works out to around 200 million STIs per second, or something like 9 cycles per iteration. When you look at the code that GCC generated**, you realize that the VMware VMM is executing a STI, two ALU ops, a load, and a predictable branch in about 9 cycles. While I don't have a Linux host on which to test***, I'll go out on a limb and claim we're probably beating hardware by something like 5-10x on this microbenchmark.
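If you want to redo that arithmetic with your own numbers, it is a one-liner. The function name is made up; the only assumption is that one jiffy is 1/HZ seconds, so that the printed loop count is iterations per jiffy.

```c
/*
 * Cycles spent per loop iteration, given the CPU clock rate, the
 * iteration count the benchmark printed, and the kernel's HZ.
 */
double cycles_per_iteration(double cpu_hz, double iters_per_jiffy, double hz)
{
    return cpu_hz / (iters_per_jiffy * hz);
}
```

With the figures above, cycles_per_iteration(1.8e9, 200000, 1000) comes out to 9.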

Of course, this is a special case. Virtualization can't make all guest kernel code faster, by the old "you can't fit a gallon of milk in a pint bottle" counting argument. But, it's an interesting real-world demonstration of the dangers of making plausible assumptions about performance, especially when it comes to virtualization. Be careful! And if you're resting your security arguments on performance properties of a virtual machine, measure, rather than assert.

* Sic. The first sentence is probably referring to VT and Pacifica, the coming hardware virtualization technologies from Intel and AMD respectively. However, note that in both of those systems, VMs are still "created using a software-based VMM"; VT and Pacifica don't get rid of your software VMM, folks. If they work as planned, they make your VMM radically simpler than VMware's. But the VMM is still sitting there, a big chunk of software that can deceive the guest into thinking almost anything it wants. Note that I take no responsibility for this abuse of our trademark. VMware is a company that makes a VMM among many other things; VMware is not a VMM.


** The generated loop:

    41: fb                sti
    42: 83 c1 01          add    $0x1,%ecx
    45: a1 00 00 00 00    mov    0x0,%eax
    4a: 39 d0             cmp    %edx,%eax
    4c: 74 f3             je     41

*** Man, there goes all my UNIX street cred! It's true, it's true; I don't have enough free time to keep WIFI under Linux working. I'll repent, I swear it.

The Inimitable Stimulation of the Foreign

There's nothing quite as exciting as being in a foreign country. A thousand quotidian details of one's life (how showers work; the direction doorknobs turn; the side of the street on which one drives; the convention about which direction light switches flip) suddenly require conscious effort. The sharp relief thrown on all these details makes one realize just how peculiar any one place's customs are, and fills the mind with the possibilities of the ways things could be, but perhaps just haven't gotten around to being so yet. There's nothing like it.

Being in a foreign country where I actually speak the language is completely new to me. I must have almost killed myself five times this morning stepping off the curb after looking the wrong way. And yet, I can somewhat effortlessly communicate complex, even technical thoughts to the locals, albeit not without betraying my identity as a yank tourist (as if pulling out my camera every 30 seconds to shoot nothing in particular hadn't already given me away).

This morning was a beautifully crisp, see-your-breath-but-still-sweater-without-a-jacket autumn morning, which I whiled away taking in the sights, sounds, and smells of Brighton, the gorgeous seaside vacation destination for wealthy Londoners which is hosting SOSP 2005. My incredibly shallow research, and the incredibly tacky website linked above, had led me to expect a somewhat small town, but my morning walk around revealed a profoundly cosmopolitan retail economy, to the point of mild yuppification. Lots of world food restaurants, continental-style cafes, FCUK storefronts, etc. Before I really knew what had happened, I'd walked for about three hours in this beautiful burg.

I could complain about the usual travel headaches (e.g., the Brit who got off our plane after getting on in SFO, leading to a typically post-9/11 security freakout that saw us departing an hour late), but what's the point? I'm here now. So far, copious amounts of delicious espresso are keeping jet lag, and what ought to be a substantial Old Speckled Hen hangover, at bay. We'll see how long that lasts.

I'm off to wander the streets some more. I could do that whole "blogger" thing and tote this poor laptop to a coffee shop, but it seems like it just isn't done here. We'll see if I can scare up the courage to break this convention...

Friday, October 21, 2005

Cor Blimey, Guv!

I'm going to SOSP. I plan to be blogging the ever-living heck out of that mutha. I've never been to the UK before, so expect some amount of touristic, bewildered-Yank-abroad prose to boot.

VMware Makes VM "Player" App Free

My employer is going to give away a free application for running VMware virtual machines. Here's Slashdot's reaction, which seems broadly positive. I think if you're a user or developer of a minority OS, this has to be an exciting turn of events. VMware virtual machines are a much, much more attractive delivery vehicle for new OS'es than ISO images. If I'm some Swedish grad student with a hobby OS, I probably am hosting a web page with instructions like:
  1. Download this ISO image.
  2. Burn it to CD-ROM.
  3. Blow the mind of some poor computer sitting in the corner.
  4. Don't forget to back up! Oh, wait, is it too late for that?
Users are much more likely to give it a shot if they can just download and double-click a VM image: no rebooting, no burning, no fuss, no muss. I certainly would never have tried Ubuntu or Solaris 10 were it not for the hygienic nature of VM-based installs.

Monday, October 17, 2005

UPenn Course on Virtualization

Zachary Ives, E Christopher Lewis, and Milo Martin are teaching a survey course on machine virtualization. The paper collection works its way up from worthwhile introductory material (J. E. Smith and Ravi Nair) all the way to some true exotica (Phil Levis and David Culler: hi, Phil!).

When I started working at VMware in 2000, I was coming straight from undergraduate education and a brief stint as a research assistant. At that time, virtual machines were very much at the fringes of academic interest. Each year, with every passing conference, virtualization has become hotter and hotter. I admit, at this point, that virtualization has probably become a bit of a fad; people apply the term "virtualization," or "virtual machine," to software constructs that could more easily be construed as microkernels, or APIs, or libraries, or P-code, or what have you, because those ideas seem old and busted, while "virtualization" is the new hotness. Our time in the sun will pass, and some of those papers will, in time, look just as head-scratchingly weird as some of the farther-out exokernel papers from the late '90s. Still, it's been really gratifying to see the level of mainstream interest in an idea that's always fascinated me. I'd like to think that my work has, in some small way, contributed to this upswell in interest in virtualization. By showing that full-system virtualization could be practical and efficient, even in a hostile environment, I think VMware helped open this floodgate of interesting ideas.

Lest I let my head get too big: IBM pretty much did all this while my generation was but a gleam in our fathers' eyes. Credit where it's due...

Thursday, October 13, 2005

Linux NMIs on Intel 64-bit Hardware

Why are NMIs cool?

If you are running x86_64 Linux 2.6.x, grep for "NMI" in /proc/interrupts. This line exports a running tally of "non-maskable interrupts" on each CPU since system boot. Just what are these NMI thingies? What is Linux doing with them?

In the x86, "non-maskable interrupts" differ from regular old IRQs not so much in their maskability (they're pretty much maskable, just not by the same methods typically used for IRQs), but in their source (they are signalled to the CPU via a different line than IRQs) and semantics.

The architectural purpose for NMIs is to serve as a sort of "meta-interrupt;" they're interrupts that can interrupt interrupt handlers. This may sound ridiculous initially, but for a kernel developer, judicious use of NMIs makes it possible to port some of the luxuries of user-level development to the kernel. Consider, e.g., profiling. User-level apps typically use SIGPROF, which in turn is driven by the kernel's timer interrupt handler. But what if you're a kernel developer concerned with the performance of the timer interrupt handler itself?

NMIs provide one solution; by setting up periodic NMIs, and gathering execution samples in the NMI handler, you can peer into the performance of kernel critical sections that run with disabled interrupts. We've used this technique to good effect to study the performance of VMware's virtual machine monitor. The oprofile system-wide profiler on Linux leverages the same technique.

Another important application for NMIs is best-effort deadlock detection; an NMI "watchdog" that runs periodically and looks for signs of forward progress (e.g., those counts of interrupts in /proc/interrupts rolling forward) has a decent chance of detecting most "hard" kernel hangs. 9 times out of 10, an NMI handler that detects a wedged system can't do much of use for the user. The system will crash, and often do so just as hard as if there were no NMI handler present; however, perhaps it will dump some sort of kernel core file that can be recovered after the inevitable reboot to aid kernel engineers in diagnosing the problem post-mortem. Even something as simple as pretty-printing a register dump and stack-trace to the system console provides a world of improvement in debuggability over a mute, locked-up box.
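The progress check itself is about as simple as kernel code gets. What follows is a sketch of the idea only, not Linux's implementation; struct watchdog, watchdog_tick, and the STALL_LIMIT threshold are all hypothetical.

```c
#include <stdint.h>

#define STALL_LIMIT 5   /* hypothetical: NMI ticks without progress before we complain */

struct watchdog {
    uint64_t last_irq_count;  /* interrupt tally seen at the previous NMI */
    unsigned stalled_ticks;   /* consecutive NMIs with no change */
};

/*
 * Called from the (hypothetical) NMI handler with this CPU's current
 * interrupt tally. Returns 1 if the CPU looks wedged, 0 otherwise.
 */
int watchdog_tick(struct watchdog *w, uint64_t irq_count)
{
    if (irq_count != w->last_irq_count) {
        w->last_irq_count = irq_count;   /* forward progress: reset */
        w->stalled_ticks = 0;
        return 0;
    }
    return ++w->stalled_ticks >= STALL_LIMIT;
}
```

Because the check runs in NMI context, it still fires when the code that would normally bump the tally is stuck spinning with interrupts disabled, which is the whole point.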

It's this last application that gets Linux excited. On x86_64, the Linux kernel defaults to building with an NMI watchdog enabled. If you cat /proc/interrupts on a 32-bit x86 system, you'll see the NMI line with a total of zero (unless you've compiled your own kernel with NMIs enabled). So, if NMIs are so nifty, why do we use them for x86_64, and not plain old i386? Good question. I'm not sure why the two architectures are treated differently. Perhaps because x86_64 is a bit younger, and the Linux kernel folks are more concerned with being able to debug hangs? Or perhaps there are architecture-specific differences in other parts of the kernel that make the watchdog less appealing for i386. I don't know.

Too much of a good thing?

So, let's get back to that NMI line in your /proc/interrupts file. If you tap your fingers for a few seconds between inspections of this file, you'll notice the NMI total increasing. However, the rate at which it increases will be dependent on your underlying hardware. If you're running linux-x86_64 on AMD hardware, you'll notice those NMIs ticking up at about 1Hz. This is convenient for the intended purpose; once a second is plenty frequent to check for something as (hopefully) rare as a hard system lock-up.

Now, try the same experiment with an Intel EM64T machine. You'll notice that the NMI interrupts are coming in much, much faster. If you do the math, you'll find they're coming in at 1000Hz, exactly the same rate as the timer interrupts. What gives? And why does Linux want 1000 times more of them on EM64T hardware than on AMD64 hardware?

The answers are buried in nmi.c:nmi_watchdog_default; for AMD64, the kernel uses on-chip performance counters as a source of NMIs, while for all other CPUs (namely, EM64T parts), it uses the timer interrupt. After an initial calibration phase, Linux throttles back the AMD NMIs to a rate of 1Hz. On Intel hardware, however, some unusual jiggery-pokery takes place in the legacy PIC and local APICs, so that the very same timer interrupt signal trickles into the kernel via two different routes: once as a normal interrupt, via the IOAPIC and the local APIC's intr pin, and again via the LINT0 line into each local APIC as an NMI. Since the signal generating the NMI is the timer signal, there's little Linux can do but run the NMI interrupt at the same frequency as the timer interrupt.

This arrangement presents a couple of problems. From the point of view of NMI consumers like the aforementioned oprofile, this partially subverts the purpose of NMIs in the first place; by heavily correlating the NMI handler with the running of a particular chunk of kernel code (namely, the plain-jane timer interrupt handler), the distribution of kernel samples can be skewed. This could badly impact the effectiveness of profiling applications (the profile samples would tend to hit near the same place).

There are also performance consequences to this use of the hardware. 64-bit Linux on Intel hardware performs worse than it has to. How much worse? Let's assume a typical P4 needs 1000 cycles at minimum to take an NMI, and execute an IRET instruction to return from it. Then, of course, the software presumably has some work to do, taking at least another 2000 cycles. (Yes, I'm pulling these figures from thin air, but I consider them lower bounds, given that the data and code for the NMI handler are most likely cool in the cache.) So, we've used up 3000 cycles 1000 times every second; on your 3GHz modern processor, that's about 0.1% of the processor's performance dedicated to checking for deadlocks. That figure might not sound damning, but when you consider the blood that kernel folks sweat trying to wring fractional percentages out of a single path, 0.1% shaved right off the top, independent of Amdahl's Law, for just the price of a recompile, is an absolute dream.
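Since those cycle counts are admittedly pulled from thin air, here is the back-of-the-envelope calculation as a function, so you can plug in your own guesses. The name interrupt_overhead is made up for illustration.

```c
/*
 * Fraction of one CPU's cycles consumed by a periodic interrupt.
 */
double interrupt_overhead(double cycles_per_interrupt,
                          double interrupts_per_sec, double cpu_hz)
{
    return cycles_per_interrupt * interrupts_per_sec / cpu_hz;
}
```

With 3000 cycles per NMI, 1000 NMIs per second, and a 3GHz clock, this gives 0.001, i.e. the 0.1% quoted above; at the 1Hz rate used on AMD hardware the same handler would cost a thousandth of that.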

Where do I come in? Well, this NMI overhead is even more pronounced when running atop VMware's virtual machine monitor. The probe effect of NMIs is magnified inside a virtual machine, since we typically must emulate the vectoring of the NMI through the virtual IDT in software. But, what's worse, in SMP VMs, the hardware path Linux is using to deliver NMIs introduces bottlenecks. The PIC and LINT0 line, which Linux uses to deliver NMIs on Intel hardware, are (and are constrained by the architecture to be) system-wide global entities, shared by all virtual CPUs in the VM; to manipulate them 1000 times a second induces lots of synchronization-related overheads. (And no, armchair lock granularity second-guessers, there's not just a big "LINT0 lock" we're taking over and over again; it's a cute little lock-free algorithm, but at the end of the day, you can only get so cute before you do more harm than good.)

You multiply these effects together (too many NMIs * NMI overheads in a VM * overheads of delivering NMIs via cross-VCPU hardware in a VM) and you end up with a trivially measurable effect on performance when running Linux in a 64-bit VM. Unfortunately, those second two factors aren't going anywhere; NMIs will have a higher impact in a VM for the foreseeable future, as will using global interrupt hardware like the PIC and LINT0 line. In the long term, the only real improvements are Linux's to make. Linux could either make do with fewer NMIs, as it manages to do on AMD hardware, or deliver the NMIs via some processor-local hardware (again, like the performance-counter-based AMD implementation). Luckily, these changes will have real benefits for physical machines, too. It's just the Right Thing (TM). Too bad I don't have enough copious spare time to send Linus a patch; perhaps some of those oft-cited eyeballs have pairs of hands to go with them?

Update: Linux 2.6.12 fixed the NMI-storm-on-EM64T misbehavior chronicled here. Unfortunately, few distributions have picked up such a shiny, new kernel, so the weirdness documented above still affects the majority of users.

Wednesday, October 12, 2005

The Surprising Endurance of Self-Modifying Code

I recently had the pleasure of meeting a retrocomputing enthusiast, who has asked me to respect his anonymity. For personal amusement, I'm going to call him "Phil." Phil has done some work on software systems that run old video games -- Ms. PacMan, Asteroids, that sort of thing. While these software systems are often called "emulators", most of the good ones are a fair bit more sophisticated than the simple pseudocode that comes to mind when the term "emulator" is thrown around:
    for (;;) {
        unsigned char opcode = memory[CPU->programCounter];
        switch (opcode) {
        /* ... one case per opcode ... */
        }
    }
Many of these systems are binary translators of one form or another. They take your Ms. PacMan ROM image, and produce a translation of Ms. PacMan retargeted to run natively on your machine. Phil had designed and built such a system, and we talked about its internals at some length. After a while, it dawned on me that his system would not be able to handle self-modifying code, and I became convinced I was missing something. Surely, if you're running these incredibly hairy machine-language programs that rely on such intimate machine details as the exact cycle counts of individual instructions, you'd run into lots of self-modifying code. If any emulator in the world has to get self-modifying code right, it would be an emulator for old video games, right? Right??!?

But no, Phil confirmed that the system was completely broken in dealing with self-modifying code. Yet, his system had no problem running all sorts of old games.
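For reference, here is that naive interpreter loop fleshed out for a made-up three-opcode machine. It has nothing to do with any real arcade CPU or with Phil's system; every name and opcode below is hypothetical. The thing to notice is that it gets self-modifying code right for free, because it re-fetches each opcode from memory every time it executes one. A translator that caches its output gets no such guarantee unless it notices writes to code it has already translated.

```c
#include <stddef.h>
#include <stdint.h>

enum { OP_HALT = 0, OP_INC = 1, OP_STORE = 2 };  /* hypothetical toy ISA */

struct cpu {
    uint8_t acc;
    size_t programCounter;
};

/* Interpret until OP_HALT or the end of memory; returns the accumulator. */
uint8_t run(struct cpu *cpu, uint8_t *memory, size_t size)
{
    while (cpu->programCounter < size) {
        uint8_t opcode = memory[cpu->programCounter++];

        switch (opcode) {
        case OP_INC:                 /* acc = acc + 1 */
            cpu->acc++;
            break;
        case OP_STORE:               /* memory[operand] = acc */
            if (cpu->programCounter < size) {
                uint8_t addr = memory[cpu->programCounter++];
                if (addr < size)
                    memory[addr] = cpu->acc;
            }
            break;
        case OP_HALT:
        default:
            return cpu->acc;
        }
    }
    return cpu->acc;
}
```

Feed it the six-byte program { OP_INC, OP_STORE, 3, OP_HALT, OP_INC, OP_HALT } and the store overwrites the first OP_HALT with an OP_INC before the program counter reaches it, so the program runs on to the second halt.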

Why? In the lore of systems, self-modifying code is exactly the sort of bizarre space- and time-optimization that only makes sense in the semi-mythologically constrained environments of old computers. These games were extremely performance-critical, written in assembly language, under harrowing space constraints, often on 8-bit computers with a single general-purpose register. Yet they apparently didn't mutilate their program text by even a single byte.

After watching me squirm for a while, Phil let me off the hook. These are console games; since these systems would only ever run a single game, the code lived in ROM. It would have been prohibitively expensive to provide enough RAM to copy the code out of ROM, so self-modifying code was, ironically, a luxury unavailable to many old-time assembly language video game hackers, the very group with which most people associate self-modifying code.

Today, most developers regard self-modifying code as an occasionally interesting, but thankfully obsolete curiosity. After all, very little significant software is written in assembly anymore; even when it is, space-constraints are rarely what they used to be, and the performance argument would now go against self-modifying code, since it interferes with the instruction cache and pipeline on modern processors. Yet, if you peek under the hood of your running PC, today, in the year 2005, you'll find gobs of self-modifying code looking up at you. From dynamic linkers to JVM/CLRs, to various system instrumentation frameworks, to debuggers, to profilers, on and on ad infinitum, there's a whole heck of a lot of code getting rewritten in dribs and drabs on a modern system. So, whole-system monitors, like the one I work on, need to deal with self-modifying code correctly. In fact, code modification is so prevalent now that monitor engineers must worry not only about its correctness when running in a VM, but also its performance!

So, the next time you're bored around the coffee machine, bend some of your colleagues' minds by asking them which system is running more self-modifying code: a Z80 running Ms. PacMan, or their Windows XP laptop. As a rule of thumb, the more modern the system, the more self-modifying code you'll find.

Tuesday, October 11, 2005

Microsoft Changes Server Licensing in VMs

Microsoft plans to license Windows on a per-running-VM basis, rather than an absolute per-VM basis. On writing this, I'm not even sure what the latter means. When Callinicos contrasts the new licensing policy with the old, he says, "Instead of licensing every inactive or stored virtual instance of a Windows Server System product, customers can now create and store an unlimited number of instances, including those for back-up and recovery, and only pay for the maximum number of running instances at any given time." So, the old policy appears to be total madness: copy a virtual machine's disk file to a tape drive for backup, even if it happens automatically? Whoa, that's a brand new Windows installation! Better have a license for it!

While this change is probably for the better, not all Windows-in-a-VM customers will be dancing for joy. In the same way virtualization was inflating Windows licensing costs for some folks, for others it was depressing those costs. E.g., if you have a single read-only disk file, shared via a network mount and simultaneously running on N physical machines, my (completely ignorant, so don't take this to the bank or anything) understanding is that under the old licensing rules, you would have needed only a single license. Now, you'll be ponying up for N licenses. I'm not saying this to make Microsoft seem like bad guys; it's perfectly fair for them to expect N licenses, since the customer in this case is essentially getting N installations of Windows. But, most press coverage of this announcement seems to be assuming it's a godsend for customers running Windows in VMs; like so many other things in life, the answer to the question, "Is this a good thing?", is, "It depends."

Monday, October 10, 2005

More SystemTap Thoughts

I've read the SystemTap architecture paper. My initial reactions, which might be muddied by misunderstandings, misreadings, general ignorance, etc.:
  1. Every script execution invokes the GCC toolchain to produce a kernel module, then loads that module, executes for a time, and unloads the module. This makes implementation more tractable, because the SystemTap folks don't have to write a different back-end for every CPU, nor do they have to define a little bytecode VM for user-level to communicate with the kernel, as DTrace did. However, this compile-link-load cycle may take a user-perceptible chunk of time, especially if a script is invoked repeatedly from some other script. Worrying about this sort of incremental friction might seem like premature optimization, but when you rely for runtime performance on a big piece of software that is not optimized for runtime performance, such as the GCC compile/link cycle (which, after all, is optimized to produce fast binaries, not to produce binaries fast), you're throwing away a lot of flexibility right out of the gate.
  2. I'm not in love with the language. Like D, it uses C's expression syntax. However, unlike D, it doesn't use C's type syntax. This isn't just an inconvenience. D scripts can #include header files right out of a source tree, or the kernel, or wherever, and can use those types in a natural way. This can be very helpful when instrumenting a C application.
  3. On the other hand, when not instrumenting a C application, the architecture doesn't seem to anticipate external sources of probe points. They discuss being able to probe user-level applications, but only in terms of tracing specific program counters. This doesn't always make sense. If, e.g., the target application is a Scheme interpreter, the programmer will want to interact with his program's control flow in terms of source-level function entry/exit, rather than random program counters within the interpreter. While the core functionality of SystemTap can be extended via "tapsets," it sounds like these tapsets are stuck on the wrong side of the application being probed to do this sort of thing well. (I.e., instead of the Scheme interpreter publishing a semantic interface to its internals, the tapset has to contain enough knowledge about the Scheme interpreter to reverse engineer its current state.)
This last point, if I understand it right, is really unfortunate. It basically limits SystemTap's full powers to the kernel; applications can only be instrumented at the machine-language level. Some of the more powerfully convincing DTrace demos involve following an execution path all the way from application-level into the nittiest-grittiest kernel guts. The ability to telescope from the ethereally high level of a scripted language all the way down to the kernel grovelling around in the APIC and back again is one of the more exciting things about DTrace. I hope the SystemTap folks haven't given up on achieving this for Linux.
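To illustrate what I mean by an application "publishing a semantic interface," here is a minimal sketch. It is not SystemTap or DTrace code; the ProbeRegistry class, the probe names, and the toy evaluator are all hypothetical, invented for this example. The idea is just that the interpreter itself announces source-level function entry and return, so a tracing script never has to guess what a given program counter means.

```python
from collections import defaultdict


class ProbeRegistry:
    """Hypothetical registry of named probe points published by an application."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def enable(self, name, handler):
        """Attach a tracing handler to a named probe point."""
        self._handlers[name].append(handler)

    def fire(self, name, *args):
        """Called by the application; does nothing if no handler is enabled."""
        for handler in self._handlers.get(name, ()):
            handler(*args)


probes = ProbeRegistry()


def evaluate(expr, env):
    """Tiny evaluator for nested tuples such as ("add", 1, ("mul", 2, 3))."""
    if not isinstance(expr, tuple):
        return expr
    fn_name, *args = expr
    values = [evaluate(arg, env) for arg in args]
    # The interpreter publishes source-level events, not program counters.
    probes.fire("function-entry", fn_name, values)
    result = env[fn_name](*values)
    probes.fire("function-return", fn_name, result)
    return result


# The "script" side: subscribe to the published probes and record a trace.
trace = []
probes.enable("function-entry", lambda name, args: trace.append(("entry", name, args)))
probes.enable("function-return", lambda name, value: trace.append(("return", name, value)))

env = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
answer = evaluate(("add", 1, ("mul", 2, 3)), env)
```

The trace ends up phrased entirely in the interpreted program's own terms (entry to mul, return from mul, entry to add, return from add), which is exactly the view that is hard to reconstruct from outside by staring at addresses inside the interpreter.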

Or, of course, maybe I'm just not getting it. Perhaps some SystemTappers out there can set me straight?

SystemTap -- DTrace for Linux?

DTrace absolutely rocks. It is easily the most powerful general-purpose facility to come along in a long, long while. Skeptical? Give Adam's DTrace Bootcamp a quick glance. It's worth your time.

To badly summarize, DTrace makes it painless and safe to carry out surprisingly deep and wide experiments on a running system, from TLB miss code all the way up to Java method invocations. The improvement in system visibility that DTrace represents is comparable to the improvement of a source-level debugger over printf. I'm serious. If you don't believe me, give it a try. (Full disclosure: I went to school with DTrace's founding trio, but believe me, DTrace is so wig-flippingly great that I'd be just as effusive if I didn't know Adam from Adam.)

The only unfortunate thing about DTrace is that it is part of Solaris. Nothing against Solaris, mind you. Most of my colleagues regard me as a Solaris zealot, in fact. It's where I came of age as a programmer, and when I have the all too rare pleasure of using it, Solaris still feels like home (/usr/proc/bin!), even after five years of continuous Linux usage.

But let's not kid ourselves. Solaris is in trouble. Not technically; I still believe it's the gold standard for UNIX excellence in design and implementation, and I'll take the bait from any Linux zealot who'd like to argue this. No, Solaris is troubled because it has been losing users. In spite of its recent reincarnation as open-source software, people still perceive Solaris as Sun's house UNIX. And, for those who've been in North Korea for the last five years, Sun is not a fiscally healthy organism.

So, I was pleased today to learn that Red Hat, IBM, and Intel are doing the only sensible thing, namely ripping off DTrace with total abandon. More power to them; reimplementing good ideas from industry has a long tradition in OSS, and in Linux in particular. There are, of course, some differences between DTrace and SystemTap; I haven't gotten deeply enough into the available SystemTap documentation to say just what they are.

Godspeed you, SystemTap! I hope to be using something with all the convenience and power of DTrace on a viable operating system sooner rather than later! In the meantime, I'll fire up my Gentoo VM, pull down the CVS sources, and cross my fingers that I can get this all to work...


Greetings, imaginary audience! My name is Keith Adams. I'm an engineer in VMware's Virtual Machine Monitor group. I'll be writing here about the x86, operating systems, the hardware/software interface, virtualization, software development in general, etc.

I've been at VMware since 2000, when I graduated from Brown University's CS Department. Lately, I've been preoccupied with the long list of monitor features in VMware Workstation 5.5: 64-bit, support for Intel's VT, hosted SMP support, etc.