Pioneer and Virtualization
The very first paper in SOSP '05 is a fine example of why a deep understanding of virtualization is now a necessity for doing systems work. A number of folks from CMU (Seshadri, Luk, Shi, Perrig, and Khosla) and one from IBM (van Doorn) describe an enterprising system they name "Pioneer." Pioneer aims to do what AMD's SVM and Intel's LT presume is impossible: reliably measure and attest modern, mostly unmodified kernels. It's a really tough problem, and their approach is novel. However, on reading their paper, I'm convinced VMware's virtual machine monitor is an existence proof of the system's vulnerability. The paper considers this possibility, and rejects it, based on some unfounded assumptions about modern VMMs.
Pioneer's central code-tampering prevention technique is external performance measurement: an external entity challenges the checksum code to attest to the state of the kernel. The external entity not only issues the challenge, but, using very nitty-gritty knowledge about the speed and microimplementation of the challenged host, issues the challenge with a time limit. Executing for longer than the time limit constitutes evidence of tampering, leading the external entity to consider the machine compromised. A corollary of this design is that the checksum function must be micro-architecturally optimal for the hardware on which it is run. If it leaves any spare execution resources unutilized, an attacker might use those parallel execution resources to paper over the holes that he has punched.
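Here's a minimal sketch, in C, of how I read that verification loop. This is my own paraphrase, not Pioneer's code; the helper functions are stand-ins for whatever transport and clock the real system uses:

/* A minimal sketch of the timing-based verification idea -- not Pioneer's
 * actual protocol.  The verifier sends a random challenge, times the reply,
 * and treats anything slower than the bound precomputed for this exact host
 * as evidence of tampering. */
#include <stdint.h>
#include <stdbool.h>

uint64_t now_usec(void);                      /* monotonic clock, microseconds (stand-in) */
void     send_challenge(uint64_t nonce);      /* hand the nonce to the host (stand-in)    */
uint64_t receive_checksum(void);              /* block for the host's reply (stand-in)    */
uint64_t expected_checksum(uint64_t nonce);   /* verifier's own computation (stand-in)    */

bool verify_host(uint64_t nonce, uint64_t time_limit_usec)
{
    uint64_t start = now_usec();
    send_challenge(nonce);
    uint64_t reply = receive_checksum();
    uint64_t elapsed = now_usec() - start;

    /* Both the value and the time have to check out. */
    return reply == expected_checksum(nonce) && elapsed <= time_limit_usec;
}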
The paper goes into much greater detail, but here are some of their responses, paraphrased, to obvious objections:
1. How do you know that the checksum code is checksumming the data it ought to be checksumming? The data is checksummed in a pseudo-random order, and the data pointers themselves are inputs to the checksum. If the kernel is attacked, and the attacker wishes to continue answering challenges, it will have to store the "old code" somewhere to be checked. Manipulating the data pointers within the checksumming code would cause a decrease in performance, which the external measuring agent can detect.
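Roughly, the property they're describing looks something like this (my own sketch, not Pioneer's checksum function):

/* Memory is visited in a pseudo-random order derived from the challenge, and
 * each step mixes in the address as well as the byte it points to, so quietly
 * redirecting reads to a stashed copy of the old kernel changes either the
 * result or the running time. */
#include <stdint.h>
#include <stddef.h>

uint64_t measure(const uint8_t *base, size_t len, uint64_t nonce, int iters)
{
    uint64_t sum = nonce;
    uint64_t x   = nonce | 1;                /* seed the traversal from the challenge */

    for (int i = 0; i < iters; i++) {
        /* cheap linear-congruential step picks the next pseudo-random offset */
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        const uint8_t *p = base + (x % len);

        sum += (uint64_t)(uintptr_t)p;       /* the data pointer itself feeds the sum */
        sum ^= *p;                           /* ...as does the data it points to      */
        sum  = (sum << 7) | (sum >> 57);     /* rotate to keep mixing                 */
    }
    return sum;
}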
Sounds good enough, but I disagree with the assumption that the "old data" and "new data" need to live at different virtual addresses. Why not create a different address space in which to keep the "old kernel" for measurement purposes, while keeping around a primary execution address space containing a compromised kernel? An attacker taking this strategy would intercept the measurement code (using some memory-invisible technique, like the hardware breakpoint facility) and change the page table pointer to the innocent-seeming address space on entry and exit. Since the address space switch happens outside the main loop of the checksum code, the odds that the external verifier will notice the performance difference are slim; a couple of page table reloads are lost in the noise when compared to network I/O. I will be curious to hear the authors' thoughts on this apparent hole in their system. The objection seems so simple and straightforward that I suspect there's something I'm just not getting.
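To make the attack concrete, here's a sketch of the idea. It assumes the page-table bookkeeping already exists and omits the exception-dispatch glue; it's an illustration of the argument, not tested exploit code:

/* Assume the compromised kernel keeps two sets of page tables: clean_cr3 maps
 * a pristine copy of the kernel at the usual virtual addresses, dirty_cr3 maps
 * the kernel it actually runs.  Two hardware breakpoints (DR0/DR1) are armed
 * on the entry and exit of the measurement code, so the swap happens outside
 * its timed inner loop. */
unsigned long clean_cr3;         /* page tables exposing the unmodified kernel  */
unsigned long dirty_cr3;         /* page tables the compromised kernel runs on  */
unsigned long checksum_entry;    /* address armed in DR0                        */
unsigned long checksum_exit;     /* address armed in DR1                        */

static inline void load_cr3(unsigned long pa)
{
    __asm__ volatile ("mov %0, %%cr3" : : "r" (pa) : "memory");
}

/* Debug-exception (#DB) handler, called with the faulting instruction pointer. */
void debug_fault_handler(unsigned long fault_eip)
{
    if (fault_eip == checksum_entry)
        load_cr3(clean_cr3);     /* measure the innocent-looking address space */
    else if (fault_eip == checksum_exit)
        load_cr3(dirty_cr3);     /* go back to running the compromised kernel  */
}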
2. What about VMMs? Here's where things get more interesting. Why doesn't running the guest kernel within a hostile VMM constitute a fool-proof attack method? The Pioneer folks consider this prospect in their paper:
"Since we assume a legacy computer system where the CPU does not have support for virtualization, the VM must be created using a software-based virtual machine monitor (VMM) such as VMware*. .... If the adversary tries to cheat by using a software VMM, then each read of the flags register will trap into the VMM or execute dynamically generated code, thereby increasing the adversary's checksum computation time."
...And since the system is very sensitive to changes in checksum execution time, the thinking goes, this attack will fail. They're assuming that writes to eflags.IF will "trap out" or "execute dynamically generated code", and that these will, without further argument, be slower than the native CLI/STI instructions. While this seems intuitively plausible, it's wrong, and I think it's wrong for interesting reasons.
The CLI and STI instructions on modern processors are quite slow, because they have very complex effects on the following instruction stream. For instance, there are OSes whose idle loop only enables interrupts for a single instruction! All this interrupt machinery is exactly the sort of thing that gives today's deep and wide architectures fits, and STI and CLI times of more than 100 cycles are not unheard of on popular x86 implementations.
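The idle loop I have in mind looks roughly like this; because STI's effect is delayed by one instruction (the "interrupt shadow"), interrupts can only be delivered during the HLT that follows it:

/* Classic x86 idle pattern: interrupts become deliverable only for the HLT,
 * so the CPU can't take one between deciding to sleep and actually halting. */
static inline void idle_once(void)
{
    __asm__ volatile ("sti; hlt; cli" ::: "memory");
}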
VMware uses multiple techniques for making forward progress in guest execution. When executing kernels, VMware often runs guest code via a purpose-built binary translator. It's an unusual binary translator, because, rather than translating between two unrelated architectures, it translates supervisor-mode x86 programs to user-mode x86 programs. This means that a whole lot of the dynamic runtime of our translator is spent in memcpy; after all, the vast majority of kernel instructions are the innocent loads, stores, branches, and ALU ops that don't require heavy intervention on the part of a VMM.
The Pioneer folks are right to guess that CLI and STI are not among those innocent instructions: we translate both of them into small user-level program fragments. These translations basically load the software flags register, perform a single ALU operation on it to clear or set the interrupt flag, and then store it back to memory. The interesting part of all this is that, since the translated code produced by VMware's VMM consists entirely of the bread-and-butter instructions that Intel and AMD go to great lengths to make fast, CLI and STI operations are much, much faster when executing in a VMware VM than on native hardware.
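For illustration only -- this is not our actual translation, just the shape of the thing -- such a fragment boils down to something like:

/* A hedged sketch: a load, an ALU op, and a store on a software copy of the
 * guest's flags register.  The struct and names here are made up. */
#define EFLAGS_IF 0x200u                 /* the interrupt-enable bit of EFLAGS  */

struct vcpu_state {
    unsigned int eflags;                 /* software copy of the guest's EFLAGS */
    /* ... */
};

static inline void translated_sti(struct vcpu_state *v)
{
    unsigned int flags = v->eflags;      /* load the software flags register */
    flags |= EFLAGS_IF;                  /* set the virtual interrupt flag   */
    v->eflags = flags;                   /* store it back to memory          */
}

static inline void translated_cli(struct vcpu_state *v)
{
    unsigned int flags = v->eflags;
    flags &= ~EFLAGS_IF;                 /* clear the virtual interrupt flag */
    v->eflags = flags;
}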
No. Really. Seriously.
Don't believe me? Try it yourself. Here's a kernel module that executes a simple STI benchmark:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/jiffies.h>

int
init_module(void)
{
    int i;
    unsigned oldjif = jiffies;

    while (oldjif == jiffies)
        ; /* wait for start of new timer epoch */

    /* count STI executions during one full timer tick */
    for (oldjif = jiffies, i = 0; jiffies == oldjif; i++) {
        __asm__ ("sti" : :);
    }
    printk(KERN_ALERT "bogosti: %d hz %d\n", i, HZ);
    return 0;
}

void
cleanup_module(void)
{
}
On my 1.8GHz Pentium M laptop, running a Fedora Core 4 guest under the VMware Workstation 5.5 release candidate, our VMM executes about 200,000 iterations in a single millisecond. That works out to around 200 million STIs per second, or something like 9 cycles per iteration. When you look at the code that GCC generated**, you realize that the VMware VMM is executing an STI, two ALU ops, a load, and a predictable branch in about 9 cycles. While I don't have a Linux host on which to test***, I'll go out on a limb and claim we're probably beating hardware by something like 5-10x on this microbenchmark.
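If you want to check my arithmetic, it's just the clock rate divided by the iteration rate:

/* Back-of-the-envelope check of the numbers quoted above. */
#include <stdio.h>

int main(void)
{
    double cpu_hz   = 1.8e9;        /* 1.8GHz Pentium M                         */
    double iters_ms = 200000.0;     /* ~200,000 loop iterations per millisecond */
    double iters_s  = iters_ms * 1000.0;        /* ~2e8 STIs per second         */

    printf("~%.0f cycles per iteration\n", cpu_hz / iters_s);   /* prints ~9    */
    return 0;
}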
Of course, this is a special case. Virtualization can't make all guest kernel code faster, by the old "you can't fit a gallon of milk in a pint bottle" counting argument. But, it's an interesting real-world demonstration of the dangers of making plausible assumptions about performance, especially when it comes to virtualization. Be careful! And if you're resting your security arguments on performance properties of a virtual machine, measure, rather than assert.
* Sic. The first sentence is probably referring to VT and Pacifica, the coming hardware virtualization technologies from Intel and AMD respectively. However, note that in both of those systems, VMs are still "created using a software-based VMM"; VT and Pacifica don't get rid of your software VMM, folks. If they work as planned, they make your VMM radically simpler than VMware's. But the VMM is still sitting there, a big chunk of software that can deceive the guest into thinking almost anything it wants. Note that I take no responsibility for this abuse of our trademark. VMware is a company that makes a VMM among many other things; VMware is not a VMM.
** The inner loop, as compiled by GCC:
41: fb sti
42: 83 c1 01 add $0x1,%ecx
45: a1 00 00 00 00 mov 0x0,%eax
4a: 39 d0 cmp %edx,%eax
4c: 74 f3 je 41
*** Man, there goes all my UNIX street cred! It's true, it's true; I don't have enough free time to keep WIFI under Linux working. I'll repent, I swear it.
3 Comments:
I'll have to ask Leendert about this, but I think you're misunderstanding the checksumming algorithm. While it is true that CLI/STI may be faster in a VMM, those instructions are only executed during the initialization phase of the checksum.
EFLAGS is incorporated for the entire checksum (presumably by a pushf/popf, for which I would imagine VMware produces slower code) to ensure that interrupts are disabled. This can't be faked b/c of the state flags. Computing state flags with dynamic translation is hard and has a performance hit, so I'm pretty sure that it would end up covering whatever savings you would get from cli/sti.
The basic theory here is that you cannot do a series of computations faster than hardware can. You may be able to for cli/sti, but that's probably because the true cost of the virtual cli/sti is being amortized.
Note, this general approach is not totally novel. It's essentially challenge/response plus high resolution expirations (think kerberos with a crazy low timeout).
I think the weakest part is the uncertainty of the communications channel.
I thought I should be more clear about my point about pushf.
pushf/popf cannot be memcpy()'d b/c you would have to at least modify IOPL and IF appropriately. This means each generated sequence for pushf would have to do an actual pushf (to get the arithmetic flags) and then merge that with the virtual eflags. Assuming you just do an AND to mask out IOPL and IF (hey, why not, it is dynamic translation, right :-)), you still have a 1-cycle hit per pushf. Since the checksum is going to iterate a high number of times (well over 100), you're going to mask whatever speedup occurred by the cli/sti.
With SVM or VT, you can likely avoid this by not taking an exit on pushf/popf which you could use to defeat these timing checks. At least SVM allows you to leave IOPL at 0 and set a separate PIO bitmap.
Thanks, Anthony. Your reasoning seems spot on; the pushf translation is, indeed, a superset of the work that hardware pushf does. I also agree that appropriate hardware assistance for virtualization would enable a more compelling VMM attack.