Instruction Re-writing :: Function Re-routing

© Amit Singh. All Rights Reserved.Written in Early 2003

Consider the following common (well, maybe not so common) problem: You have written a loadable kernel module (kernel extension, in Mac OS X parlance). There is some function foo() in the kernel that you would like to re-route, that is, replace with your own version called my_foo(). What's more, you would like to be able to call the original foo() from my_foo().

There are several things to be observed here:

This kind of thing is really easy to do with VFSs. Traditionally, you deal with a collection of function pointers, and it is easy to stack such collections where a specific function bar() in each layer do many things: pass through to the layer below, pre- or post- process arguments/results, block the call from passing through and so on.

Things are a little harder when there are no function pointers. In the situation described earlier, foo() is a function at some address in the kernel. There may be an arbitrary number of callers that invoke this function. What follows is a scheme that allows you to replace foo() with my_foo() so that calls to foo() result in my_foo() being invoked. my_foo() can call foo() itself though. Consider the picture shown on the right.

Assume that foo() and my_foo() are present as shown, with their respective instructions being I0 I1 I2 I3 ... and I'0 I'1 I'2 I'3 ... respectively.

A caller of foo() eventually executes I0. If we were to re-route foo() so that my_foo() would be called instead, an obvious thing to do would be to replace I0 with an unconditional branch to my_foo(). That would effectively leave foo() out of the picture, and all calls to it would implicitly end up invoking my_foo(). This is fine if we do not wish to call the real foo() at all, but more often than not you do not wish to clobber foo(), but want to be able to call it, say from within my_foo(), as discussed earlier. A little more work is needed to achieve this.

We save I0, the clobbered instruction from foo() somewhere. Then we allocate a region of memory large enough to hold a few instructions and mark it as executable (via a call such as mprotect()). We call this region foo_stub. Note that it is not necessary to dynamically allocate this region of memory - a very convenient way to "pre-allocate" such a region would be to have an actual function (that does nothing) foo_stub() alongside my_foo(). Finally, we copy I0 to the beginning of foo_stub. The very next instruction would be an unconditional branch to foo + 4. Thereafter, the "original" foo() can be invoked via foo_stub().

Note that while the demonstration source code accessible from this page (for PowerPC/Mac OS X) does work, it should be considered a prototype. If you were to do such re-routing at production level, you would have to worry about various issues: invalidating the instruction cache, multi-processor complications, the maximum possible distance between foo(), my_foo() and foo_stub() etc. These issues will greatly depend on the specific processor you are working on (SPARC, PowerPC, x86 ...).

The "distance" issue is more severe on certain processors. A jump (unconditional branch) on MIPS uses 6 bits for the operand field and 26 bits for the address field. The effective addressable distance is 28 bits (4 times more) because MIPS refers to the number of words instead of bytes. This works because all instructions are 4 bytes long. 28 bits gives you 256 MB of leeway.

SPARC uses a 22 bit signed integer for branch address, but has two zero bits appended to it, effectively a 24 bit program counter relative reachability. This gives you 16 MB (+- 8MB) of leeway.

The unconditional branch in our target architecture (PowerPC) has a 24 bit address field, but as in MIPS, all instructions are 32 bits long, so PowerPC refers to words instead of bytes. The effective branch address (LI || 0b00) is 26 bits wide, which gives you 64 MB (+- 32 MB) as the maximum distance you can jump. Refer to the picture below for details.

In the provided sample code, we assume that our jump will be less than 32 MB in absolute value (no exception handling is attempted if this is not the case!). You are very likely to run into situations when this would not be the case: it could be that you are trying to re-route a Mac OS X kernel function that is too far from your version of that function. Though innovative solutions might be possible depending upon the instruction set/processor in question, but a straightforward solution would be to find and use an unused region of memory near that function.

Finally, please note that on a production Mac OS X system utilizing this technique, you would use something like dcbf and/or icbf to flush the cache(s). The sample code does not do that, but simply calls an arbitrary function (say, sync()) which effectively has the same result in our case.