Coding for stability

You know, it always kills me, the Linux vs MINIX debate, Linus vs Tanenbaum. Everyone loves Tanenbaum’s ideals, but no one can refute Linus’ pudding – Linux being orders of magnitude faster than MINIX. You’d think that the focus these days on coding for reliability, simplicity, elegance and maintainability – all at the expense of performance – would swing the argument in favour of the microkernel architecture. Yet it doesn’t. Sure, MacOS X uses a so-called microkernel architecture, although Apple are the first to admit that it’s really some kind of weird hybrid; they just couldn’t get the performance they needed out of a pure microkernel implementation.

But another reason perhaps is that the arguments in favour of microkernels are largely fluff. The advantages in stability and security oft-touted don’t go nearly as far as the proponents would like us to believe. The common line is:

“For example, each device driver runs as a separate user-mode process so a bug in a driver (by far the biggest source of bugs in any operating system), cannot bring down the entire OS. In fact, most of the time when a driver crashes it is automatically replaced without requiring any user intervention, without requiring rebooting, and without affecting running programs. These features, the tiny amount of kernel code, and other aspects greatly enhance system reliability.”

The MINIX 3 website

Now, let’s just hold our horses there. Sure, if an unused driver crashes, it can be reloaded and none will be the wiser. But that’s not how it works, is it? See, drivers have state. And when they crash, they lose that state. All the hyperbole in the world isn’t going to magically restore it. Even if you could, should you? Having the reloaded driver restored to the state it was in just before it crashed may not do any better the second time around – it may just crash exactly the same way again.

And what about everything else that’s using the dead driver? Well shit. I mean, we just don’t think about these issues properly. If I’m writing to disk via some file system driver, which crashes sometime during the write, I’m in trouble. Can the OS automagically restore the driver and continue or repeat the write without me knowing? I doubt it. So what happens? Well, I guess my program will get an error back from the write. But what if the write did actually go down to the physical media before the crash? Oh oh.

So we have all this journalling and so forth… but really, it’s a bad solution in the long term; journalling just adds more places where things can go wrong.

So now our driver has to be able to figure out what it’s already done. Maybe it can do that, sure. But does it? In today’s drivers? I doubt it.

You see, the focus on microkernels is really just taking a pocket knife to a wheat harvest. It’s not thinking on the appropriate scale. What microkernels critically provide is simple, defined and protected interfaces between modules. That’s all it is.

But, you see, the best place to do everything is at compile time, not run time. Errors at run time piss off your users, since they’re the ones who hit them. No, what we need are smarter compilers: compilers with defined limits on parameters, and more explicit type checking.

And yet I’m a big fan of Objective-C? Why is that? Can these two bipolar titans be married? I like to think so. See, let’s take an example to explain this simply. I have a function for computing natural logs, like so (in traditional C):

double loge(double x) {
  /* Perform magic arithmetic here */
  return result;
}

Now that’s not much good, really. What if someone tries to pass a negative value, or zero? Natural log isn’t defined (in the real domain; we are working with doubles here) for negative values, and loge(0) only goes to negative infinity as a limit – it has no actual value. So, sure, the standard way of going about this is to either use algorithms which handle this implicitly, or more generically to do parameter checking:

double loge(double x) {
  if (0.0 >= x) {
    // Barf!!!
  } else {
    /* Perform magic arithmetic here */
    return result;
  }
}

Now, here’s the problem… how do we “barf”? Do we return a symbolic value that represents NaN (Not a Number)? Perhaps we just return 0? Perhaps we raise an exception (if we’re using C++)? Perhaps we call exit()? Perhaps we just while (1) {} to annoy the caller?
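
For what it’s worth, the standard C library’s log() takes the errno route: a negative argument is a domain error, which typically means errno gets set to EDOM and a NaN comes back. Here’s a sketch of loge() doing the same thing – the specific choices here are my own assumptions, not something any particular libm promises:

#include <errno.h>
#include <math.h>

double loge(double x) {
  if (0.0 >= x) {
    errno = EDOM;  /* flag a domain error, roughly as libm does for log(-1.0) */
    return NAN;    /* hand back a quiet NaN and hope the caller bothers to check */
  }
  /* Perform magic arithmetic here */
  return log(x);   /* placeholder: defer to libm for the actual arithmetic */
}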

Is the compiler smart enough to realise that we’re using a bad value when we invoke the function? Nope. Even if I use assert() or some similar standard procedure, the compiler won’t even issue a warning on the parameter. Compilers just don’t do range or domain checking (at least, gcc doesn’t). It would be a good start if they did.
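
To make that concrete, something like the following sails straight through gcc -Wall -Wextra without so much as a peep – at least with the gcc versions I’ve used; I’m not claiming it’s true of every compiler. The obviously-bad constant is only caught when the assert fires at run time:

#include <assert.h>
#include <math.h>
#include <stdio.h>

double loge(double x) {
  assert(x > 0.0);  /* run-time check only; gives the compiler nothing to chew on */
  /* Perform magic arithmetic here */
  return log(x);
}

int main(void) {
  /* No warning here about the out-of-domain argument; the program just
     aborts at run time when the assert trips (assuming NDEBUG isn't defined). */
  printf("%f\n", loge(-1.0));
  return 0;
}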

So what we need are stricter types – we need definitions of domains and conditions. Imagine, if you will, that we could do this:

double loge(double x) where (0.0 < x) {
  /* Perform magic arithmetic here */
  return result;
}

In this mythical language – let’s call it ++C – the compiler would issue an error if we tried to violate the explicit condition we’ve provided. Thus it would be impossible, given a suitably intelligent compiler, to call this function with an inappropriate parameter.

Of course, people will say “but how is the compiler to know what values we’re going to use if we’re taking them from the user, for example?”. “Output one, accept any”: sources such as files and users could have any value, of course, so the compiler should assume any possible value. Thus, if we wanted to take our value “input” and pass it to loge, we’d have to do our own explicit range checking in the caller.

So, we’ve just shifted the problem to a different place, right? Well, yes and no. Basically yes, which is a good thing in its own right. Now the caller – which is going to know more about the data than the callee – must make the decisions about how to handle invalid data. Excellent. We move input validation right to the very top of our program, where it should be. This is the way it should always be done anyway – it keeps your core code simpler, leaner and faster, since it doesn’t have to perform redundant parameter checking. In reality, because there is no compiler enforcement of parameter limitations like this, we end up with a lot of redundancy as checks are put in at multiple layers, just to be safe. Even then, we still miss things.
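
In loge() terms, that means the caller validates before it calls – something like this sketch, where input_from_user() is a made-up stand-in for wherever the value really comes from:

#include <stdio.h>

double loge(double x);         /* the parameter-checked version from earlier */
double input_from_user(void);  /* hypothetical: reads a double from somewhere */

int main(void) {
  double value = input_from_user();

  /* Validation lives up here, at the edge of the program, where we actually
     know what the data means and what a sensible response to garbage is. */
  if (value <= 0.0) {
    fprintf(stderr, "need a positive value, got %g\n", value);
    return 1;
  }

  printf("ln(%g) = %g\n", value, loge(value));
  return 0;
}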

This could even be integrated into existing compilers in a completely backwards compatible way, by having some kind of error marker which the compiler could use to determine which code branches should never intentionally be taken. For example, the compiler could see the following:

double loge(double x) {
  if (0.0 >= x) {
    INVALID_PARAMETER(x);
    // Barf!!!
  } else {
    /* Perform magic arithmetic here */
    return result;
  }
}

It could perform the same kind of analysis as talked about previously, but instead of an explicit addition to the language syntax, it just looks for this INVALID_PARAMETER macro invocation. If its analysis indicates it’s possible to reach this point in the code, it can issue an error – or at the very least a warning. This can be tied in with dead code stripping as well; if the compiler can prove at compile time that no INVALID_PARAMETERs are reached, it can remove all the code in the same scope as the invocation. Fantastic!
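
As it happens, you can fake a crude version of this with gcc today, borrowing the trick the Linux kernel uses for its compile-time assertions: declare a never-defined function carrying __attribute__((error(...))) and only call it from the invalid branch. If, after inlining and constant folding, gcc can’t prove the call is dead, the build fails. This is only a sketch of the idea under those assumptions – it needs optimisation turned on, the function inlined, and a constant-foldable argument, so it catches nothing the optimiser can’t already see:

#include <math.h>

/* A crude stand-in for INVALID_PARAMETER using gcc extensions – a sketch,
   not anyone's shipping code. */
extern void invalid_parameter_detected(void)
  __attribute__((error("parameter is outside its documented domain")));

#define INVALID_PARAMETER(x)                          \
  do {                                                \
    if (__builtin_constant_p(x))                      \
      invalid_parameter_detected();                   \
  } while (0)

static inline double loge(double x) {
  if (0.0 >= x) {
    INVALID_PARAMETER(x);  /* compile-time error for provably bad constants */
    return NAN;            /* run-time fallback for values gcc can't see through */
  }
  /* Perform magic arithmetic here */
  return log(x);
}

With this in place, an optimised build of loge(-1.0) fails at compile time, while loge(some_runtime_value) quietly falls through to the NaN path.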

I should add, there’s a language called D which I believe tries to adopt something like this with its contract programming support. I’ve yet to encounter a D compiler – although I haven’t looked – but until such features make it into gcc, they’re not going to get the large-scale support they need.

So, with this advanced compile-time analysis, we get the benefits of microkernels – and more – without the runtime performance costs. We can all follow the Linux standards and code like psychotic schizophrenics, and still get the safety we need.

Well, okay, so there’s a lot of other stability and security issues beyond just this, but I think it alleviates a major fraction of the problems. Throw in more intelligent compiler behaviour regarding pointers, arrays, etc, and we’ll be set.
