I blame the complexity of web browsers, their security issues, and the fact that web development is a mess on the operating system. Yes, that's right, I blame the OS.

What's wrong with our OSs?

The concept of "User Space" vs "Kernel Space" differs little from "User Space" vs whatever you call the box your web apps run in. And what is a web app anyway? Well, you sure don't want to run it in user space, but it is an application containing executable code. You could make a user for each domain to run its JavaScript as, and that would be about what's desired (and is approximated by JavaScript VMs in some respects), but that requires the browser to control lots of users. Browsers can spin up a bunch of processes to isolate them, but processes don't really come with the full suite of permission controls in any easy-to-use manner.

What I want

What I want is a simple permission tree of processes. Just like the OS exposes a "small" set of methods to the next level via syscalls, I'd like to have each process expose a set of methods to its children. This can be done with a special syscall that is indirected via a parent process specific table. We shall call these parent-calls.

In this design, when a process is created, it is associated with a pointer into the memory space of its parent process, where the handler for the parent-calls lives. At this location there is usually a bounds check on some register (containing the call index), then a jump into the corresponding jump table, but there could be additional checking, such as a per-child-process mask if there need to be children with different permissions. Also, depending on the thread safety of the parent, such parent-calls may block and be handled by an event handler thread. It's all up to the parent process, and thus not part of the OS!
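The handler described above can be modeled in a few lines. This is only a sketch in ordinary Python, not kernel code: `ParentHandler`, `register_child`, `handle`, and the two example calls are hypothetical names invented for illustration, and the per-child mask is assumed to be a bitmask with one bit per call index.

```python
from typing import Callable, Dict, List


class ParentCallError(Exception):
    """Raised when a child uses a call index it is not permitted to use."""


class ParentHandler:
    """Hypothetical model of the handler a parent registers for its children."""

    def __init__(self, table: List[Callable[..., object]]) -> None:
        self.table = table               # jump table: call index -> function
        self.masks: Dict[int, int] = {}  # child id -> bitmask of allowed indices

    def register_child(self, child_id: int, mask: int) -> None:
        self.masks[child_id] = mask

    def handle(self, child_id: int, index: int, *args: object) -> object:
        # Bounds check on the call index (the "register" in the text).
        if not 0 <= index < len(self.table):
            raise ParentCallError(f"call index {index} out of range")
        # Optional extra check: per-child permission mask.
        # A child with no registered mask is allowed nothing.
        if not (self.masks.get(child_id, 0) >> index) & 1:
            raise ParentCallError(f"child {child_id} may not use call {index}")
        # Jump through the table.
        return self.table[index](*args)


# Two made-up calls: index 0 is harmless, index 1 touches files.
handler = ParentHandler([lambda: "pong", lambda path: f"opened {path}"])
handler.register_child(1, 0b01)  # child 1 may only use call 0
handler.register_child(2, 0b11)  # child 2 may use calls 0 and 1
```

Here child 2 can make both calls, while child 1 gets a `ParentCallError` if it tries call 1. Everything beyond the bounds check is policy chosen by the parent, which is the point of keeping it out of the OS.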

So far our design has introduced one extra syscall. Now, for security, we remove all other syscalls, and use the parent-call system to let direct children of the kernel make what would normally be syscalls as parent-calls (why have two ways to do things?). Really the only difference this makes is that non-direct children of the kernel can't call any syscalls at all. They have no access to any shared data, connections or such (no networking, no file system) unless their parents explicitly expose it.

There, perfect sandboxing, by default, for all processes. No need for VMs, managed code, or any of that stuff. It works recursively too: if a process is permitted to create more processes, each of its children can at most access what its parent can (but by default can access nothing, which means they can't even create processes or touch the file system or network). Perfect.
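The recursive property can be checked with a toy model. This is a sketch under assumptions: `Process`, `spawn`, `can_reach_kernel`, and the call names are all hypothetical, and each process is reduced to the set of calls it chooses to offer its children. It only covers calls that need a kernel service; a parent could also implement a call entirely by itself.

```python
class Process:
    """Toy process: knows its parent and which calls it offers its children."""

    def __init__(self, parent=None, exposed=()):
        self.parent = parent
        self.exposed = set(exposed)  # default: offer nothing

    def spawn(self, exposed=()):
        return Process(parent=self, exposed=exposed)

    def can_reach_kernel(self, call):
        """True if every ancestor on the way up offers `call` to its child."""
        node = self
        while node.parent is not None:
            if call not in node.parent.exposed:
                return False
            node = node.parent
        return True


# The kernel is the root; it offers the former syscalls to its direct children.
kernel = Process(exposed={"open", "socket", "spawn"})
browser = kernel.spawn(exposed={"open"})  # passes only "open" on to its children
tab = browser.spawn()                     # offers nothing to its own children
plugin = tab.spawn()
```

In this model `browser` can reach `socket`, `tab` can reach only `open`, and `plugin` can reach nothing at all, because `tab` was created with the default empty set.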

It should be easy to implement, and possible to make work with existing systems (just add a per-process mode bit that makes processes work this way, and handle it in the syscall handler).

To make the mapping of which ID goes to which call changeable, a parent process could preload a symbol table for them into the child process's memory.


The problem? It would be impractically slow, but the concept is good. Can it be sped up, through implementation tricks, without changing the behavior? Yes.

First, why it's so slow: syscalls are slow. If I want to open a file in a modern operating system, that's one syscall. With my design, that's one syscall per parent between the process and the kernel (including the kernel).

So the trick to speed it up is to allow calls that should propagate up multiple levels to do so without leaving the kernel (and thus without incurring more syscall overheads).

To do this, each process gets a table of calls that are directly forwarded to its parent. If the call ID is outside the table, the kernel calls the process's handler in user space to handle it as needed. Thus, no functionality is lost (just request a table of size 0 to get the old behavior), but some is gained (calls that simply need forwarding to the parent never enter the process's user space). If two processes have the same contents in their tables, the tables can be shared, and they should be pretty small.
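A sketch of how the kernel might resolve a call using these tables. The layout is an assumption made for illustration: each table is a tuple indexed by the child's call ID whose entry is the call ID to use at the next level up, and `Proc`, `resolve`, and the concrete IDs are hypothetical.

```python
class Proc:
    """Toy process with a forwarding table for calls made by its children."""

    def __init__(self, name, parent=None, forward=()):
        self.name = name
        self.parent = parent
        # forward[i] is the call ID to use in this process's own parent when
        # one of its children makes call i. Tuples are immutable, so identical
        # tables can be shared between processes.
        self.forward = forward


def resolve(caller, call_id):
    """Walk up the tree in the kernel until some process must handle the call.

    Returns (handling process, call ID as seen by that process).
    """
    target = caller.parent
    # The root (kernel) has no parent and handles whatever reaches it.
    while target.parent is not None and call_id < len(target.forward):
        call_id = target.forward[call_id]
        target = target.parent
    return target, call_id


kernel = Proc("kernel")
browser = Proc("browser", kernel, forward=(3,))  # child call 0 -> kernel call 3
tab = Proc("tab", browser, forward=(0,))         # child call 0 -> browser call 0
plugin = Proc("plugin", tab)                     # empty table: the old behavior
```

With this setup, `resolve(plugin, 0)` ends at the kernel with call ID 3 without entering the user space of `tab` or `browser`, while `resolve(plugin, 1)` stops at `tab`, whose user-space handler decides what to do.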

This means that the cost of common operations, like opening a file, is still higher than in most current operating systems, but the difference is just a few indirections, one for each parent process up to the kernel. Caching could help if that is really needed, but it should not be a significant issue.

What is gained

A child process can easily be created with its entire access to the world outside its memory constrained to a set of RPC calls it can make to its parent (and optionally its parent's parent…). Now that's a solid sandbox.

With some libraries it would be possible to load a DLL (or several) into a child process, and link across the boundary with parent-calls. Just generate and link a shim on either side. Then you can, simply by using a different DLL loader, sandbox random chunks of code and only expose a whitelisted set of functions. If you get the child process to share a data segment of memory with the parent, you could still pass things by reference as long as you allocated them in the shared area.
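The shim idea can be illustrated without any real process boundary. This is a sketch only: plain Python objects cannot sandbox anything, and `make_shims`, `parent_call`, and the example library are hypothetical. It shows the shape of the two generated halves, with a symbol table mapping names to call IDs on the child side and a jump table on the parent side.

```python
from types import SimpleNamespace


def make_shims(library, whitelist):
    """Build child-side stubs and a parent-side entry point for `whitelist`.

    `library` maps function names to callables living in the parent.
    """
    names = sorted(whitelist)
    table = [library[name] for name in names]              # parent: ID -> function
    symbols = {name: i for i, name in enumerate(names)}    # child: name -> ID

    def parent_call(call_id, *args):
        # Stand-in for the real parent-call, with the usual bounds check.
        if not 0 <= call_id < len(table):
            raise PermissionError(f"call {call_id} is not exposed")
        return table[call_id](*args)

    # Child-side stubs: each one only knows its call ID, never the function.
    stubs = SimpleNamespace(**{
        name: (lambda *args, _id=call_id: parent_call(_id, *args))
        for name, call_id in symbols.items()
    })
    return stubs, parent_call


library = {
    "add": lambda a, b: a + b,
    "delete_everything": lambda: "boom",  # must never be reachable
}
stubs, parent_call = make_shims(library, {"add"})
```

Code holding only `stubs` can call `stubs.add(2, 3)`, but there is no `stubs.delete_everything`, and guessing a call ID outside the table raises `PermissionError`.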

That's completely restricted, but fully arbitrary, native code running at the full speed of a regular process, with a simple mechanism for calling its parent/owner.

Web browsers

Browser plugins can be run in a process that only has access to some very restricted things, and can run subprocesses that are even more restricted.

Chromium tries to do something like this, but given the current OS support and design practices, it looks like a challenge, and is not as simple, customizable, or elegant as it should be.

Some Omitted Details

To call across the child->parent boundary, a mechanism is needed to interrupt the parent (the kernel can't simply jump to userspace: that would destroy all security!) and pass in some info. Unix signals (or something very similar) would handle this well.

Things like this

Genode looks like a fantastic implementation of everything I discussed here, including a web browser with nice secure but very capable plugins (including Linux as a browser plugin!).
Qubes OS looks like it does something related to this. I'm not sure how it works, but it looks interesting.

Copyright © 2011-2013 Craig Macomber