Baby Steps

A blog about programming, and tiny ways to improve it.

Syntax Matters…?

For a long time, it was considered fairly obvious, I think, that syntax didn’t really matter. It was just the surface skin over the underlying ideas. In recent times, though, the prevailing wisdom has reversed, and it is now quite common to hear people talk about how “syntax matters”.

While I don’t exactly disagree, I think that the importance of trivial syntactic matters is generally overemphasized. It is not a matter of life and death whether or not semicolons are required to end a line, for example, or whether parentheses are required in making a call.

Naturally, like all programmers, I have strong opinions on these topics myself—or at least I used to. But I’ve found over time that one gets used to these matters quickly enough, for the most part. But I think there is a deeper sense in which syntax can matter.

Basically, there are some languages whose syntax is so distinctive that it makes a qualitative difference to the experience of programming. In this case, having a different syntax can enable something otherwise very challenging—or sometimes make simple things extremely difficult. Three examples come immediately to mind, I’m sure there are more.

Lisp

Lisp (and its derivatives: scheme, clojure, etc) is a common example. The Lisp family of languages is fairly unique in that the syntax of programs is also the same syntax of the data structures in the language (oddly, XSLT is the only other example I can think of; but I’m sure there are more). This is sometimes, and somewhat grandiously, referred to as the “homoiconic” property.

Homoiconicity makes it possible to have a very simple macro system which can seamlessly integrate with the language. This is simply very, very challenging to do with a more traditional C-like syntax. So, in this case, syntax really matters.

Smalltalk

Most languages pass the parameters to a method using a position notation. This may be written with parentheses (foo(a, b, c)) or without (foo a b c) but the idea is basically the same. Smalltalk took a different approach. In Smalltalk, each parameter is labeled, and the name of the method as a whole is the concatenation of all of these labels. So you don’t write foo.open("abc", true, false) but rather foo open:"abc" read:true write:false. This may seem like a small change, but it is not. It has far-reaching consequences; consequences which I think are not fully appreciated. For example, it is no accident that Smalltalk pioneered most of the powerful refactorings we associate with Java and other statically typed languages today—method names in Smalltalk are long and generally unique, so you don’t need full type information for the compiler to reliably trace them.

Another effect of this convention is to make certain classes of errors impossible. For example, one simply cannot provide the wrong number of parameters to a method (the method name would not match). Similarly, it is obvious what each parameter means and in what order they should go. With a call like foo.open("abc", true, false), the reader has no idea what true and false signify. When the call looks like this, foo.open("abc", write, read), the reader thinks they know what write and read signify, but without seeing the source of the method, they can’t know for sure. In fact, here, I reversed the order. This kind of error is surprisingly common, as an old colleague of mine described in a paper. But this error is unthinkable in Smalltalk, as you would have to write foo openFile:"abc" read:write write:read, making it quite clear that something was amiss.

Fortress

I had the good fortune of talking to some of the guys on the Fortress team a while back. Fortress is home to some very interesting ideas—parallel evaluation by default, for example!—and one of them is that mathematical programs ought to be written in mathematical notation, or something very close to it. I can see that this is a very appealing notion for mathematicians and physicists, as it will help to lower the impedance barrier between the program and the theory it models. But it is also interesting for the developers of Fortress, since mathematical notation is incredibly overloaded—meaning that they are developing all manner of interesting new dynamic overloading resolution techniques to make this whole thing work. So in this case, the syntax is a mixed bag: it makes some things easier (translating math) and some things harder (defining the language semantics).

The siren call of pretty syntax

So yes, I do think syntax can matter. But most of the time it doesn’t. This is not to say I am immune to the appeal of pretty syntax (I’m as guilty as everyone else), nor that prettiness doesn’t matter at all. But mostly it’s a matter of familiarity. Like anything else, a new language will look a bit different, and you have to get used to it. (Even Objective C looks pretty good to me now, for crying out loud!) Sometimes, though, things still seem hard to read even after you’ve been hacking in the language for a while: these are things that need changing.

It is important, however, to distinguish between syntax and expressiveness. I don’t care (too much) whether you write function(x) { ... }, \x -> ... or { |x| ... } to denote a closure, but there had better be a way to write a closure somehow! (Java, I’m looking at you here)

It makes me a bit sad that there is so much focus these days on the surface side of syntax—making indentation significant, omitting a semicolon—but rather little on how a change in syntax can actually change the experience of programming in a deeper way.

Aside #1: It is too bad that the genuine advantages of Lisp and Smalltalk syntax do not seem to have been sufficient to win over the familiarity of a generally C-like look-and-feel.

Aside #2: In case you can’t tell, I’m partially responding to the fact that every time somebody posts a link about Rust, somebody else makes some comment about the length of our keywords. My personal favorite is this one, which seems to imply that we Rust developers are involved in some kind of conspiracy—as if we prefer endlessly defending our choice of ret over return rather than, say, our choice of sendable unique pointers over shared memory. Please.

Aside #3: To be clear, I don’t think Rust’s syntax is anything revolutionary. It is basically a C derivative, like so many languages these days.

DOA: Don’t Overabstract

I’d like to propose a term for code that has been “over-DRY’d” (dessicated?). I occasionally run across some method which just seems horribly complex. Reading it closer, it usually turns out that what happened is that two or three independent operations got collected into one subroutine. Perhaps they started out as doing almost the same thing—but before long, they diverged, and now the subroutine has grown a hundred parameters and has a control-flow path that requires a whiteboard and a ultra-super-fine-point marker to follow. But, just as often, you can tear this routine apart into two or three routines that read just fine, even if they share a line or two of code in common. So I’m going to start calling such routines “DOA”, though the acronym has a bit of a different expansion when used as an adjective.

Declared vs Duckish Typing

One of the questions in our object system is what precisely how “declared” we want things to be when it comes to interfaces and implementations. In a discussion on IRC, graydon suggested it’d be nice to have terms like “duck-typing” defined more precisely in a Rust syntax, and he is correct. So here is my effort.

The current setup

Currently, implementations must declare precisely what types they implement. For example, it looks like this:

impl of draw for T {
    ...
}

where draw is an interface. Then, later, if we have an instance of type S and we wish to know whether it implementations the interface draw, we can scan through the set of implementations that are declared to implement draw and see if any of them are for the type S.

A more duck-typing like setup

Another option would be to remove the requirement that an impl declares what interfaces it implements. In that case, when we have a need to know if the type S implements the iface draw, we would again scan all of the implementations in scope for the type S. For each one, we would check whether it contains all the methods defined in draw. If so, we declare to be an implementation of the iface (we must also check that the methods contain the right types; it’s unclear to me whether we should do this check before or after deciding that it is an implementation, though).

Why duck typing?

It’s more convenient. There is also, currently, no good way to create an “after the fact” interface: support I have a bunch of types that all already have a draw() method and a bounds() method defined, and I’d like to make an iface like:

iface draw_and_bounds {
    fn draw();
    fn bounds();
}

and then just use it. Now everything just works. In the more statically declared world, I would then have to go over each type and do something like:

impl of draw_and_bounds for S { }
impl of draw_and_bounds for T { }

These impls just serve to declare that the type S (and T) implements the iface draw_and_bounds (and needs no additional methods to do so). Actually, this wouldn’t work today at all, because we don’t check for existing methods when deciding, so you’d really have to do something like:

impl of draw_and_bounds for S {
    fn draw() { self.some_other_impl::draw() }
    fn bounds() { self.some_other_impl::bounds() }
}

But of course the some_other_impl::draw syntax for naming a method isn’t implemented, so you’d have to do something like:

fn my_draw(self: S) { import some_other_impl; self.draw(); }
fn my_bounds(self: S) { import some_other_impl; self.draw(); }
impl of draw_and_bounds for S {
    fn draw() { my_draw(self) }
    fn bounds() { my_bounds(self) }
}

But we could fix that by implementing features.

Why not?

Just because methods with the right names are available doesn’t mean that they will do what you expect. Maybe you mean draw() as in “draw your gun” not “draw yourself on the screen”. It also prevents ‘marker interfaces’, like Java’s serializable.

Non-obvious implications and small design decisions

Simplicity and compilation time

One of the arguments for a non-duck-typing scenario is that it makes the system easier to implement. We can generate the vtable at the point of impl declaration and then refer to it from other places, rather than having to generate the vtable lazilly as needed.

It seems to me that it would affect compilation time. It’s bound to be faster to check compliance with the iface once, at the impl, then at each point of invocation. However, we can cache these results, so that’s probably not a big deal.

Frankenstein impls

A big open question (to me) is whether we should consider an interface to be implemented if all the necessary methods are available but they come from different sources. For example, consider something like:

impl draw for T { fn draw() { ... } }
impl bounds for T { fn bounds() { ... } }

Now, in a duck-typing world, is the draw_and_bounds iface implemented or not? It seems to involve a similar set of tradeoffs. If the answer is that they are not implemented, we need to write something explicit like impl draw_and_bounds for T { ... } just as we had to do when not using duck typing at all.

Still, I think that we should disallow such “frankenstein” impls. The main reason is that it makes instance coherence just about impossible to address (more on that in a later post, but in short form it prevents us from concisely naming the origin of the iface methods). It also makes the compiler more complex and heightens the danger of matching methods with the same names but different semantics.

This is a short-ish post. I’m sure there are many details I have omitted.

What do I want?

I don’t know. I originally wanted duck typing. Now I am somewhat undecided. I do think Frankenstein impls (something else I originally wanted) are bad (because of the instance coherence problems I alluded to). I think if we make the syntax for “reusing” existing methods to implement a new iface sufficiently compact, it’s probably not so painful. I am not really worried about semantic mismatches: these are rare in dynamically typed languages, and we have types and other checks that make such a mismatch unlikely.

Rust’s Object System

On the rust-dev mailing list, someone pointed out another “BitC retrospective” post by Jonathon Shapiro concerning typeclasses. The Rust object system provides interesting solutions to some of the problems he raises. We also manage to combine traditional class-oriented OOP with Haskell’s type classes in a way that feels seamless to me. I thought I would describe the object system as I see it in a post. However, it turns out that this will take me far too long to fit into a single blog post, so I’m going to do a series. This first one just describes the basics.

One caveat: I think that these techniques are novel, at least in some parts. However, I am not well-versed in the Haskell literature and it’s possible that the techniques we aim to implement have been explored already. If so, I’d appreciate it if someone would point me in the right direction! There are some links in his post that I haven’t read, for example, but I will definitely put them on my reading list.

EDIT: It’s a bit unclear what I precisely think is novel. In fact, when I wrote the previous paragraph, I was referring to our proposed technique for enforcing instance coherence. However, I didn’t even describe this problem in this post, because I realized there was a lot of background to cover. So, to be clear, I don’t think that the basics in this post are terribly novel—with the exception of our use of the same interfaces to unify Haskell-style type-classes (or C++ concepts, if you prefer) with OOP-style existential (sub)typing. That particular part works out quite well, I think.

The building block: ifaces

The fundamental building block of Rust’s OOP system is the iface (interface). As in Java and other languages, an iface is just a set of methods without implementations. Let’s use the example of a hashable value, which might be suitable for use as the key in a hashtable:

iface hashable {
    fn hash() -> uint;
    fn eq(t: self) -> bool;
}

This interface provides two methods. The first, hash(), computes a hash of the value and the second compares for equality. You see that an iface can use the special type self. The type self means “the same type as the receiver”. A later post will demonstrate that this type—while extremely useful!—introduces some complications.

Classes

Classes are like a pared down version of the classes you will find in other languages. As in C++, they have fields, methods, constructors and an optional destructor. However, they do not inherit from one another (we will see how to do polymorphism in a bit). You can define a class like so:

class a_class {
    let x: int, y: uint;

    new(x: int, y: uint) {
        self.x = x;
        self.y = y;
    }

    fn get_x() -> int { self.x }
}

The precise syntax will probably change (I am not fond of the definition of constructors, in particular), but the basic idea will remain the same: a class combines a set of fields with various methods. Members can be defined as private or public with the usual, C++- or Java-like definition. Fields can be immutable (the default) or mutable (let mut x: int).

Polymorphism using classes and ifaces

There is no subtyping between classes. However, sometimes you would like to have a routine that operates on multiple types. The canonical example is to have an interface for “drawable” things like:

iface draw {
    fn draw(gfx: graphics_context);
}

Along with various drawable shapes like:

class square { fn draw(gfx: graphics_context) { ... } }
class circle { fn draw(gfx: graphics_context) { ... } }
...

Rust then offers you two ways to work with these drawable things. The first, interface types, is more like C++ or Java. The second, bounded type parameters, is more like Haskell’s type classes. As we will see, each technique is useful for different scenarios.

Interface types

As in Java, an interface like draw also has a corresponding type (simply written as draw). In fact, it has a family of types (draw@, draw~, draw&, and draw) just as with function pointers, but for now there is no need to get into the full details. The type draw will suffice.

The type draw means “some value which implements the drawable interface”. We can use the draw type to write a function which takes a vector of drawable things and draws them all:

fn draw_all(gfx: graphics_context, drawables: [draw]) {
    for drawables.each {|drawable|
        drawable.draw(gfx)
    }
}

This looks pretty close to Java or C++. However, what happens at runtime is somewhat different in some pretty important ways. For one thing, the draw type in Rust is represented as the pair of a pointer to the instance data along with a vtable. Invoking the draw method, therefore, is simply a matter of extracting the function pointer from the vtable and invoking it with the instance data as the (implicit) first argument.

This representation is somewhat different from Java or C++, both of which would have a single pointer to the object and would embed the vtable in the object itself. There are a variety of reasons that we take a different approach which I will cover later.

The reason I am talking about how draw instances are represented at runtime is that it is not the same as the way that a @circle instance (for example) is represented. The type @circle is just a pointer to the a block of memory containing the fields for the class circle. There is only a single pointer and there is no vtable. So we cannot simply interpret the type @circle as a draw instance without doing some conversion.

In Rust, this conversion is accomplished by casting the @circle instance to the draw type. So, an example of using the draw_all method might look like:

fn draw_a_square_and_a_circle(gfx: graphics_context) {
    let s = @square(...);
    let c = @circle(...);
    let objs = [s as draw, c as draw];
    draw_all(gfx, objs);
}

Here you can see that to construct the vector of drawables, we first casted s and c to the type draw. This cast constructs the pair of the s and c pointers along with the appropriate vtable (in the first case, one for square, in the second case, one for circle).

Why is it designed this way?

There are a variety of reasons that we took a different approach from that used in Java or C++. First, we wished to preserve the nice quality of C++ that all virtual calls are implemented using simple vtables: this is an efficient technique with reliable performance. In Java, in contrast, the precise implementation of interface calls can vary. Of course the JIT is able to generally produce efficient code (typically using PICs or similar things) but we want to be able to statically compile Rust without the need for just-in-time techniques.

However, we also did not want to require that classes be pre-declared as “implementing” a particular interface (or, in the case of C++, extending the given abstract class). In C++, the subtyping relationship is used to guide the construction and layout of the vtables (and, in some cases, multiple such vtables may be needed, meaning that there is no unique pointer to the object data itself). Without having that pre-declared relationship, we cannot pre-compute the vtable(s) for an object in advance.

Therefore, we instead wait and lazilly construct the vtable at the point of the cast (actually, there will be one vtable for each class-iface pair that appears within a crate). By representing the draw instance as the pair of the instance data with the vtable, we can easily have one class instances associated with any number of vtables.

Type classes

There are two fundamental approaches to writing polymorphic functions (in general, not just for interface types). The Java and C++ technique, which we illustrated in the previous section, is to use subtyping. Another approach, pioneered in functional languages (though it is also available in OOP languages) is to use parametric (or “generic”) functions. For example, we could write a function draw_many like so:

fn draw_many<D:draw>(gfx: graphics_context, drawables: [D]) {
    for drawables.each {|drawable|
        drawable.draw(gfx)
    }
}

draw_many() looks very similar to draw_all. It declares a type parameters D and says that the type D must implement the draw iface. This draw interface is called the bound of the type parameter D, because it bounds (or “puts a limit”) on what types can be used for D: they must be types for which the interface draw is available. It then takes a vector of D instances and iterates over its contents, invoking the draw() method on each value.

There is in fact a subtle different between draw_all() and draw_many(). draw_all() took a vector of type [draw]: this means that each entry in the vector may in fact correspond to a distinct kind of drawable thing. For example, the vector might have a square and a circle, as we saw. draw_many(), in contast, takes a vector of type [D]. This means that the type D could be a square (which is drawable) or it could be a circle (which is also drawable), but you cannot have a vector containing both a square and a circle.

To see more closely why this is, consider that at runtime we implement generic functions like draw_many() by following the C++ approach: that is, we duplicate the function for each type that it is used with. Therefore, we can easily create a version of draw_many() for squares by substituting square for each use of the type D:

fn draw_many<square>(gfx: graphics_context, drawables: [square]) {
    for drawables.each {|drawable|
        drawable.draw(gfx)
    }
}

We can also create a similar one for circles, but there is no type (other than draw) that we could use to create a version that accepts a vector containing both circles and squares. In fact, there can be no such vector: all vectors must contain instances of a single type.

Using the type-class style of implementation is generally more efficient than the traditional OOP-style, because it produces no vtables at all (but it does produce more code, which has its own inefficiencies). This efficiency comes at the price of less flexibility, because the style cannot deal with heterogeneous collections.

Actually, this is not strictly true: it is (usually) allowed to instantiate the type D with an iface type, so we could still invoke draw_many() with a vector of draw instances, just as we did with draw_all(). This would be equally (in)efficient as the OOP version, because all method calls would still go through a vtable.

Code reuse via traits

Inheritance is often used as a means of achieving code reuse in OOP languages. While it can be convenient, this is generally regarded as unfortunate, because it ties together the subtyping relationship with details about code reuse. A more modern approach is to make use of traits. Rust offers traits but I won’t go into detail here. In effect, traits allow you to factor out common method implementations in a much more flexible way than inheritance, without introducing the complications of traditional multiple inheritance.

Impls

So far, the only way to define a value with a method is to define a class and include the method in the class definition. This is too limiting, however, in two ways. First, sometimes we want to define methods outside the class body—for example, to extend a class defined in one crate or module from somewhere else. Second, not all types in Rust are classes (for example, ints and vectors) and we don’t want them to be, for efficiency reasons and C compatibility.

To address these two needs we allow you to define methods for a given type using the keyword impl. For example, suppose we want to add a method bounds() that computes a bounding rectangle for a shape. You might do something like this:

impl bounds for square {
    fn bounds() -> rect;
}

Here the syntax impl N for T defines a suite of methods named N for the type T. You can also associate an impl with an iface like so:

iface bounds {
    fn bounds() -> rect;
}

impl of bounds for square {
    fn bounds() -> rect;
}

In this case, the name of the method suite is (by default) the name of the iface. The full syntax is impl N of I for T.

Using an impl, we can generalize interfaces to apply to arbitrary types. For example, we could implement the draw interface for a uint (whatever that means):

impl of draw for uint {
    fn draw(gfx: graphics_context) { ... }
}

Then a [uint] could be passed to draw_many(). Similarly, we could cast a uint to draw.

Scoping of impls

In order to make use of the methods in an impl, you must bring the impl into scope using an import statement. This is where the impl name comes into play. So, to use the bounds method from another module, I must include something like:

import B::bounds;

where B is the module containing the impl declarations. The same visibility rules apply when trying to cast a type to an iface or use the type as the value for a bounded generic type parameter.

Mismatches

To some extent, the class and impl system were independently designed, and there are a few mismatches (mostly in code that has not been fully implemented). The main one is that interfaces are duck-typed (not declared) and impls declared when they implement an iface. We will align these to be the same (for the moment, probably initially by adding the ability to declare an interface when you declare a class).

For Loops

First off, I want to welcome Brian Anderson to the Rust blog-o-sphere (which so far consists primarily of myself). His first post does a great job of explaining how to use the new for syntax that was recently added to Rust: this syntax allows for break, ret, and cont from within user-defined loops, which is very nice.

Reading some of the Hacker News comments (this one in particular), I wanted to clarify one thing. There is some concern that this new syntax changes the semantics of ret when, in fact, it aims to do precisely the opposite.

The goal is that ret always returns from the innermost enclosing fn() declaration. Sugared closures (e.g., {|x| ...}) do not count as an fn-declaration, but a closure written out with fn() does. If you use ret from a context where the compiler cannot generate a return from the innermost enclosing fn() declaration, a static error results.

Here are some examples of what I mean. First, the basic for loop:

fn foo() {
    for each(v) {|e|
        ret e; // returns from foo()
    }
}

Here I am using an fn@() closure:

fn foo() {
    let bar = fn@() -> T {
        for each(v) {|e|
            ret ...; // returns from bar()
        }
        ret ...; // returns from bar()
    };

    ret ...; // returns from foo()
}

and here is an example where an error results:

fn foo() {
    iter(v) {|e|
        ret e; // should return from foo(), but cannot
    }
}

Note that returning out of sugared closures like {||...} is only allowed in the context of a for loop, since it requires additional code generation to support.

Servo Design

Yesterday we had a hackathon/meeting to discuss the overarching design of Servo, the project to build a next-generation rendering engine. We didn’t ultimately do much hacking (though we did a little), but mostly we tried to hammer out the big picture so that we can actually get to writing code. I wanted to try and write up what I understood as the consensus (for the moment, anyway).

The big picture

There will be (at least) three large components. Each is basically operating in independent tasks and the various stages are therefore largely isolated from one another and able to execute independently (with certain exceptions, as we shall see):

  • JS
  • Layout
  • Painting

There are several data structures that will be maintained by these different stages:

  • The DOM
  • The “Layout Tree” (CSS boxes corresponding to each DOM element)
  • The “Display Tree” (what to draw at each location)
  • Various other structures:
    • backing store(s) for canvas etc.

I’ll go over each in turn.

The DOM and Layout Tree

The most interesting—and complex—part of the design centers around the representation of the DOM. We want the ability for layout to execute in parallel with the JS itself. However, both layout and JS require access to the DOM; and, of course, the JS may choose to modify the DOM at any time, and those changes should eventually be reflected in the layout. Initially the plan was to overcome this by having two DOMs: the main DOM, accessible to JS, and the shadow DOM, accessible to layout. The shadow DOM would be kept up-to-date by messages from the JS. The problem with this plan is simply overhead: based on our own experiments as well as feedback from Ben Lerner, we decided this is not the best approach.

An alternative that we are considering instead is what we call the RCU approach. The name derives from the read-copy-update pattern used extensively in the Linux kernel. The idea itself was also inspired by the work on Concurrent Revisions by Burckhardt et al. at MSR.

In a nutshell, the idea of the RCU plan is that when the JS node kicks off a layout task, it will preserve the version of the DOM that the layout is reading. So any changes that occur while layout is active must take place on a copy of the DOM. Of course, it would be too expensive to do a deep copy of the DOM when layout activates, and traditional persistent data structures like maps and vectors are are not much help either.

One key ingredient for any RCU-like plan is that it must be possible to know when readers are active. It turns out that we should be able to track this for layout and JS. Basically, the JS task is the “driver”: it decides when to start layout and may, in some cases, have to block waiting for layout to terminate.

You can think of the main JS task as operating in a loop something like this:

layout_active = false;
dirty_nodes = NULL;
loop {
   execute_JS();

   if (dirty_nodes) {
       if (layout_active) {
           join_layout();
       }
       spawn_layout();
       layout_active = true;
   }
}

It can also happen that the JS requests the computed style information or layout. In this case, then JS must first join the layout task (and, if the tree is dirty, it may have to spawn the task too!).

Our plan instead is to replace each pointer to a DOM node (node*) with a handle (rcu<node>*). This handle will be a structure like the following:

struct rcu<T> {
    T *wr_ptr;
    T *rd_ptr;
    rcu<T> *next_dirty;
};

The wr_ptr points at the current version of the node, whereas the rd_ptr points at the version of the node that layout is operating on. At the moment when a layout task is spawned, rd_ptr and wr_ptr are always the same. Whenever JS wishes to make modifications and layout is active, it follow an algorithm something like this:

void dirty(rcu<T> *handle) {
    if (handle->wr_ptr != handle->rd_ptr)
        return; // already dirty
    if (!layout_active)
        return; // doesn't matter

    handle->wr_ptr = new T(*handle->rd_ptr); // copy rd data
    handle->next_dirty = dirty_nodes;
    dirty_nodes = handle;

    return;
}

After this, it is safe for the code to make changes to the contents of handle->wr_ptr.

The final step is to reset the rd_ptr to the wr_ptr. This occurs once layout is completed. For example, we might implement the join_layout() routine like so:

void join_layout() {
    layout_task->join();

    // Reset read and write pointers:
    rcu<T> *p = dirty_nodes, *pn;
    while (p != NULL) {
        pn = p->next_dirty;
        p->rd_ptr = p->wr_ptr;
        p->next_dirty = NULL;
        p = pn;
    }
    dirty_nodes = NULL;
}

Note: the small details of this implementation will probably change. For example, it might be better to store the dirty_nodes in a vector instead of a linked list, or at least pull the next field out somewhere else (this would for example make sense if the proportion of dirty to clean nodes is small, as expected). But you get the idea (I hope, anyway).

So now that we’ve explained the basics, let’s look at a few variations.

Separating layout into phases

I described layout as one monolithic entity. But in fact it can be useful to separate it into multiple parts. For example, some JS calls require that style computation be completed, but do not require that the actual layout boxes be computed nor that the geometry is complete. Therefore, we can break the layout task into multiple tasks, allowing the JS to join just the phase that it requires (as well as allowing it to spawn a task which will only perform the style computation and so forth.

Triggering layout at other times

For things like CSS animations, we would like to be able to trigger an animation even while the JS is active. We can do this without great difficulty thanks to the periodic callback which the JS makes every N operations or so. Basically, when the animation is ready to begin the next layout step, it will asynchronously set a flag (animation_requires_layout). During the JS callback, if layout is inactive but animation_requires_layout is true, then it will spawn off a layout task. Any writes which occur after that point will have to be RCU’d.

One issue with this which I can see: the layout task will see whatever DOM modifications had occurred up until the point of the interrupt. This doesn’t seem immediately desirable to me. It could be circumvented by tracking dirty nodes even when layout is inactive, and just resetting the rd_ptrs every turn of the JS event loop.

Other stuff

We have to be careful around the backing buffers for Canvas layers and other such data structures. This doesn’t seem especially hard but we’ll want to think about it. Most likely Canvas will need to be double-buffered and we’ll just swap the buffers at the same point we adjust the rd_ptr (when you think about, the RCU scheme is basically double-buffering for the DOM).

Memory management

Writing this up has brought some questions to mind. Primarily my concerns center around memory management. Garbage collection operating in the JS task while layout is active will have to be quite careful. It can safely trace through both the rd and wr ptrs for DOM nodes, but if there are links from the DOM nodes to the computed style and layout information (which we had thought to have), then it is not safe for the GC in the JS task to look at those. The layout may be concurrently modifying them after all. There is also the matter of managing the memory for the layout data structures.

One solution is for GC to simply join the layout task before it begins execution. Or, similarly, to distinguish small collections—in which we ignore layout data structures—from large collections, in which case layout must be joined. This is probably good enough.

Painting

When layout finishes, it can perform a paint by building up what is now called a display tree—basically a list of rectangles to draw and their contents—and send this off to the display task. The display task is then charged with walking this tree, rasterizing its contents, and blitting the data to the screen. This process can be done in a very parallel way using rather simple techniques (blit any non-overlapping rectangles in parallel, etc). It can also use simple caching to avoid expensive rasterizations, as Gecko does today. In short, it seems fairly straightforward.

The plan

We hope to quickly build up a fairly rudimentary form of Servo based on this architecture. The layout algorithms themselves will probably be initially implemented in a sequential fashion. We can still get quite a lot of simple parallelism from various pieces of low-hanging fruit: selector matching, painting, etc. And we get quite a bit of pipeline parallelism and responsiveness by separating the various tasks. But eventually of course we hope to parallelize the layout pipeline itself.

One important point which bz raised is that sometimes the raw performance of layout is not terribly important—but it is important that the browser stay responsive. The fact that Gecko must do layout on the main thread harms responsiveness, something which we should be able to avoid.

A final note

One disappointing aspect of this plan is that the existing static data-race verification techniques are so inadequate to the problem. Actor-based solutions require total separation of the DOM trees, leading to unacceptable overhead. Simple parallelism like that offered by painting will likely never be analyzable by any simple static regimen: it would have to be able to reason about the fact that painting two rectangles which do not overlap is a commutative operation. The RCU-like plan would of course be very hard to statically analyze, though if you build something like that into your language as a base abstraction—as with the Concurrent Revisions work—that might work out well.

In general though I believe that a balanced approach to race detection is best: statically verify where you can, accept the limited use of dynamic schemes otherwise. And we should be able to statically verify simpler subproblems, for example, ensuring that layout only reads data that is reachable via the rd_ptr and that JS only writes to data reachable via the wr_ptr (this can be solved by a simple ADT which only grants access via one pointer or the other).

Avoiding Region Explosion in Rust

pcwalton and I (but mostly pcwalton) have been hard at work implementing regions in Rust. We are hoping to use regions to avoid a lot of memory allocation overhead in the compiler—the idea is to use memory pools (a.k.a. arenas) so that we can cheaply allocate the data needed to process a given function and then release it all in one shot. It is well known that arenas are great fit for the memory allocation patterns of a compiler, which tend to produce a lot of data that lives for the duration of a pass but is not needed afterwards.

In any case, recently we had a discussion about how we can use regions in the trans pass of the compiler: this is the pass which converts from our internal representation (IR) to the LLVM’s IR. I thought it was worth sharing the result of this discussion. The basic summary is that we are able to make use of region subtyping to accommodate a fairly complex pattern of lifetimes with very little annotation overhead.

The setting: contexts in trans

First, let me introduce the problem: during translation, we produce a lot “contexts”, which store needed data about the state of the translation. For our purposes, there are three contexts of note: the crate context, or ccx, which stores crate-wide data such as linkage information about top-level functions, constants, and so forth; the function context, or fcx, which stores per-function data such as references to the LLVM variables representing its parameters and locals; and finally the block context, or bcx, which stores information about a single basic block in the control-flow graph.

What we would like to do is to create the crate context ccx on the stack when we enter the translation phase for the crate as a whole. Later, when we begin to translate a given function, we will allocate its function context fcx on the stack as well. The block contexts, however, are a little different: they do not fully obey a stack discipline. That is, it is common for a function to create a new block context and return it to its caller, perhaps with a signature like the following:

fn compile_if_then_else(bcx0: @block_ctxt,
                        cond: @expr,
                        then_blk: @code_block,
                        else_blk: @code_block) -> @block_ctxt

This function would presumably generate the diamond-shaped if-then-else pattern. The initial block is the block represented by bcx0. The function will compile the condition cond and generate branch to one of two new basic blocks representing the true and false paths. The code might look something like this (note: this is not the actual code in rustc, which is naturally much messier):

let (bcx1, val) = compile_expr(bcx0, cond);
let mut bcx_true = new_bcx(bcx0.fcx);
let mut bcx_false = new_bcx(bcx0.fcx);
add_instr(bcx1, if(val, bcx_true, bcx_false));

The then and else blocks could then be compiled in the contexts of those true and false blocks:

bcx_true = compile_block(bcx_true, then_blk);
bcx_false = compile_block(bcx_false, else_blk);

And finally the two paths can be merged into a new block, which is the block that gets returned:

let bcx_join = new_bcx();
add_instr(bcx_true, goto(bcx_join));
add_instr(bcx_false, goto(bcx_join));
ret bcx_join;

The problem: expressing context lifetimes with regions

Let’s dig a bit more into the representation of these contexts. The details aren’t too important but I want to focus on the region-related aspects that describe their lifetimes. Remember that there is a crate context ccx that is valid for the translation of the entire crate. Its contents are not important, so let’s just assume it’s some record:

type crate_ctxt = {
     ...
};

Then there is a function context. It contains a pointer to the crate context, along with some other stuff:

type func_ctxt = {
    ccx: &crate_ctxt,
    ...
};

Finally the block context, which contains a pointer to the function context:

type block_ctxt = {
    fcx: &func_ctxt,
    ...
};

Here I have shown the pointers as region pointers, but I haven’t written any explicit region annotations. The question is, what regions should we associate with those pointers?

The maximally expressive approach

If you wanted to take the maximally expressive approach, you would wind up with a lot of region parameters. For now I will show this in a very explicit syntax in which types are given explicit Region parameters, but this syntax is not valid Rust and (hopefully) never will be:

type crate_ctxt = {
     ...
};

type func_ctxt<&c> = {
    ccx: &c.crate_ctxt,
    ...
};

type block_ctxt<&f,&c> = {
    fcx: &f.func_ctxt<&c>,
    ...
};

You can see the problem. The type for the block context must be annotated with two region parameters, one to describe the region of the function context and one for the crate context.

In this technique, if we have a variable bcx of type &b.bcx<&f,&c>, then bcx.fcx.ccx will have type &c.ccx: the precisely correct region, presumably.

For reference, the signature of compile_if_then_else() would become:

fn compile_if_then_else(bcx0: &b.block_ctxt<&f,&c>,
                        cond: @expr,
                        then_blk: @code_block,
                        else_blk: @code_block) -> &b.block_ctxt<&f,&c>

The minimally expressive approach

The approach we plan to take is much simpler. Types do not have region parameters. Instead, when we instantiate an &T type to a specific region, the outermost & in a function prototype is assigned a fresh region, but & which appear within that type are assigned to this same fresh region. This means that if we have a variable bcx with type &b.bcx, then bcx.fcx.ccx will have type &b.ccx: this is an underapproximation of the lifetime of the crate context. The true lifetime is &c which is some superset of &b. The reason that this whole scheme type checks is because of the subtyping relationships between region pointers: a reference with a longer lifetime (like &c.ccx) can be used wherever a reference with a shorter lifetime (like &b.ccx) is expected.

Under this approach, the signature of compile_if_then_else() becomes:

fn compile_if_then_else(bcx0: &b.block_ctxt,
                        cond: @expr,
                        then_blk: @code_block,
                        else_blk: @code_block) -> &b.block_ctxt

Not so bad.

Arenas and placement new

One question remains: because the lifetime of block contexts is not bound by the call stack, how can we manage their allocation without resorting to heap allocation (the function context and crate context can be allocated on the stack)? The answer is that we will use arenas.

An arena is basically a pool of memory in which we can allocate lots of data and then release the pool all in one shot. This is very cheap but only suitable for places where allocation follows a “phase-based” pattern.

We will use a memory pool which is allocated and released per-function. Therefore, the pool itself will be stored in the function context:

type func_ctxt = {
    ccx: &crate_ctxt,
    pool: &memory_pool,
    ...
};

In the current Rust type system, anyhow, a memory pool can be any type for which there exists an impl offering an alloc(sz: uint, align: uint) -> *() method, which allocates sz bytes of memory at the given alignment and returns a pointer. An expression like new (pool) value will cause pool.alloc() to be invoked and will then store the value into the memory location that was returned. The result is a region pointer in the same region as the pool itself.

This means that allocating a new block context looks something like:

fn new_bcx(fcx: &f.func_ctxt) -> &f.func_ctxt {
    new (fcx.pool) {fcx: fcx, ...}        
}

The Summary

The basic idea of the approach is to retain less information. For a given region pointer p, all you know is that any data reachable via some path like p.f.g.h will be live as long as p is live. It seems that this is enough in practice for most real use cases. Time will tell, I suppose.

Progress on Inlining

Cross-crate inlining has come a long way and is now basically functional (I have yet to write a comprehensive test suite, so I’m sure it will fail when exercising various corners of the language).

Just for fun, I did some preliminary micro-benchmarks. The results are not that surprising: removing method call overhead makes programs run faster! But it’s still nice to see things go faster. We’ll look at the benchmarks, see the results, and then dive into the generated assembly. In all cases, I found LLVM doing optimizations that rather surprised me.

How to use it

Actually, rustc has been doing inlining without any special annotations for a long time—but only within one crate. If you want to enable a function to be inlined when called from another crate, you simply have to add an #[inline] annotation to it, like so:

#[inline]
fn range(lo: uint, hi: uint, it: fn(uint)) {
    let i = lo;
    while i < hi { it(i); i += 1u; }
}

This is the uint::range() function, which simply invokes its argument on every integer in a particular range.

The reason that an annotation is required to inline calls to functions in other crates is that cross-crate inlining complicates the recompilation model. Normally, crates are dynamically linked, so if you change the implementation of a function but not its type signature, then there is no need to recompile dependent crates or programs. However, if an inlined function is changed, then every caller must be recompiled in order to observe that change, as the source of that function will have been inlined into their local compilation units (of course, if the inlined function is not exported or not used, then there is again no need to recompile dependent crates).

The #[inline] directive currently takes one option: you can write #[inline(always)]. The difference is that the former is a hint, which the compiler may choose to ignore. The always directive makes the hint stronger, causing the compiler to ignore the typical heuristics and thresholds that it uses to decide when to inline. Currently, these hints are passed on directly to LLVM; unfortunately, I have found that if you do not write #[inline(always)], LLVM almost always chooses not to inline, so probably we have to adjust the heuristics somewhat for Rust code.

Benchmark #1: uint::range

uint::range is Rust’s way of iterating over a range of integers. The following simple program simply sums up the integers from 0 to N, where N is provided on the command line:

fn main(args: [str]) {
    let r = option::get(uint::from_str(args[1]));
    let sum = 0u;
    uint::range(0u, r) {|i|
        sum += i;
    }
    io::print(#fmt["Sum from 0 to %u is %u\n", r, sum]);
}

Before inlining, this program would literally create a stack closure for the body of the loop and pass it to the library function range (the source of which was shown above). Range would then iterate and invoke the closure on every iteration.

We’ll look at the generated assembly shortly. But first, let’s see some simple performance measurements:

; rustc -O --inline --monomorphize ~/tmp/iterator.rs
; time ~/tmp/iterator 10000000000
Sum from 0 to 10000000000 is 13106511847580896768

real    0m0.016s
user    0m0.010s
sys 0m0.006s
; rustc -O ~/tmp/iterator.rs -o ~/tmp/iterator-no-inline
; time ~/tmp/iterator-no-inline 10000000000
Sum from 0 to 10000000000 is 13106511847580896768

real    0m48.217s
user    0m48.203s
sys 0m0.014s

As you can see, the inlining optimizations are still not enabled by default (at least on my machine, compilation does succeed with inlining enabled (or it did when I last tested it), but I am still not happy with the auto-generation of the serialization code and so I did not want to have the main build of the compiler depend on it yet). However, there is a big difference between the inlined and non-inlined version of this benchmark! The non-inlined form took about 3013 times as long! We’ll see why this is when we dig into the generated assembly. The reasons surprised me a bit.

Generated assembly

A (somewhat simplified and annotated) extract of the generated assembly for the uint::range() example is below. Actually, LLVM is amusingly both extremely smart and kind of dumb here. The actual computation of the sum has been removed and turned into an algebraic formula. After that formula is computed, then there is a useless little while loop that just iterates from 0 to n doing nothing:

  ...
Ltmp3:
  ; initialize sum to 0u
  ; and branch out if `r` is 0
  movq    $0, -56(%rbp)
  movq    -48(%rbp), %rcx
  testq   %rcx, %rcx
  je      LBB0_9

  ; compute (r*(r-1)) / 2
  ; (closed form of summation)
  ; and store into %rdx
  leaq    -1(%rcx), %rax
  leaq    -2(%rcx), %rdx
  mulq    %rdx
  shldq   $63, %rax, %rdx
  addq    %rcx, %rdx

  ; loop r times doing nothing
LBB0_7:
  decq    %rcx
  jne     LBB0_7

  ; store final result of summation
  ; and move on
  decq    %rdx
  movq    %rdx, -56(%rbp)

LBB0_9:
  ...

Benchmark #2: vec::iter

Well, that benchmark was fun but since LLVM got so smart it’s not as interesting as I’d like. So I wrote up another one that uses vec::iter(). This will also have the added benefit of showing off Marijn’s work on monomorphization, which optimizes our treatment of generic functions. The example is basically the same as the previous one, but it uses vectors:

fn main(args: [str]) {
    let r = option::get(uint::from_str(args[1]));
    let v = vec::enum_uints(0u, r);

    let start = std::time::precise_time_s();

    let sum = 0u;
    vec::iter(v) {|i|
        sum += i;
    }

    let end = std::time::precise_time_s();
    io::print(#fmt["Sum from 0 to %u is %u\n", r, sum]);
    io::print(#fmt["time: %3.3f s\n", end - start]);
}

Unfortunately, the time to execute is largely dominated by building up the vector of integers we’re going to iterate over, so I added some measurements of the time spent iterating to get a better idea of the effects of inlining.

Before we dig into the generated assembly, let’s look at the measurements:

;rustc -O --inline --monomorphize ~/tmp/iterator_vec.rs
;~/tmp/iterator_vec 100000000
Sum from 0 to 100000000 is 5000000050000000
time: 0.140 s
;rustc -O ~/tmp/iterator_vec.rs -o ~/tmp/iterator_vec-no-inline
;~/tmp/iterator_vec-no-inline 100000000
Sum from 0 to 100000000 is 5000000050000000
time: 1.183 s

Woohoo, the non-inlined version took 8 times longer. That’s satisfying. More satisfying, in a way, than the 3000x improvement from before, since it suggests we’re doing things better but not just winning by a kind of trick.

(Sharp-eyed readers may have noticed that the results of the summation are different than before. This is because vec::enum_uints() generates a vector of i such that 0 <= i <= N whereas uint::range() explores the range 0 <= i < N. Yay for consistency.)

Defining vec::iter

Before we look at the assembly, let’s see how vec::iter() is defined:

#[inline(always)]
fn iter<T>(v: [const T], f: fn(T)) {
    unsafe {
        let mut n = vec::len(v);
        let mut p = unsafe::to_ptr(v);
        while n > 0u {
            f(*p);
            p = ptr::offset(p, 1u);
            n -= 1u;
        }
    }
}

This implementation makes use of pointer arithmetic contained within an unsafe block. It’s basically equivalent to the following C++-ish code:

template<class T>
void iter(vec<T> vec, void (*f)(T&)) {
    n = len(vec);
    T *p = data(vec);
    while (n > 0) {
       f(*p);
       p += 1;
       n -= 1;
    }
}

Generated assembly

OK, now let’s look at the assembly. We’ll see that we’re generating pretty decent code. One thing that could perhaps be improved is that the call to unsafe::to_ptr() does not appear to have been inlined despite the fact that its definition is marked as #[inline(always)]. Note sure why that is. Another thing (which may be related) is that p is not stored in a register but rather loaded on each iteration from the loop. But I’m not sure how significant that is when the effects of caching and so forth are taken into account.

One interesting thing is that LLVM converts the loop from one which counts down to a loop which counts up. It does this by first negating n. I’m not sure why this should be faster, I guess that it lets you generate more compact instructions somehow or perhaps enables other optimizations later on. Can’t say I’ve ever looked into these kind of micro-optimizations around loop counters in detail.

Ltmp7:
    ; Initialize sum to 0:
    movq    $0, -80(%rbp)

    ; let n = vec::len(v);
    movq    -64(%rbp), %rdx
    movq    (%rdx), %rbx

    ; Compute p and store it into -48(%rbp)
    ; (Note: first argument to `unsafe::to_ptr()`
    ;  is the location to write the output)
    leaq    -48(%rbp), %rdi
    callq   __ZN3vec6unsafe8to_ptr1217_f332097e13dd07e5E

Ltmp9:
    ; Convert size from bytes into indices:
    shrq    $3, %rbx
    testq   %rbx, %rbx
    je  LBB0_12

    ; Convert counter to -n:
    negq    %rbx

    ; Zero out the sum, which will be held in %eax:
    xorl    %eax, %eax

LBB0_10:
    ; Load *p and add to the sum:
    movq    -48(%rbp), %rcx
    addq    (%rcx), %rax

    ; p++
    addq    $8, -48(%rbp)

    ; n++, stop when we reach zero:
    incq    %rbx
    jne LBB0_10

    ; Move sum from %rax into its home on the stack:
    movq    %rax, -80(%rbp)

LBB0_12:
    ...

Goodbye!

I hope you enjoyed this little dive into our code generation.

Serialization Without Type Information via Impls

My current implementation of the auto-serialization code generator requires full type information. This is a drag. First, macros and syntax extension currently run before the type checker, so requiring full type information prevents the auto-serialization code from being implemented in the compiler, as it should be. At first I wanted to change how the compiler works to provide type information, but after numerous discussions with pcwalton and dherman, I’ve come to the conclusion that this is a bad idea: it requires exposing an API for the AST and for type information and introduces numerous other complications.

I’ve come up with an alternative design that seems to solve this problem. It also addresses another concern I had: how do you allow users to customize the (de)-serialization for a given type without forcing them to customize (de)-serialization for all types? One interesting aspect of this plan, though, is that it requires non-hygienic macros.

My basic plan is to allow type declarations to be decorated with a tag like #[auto_serialize], which will look something like this:

#[auto_serialize]
type spanned<T> = { node: T, span: span };

Here I have deliberately chosen a generic type declaration to use as my running example because (as we shall see) they are particularly complex. Then a pass will run in the compiler which finds all types annotated with #[auto_serialize] and generates serialization and deserialization code that live alongside the declaration. Let’s look first at serialization and then at deserialization: as we shall see, the solution that we use for serialization doesn’t quite work for deserialization, so we have to handle them slightly differently.

Serialization

My original concern was, without type information, how do I know how to serialize the contents of the type? After all, all I have is the AST, so I know some names but that’s it. In the case of spanned<T>, for example, I know there are two fields, one with the type T and one with the type span. I can figure out that T is a type parameter, but I don’t know that span is an import of syntax::codemap::span, and I certainly don’t know that syntax::codemap::span is defined as a record itself.

So how do I generate code to serialize a type like T or span without knowing anything about what that type is? It turns out that we have a nice language tool for doing that: ifaces and impls (a.k.a., typeclasses).

So, for spanned<T>, I will generate something like:

impl of serializable<T: serializable> for spanned<T> {
    fn serialize<S: serialization::serializer>(s: S) {
        s.emit_rec {||
            s.emit_rec_field("node", 0u) {||
                self.node.serialize(s);
            }
            s.emit_rec_field("span", 1) {||
                self.span.serialize(s);
            }
        }
    }
}

You can see that generating this code does not require any information that is external to the type declaration. It just assumes that, for example, there will be a suitable implementation of the serialize() method for the field self.span. Similarly, by parameterizing the impl with the type T and specifying that T must itself be serializable, we can make the same assumption for the field self.node. Pretty nifty.

One very appealing aspect of this is that if I wanted to make custom serialization code for the type span, say, I could just write my own impl for the serialize method. The auto-generated code for serializing spanned<T> would then link to my custom code, no problem. Similarly, I can write custom code that uses auto-generated code without difficulty.

Deserialization

However, this approach does not work for deserialization. After all, we can’t invoke something like data.deserialize(d), as the data is what we are trying to produce!

Therefore, we will generate a different pattern for deserialization. It will look something like this:

fn deserialize_spanned<D: serialization::deserializer,T>
   (d: D, t: fn(D) -> T) -> spanned<T> {

   d.read_rec {||
       {
           node: d.read_rec_field("node", 0u) {|| t() },
           span: d.read_rec_field("span", 1u) {|| deserialize_span(d) }
       }
   }
}

Here, we generate a deserialize_X() function where X is the (unqualified) name of the type being deserialized. The number of arguments expected by this deserialize_X() function varies: the first argument is always a deserializer, but then there are additional arguments for any type arguments. These parameters are dealt with implicitly when using ifaces and impls, but since that machinery won’t work for us we have to thread it through manually now.

More interesting than the case of the field node, actually, is the field span: here, we don’t even try resolve the identifier, we just generate a dangling reference to a function deserialize_span() and we assume that the user has either imported this function or defined it locally. This is where the lack of hygiene is required.

Some other cases that don’t appear here:

  • if the type of a field is a path like a::b::c, then we generate a call to a function like a::b::deserialize_c(d).

  • if the type of a field is parameterized, like spanned<item_>, then we generate a call like deserialize_spanned(d, {|| deserialize_item_(d) }), where the sugared closure {||...} represents the code to unpack the type argument.

Feedback

I am 100% positive people have solved this problem before in a million ways, no doubt including this one. Am I missing something obvious? Also, would it be better to avoid using iface/impl for serialization and just generate functions named serialize_X() just as I do with deserialize_X()? I thought it’d be nice if the serialization were as natural to write as possible, but I guess that if you have to write custom serialization code, you generally need custom deserialization code too, so it doesn’t help so much.

Regions Tradeoffs

In the last few posts I’ve been discussing various options for regions. I’ve come to see region support as a kind of continuum, where the current system of reference modes lies at one end and a full-blown region system with explicit parameterized types and user-defined memory pools lies at the other. In between there are various options. To better explore these tradeoffs, I wrote up a document that outlines various possible schemes and also details use cases that are enabled by these schemes. I don’t claim this to be a comprehensive list of all possible schemes, just the ones I’ve thought about so far. In some cases, the descriptions are quite hand-wavy. I also think some of them don’t hang together so well.