Writing A Template Language
Temple of Odin is a template language interpreter I've been working on. As of writing, it's at least good enough to create static pages on this very website. I wanted to give an overview of how it works, how the codebase has evolved, and what I've learned from it. The code is open sourced here.
Temple of Odin is a template markup language. This means its sole purpose is to seed data into a text template. Below is an example of the syntax of ToO (Temple of Odin).
Tomorrow I would like to wake up at @($.hour):00am
The seeded data (identified by the dollar sign) has some kind of hour field, probably an integer. The interpreter finds this field and inserts the value into the template. Doing the same thing with a standard C format string might look like this.
printf("Tomorrow I would like to wake up at %d:00am", hour);
The difference between ToO and C format specifiers is that ToO can put logic into the template itself. If we wanted, we could change the template to convert from a 24-hour internal representation to be displayed as a 12-hour clock.
Tomorrow I would like to wake up at @($.hour > 12 ? $.hour - 12 : $.hour):00@($.hour > 12 ? "pm" : "am")
| hour | output |
|---|---|
| 7 | Tomorrow I would like to wake up at 7:00am |
| 14 | Tomorrow I would like to wake up at 2:00pm |
| 21 | Tomorrow I would like to wake up at 9:00pm |
The ternary expressions return values that get formatted into the text.
ToO also has if-statements and for-loops:
Watch me count to ten! @for i in 1..=10 {@(i)@if i < 10 {,} }done
I only implemented range-based loops, so there's no possibility of an infinite loop. I can't think of any situation where an unbounded loop in a template would be useful anyway. The loop variable i also shows that ToO is capable of storing state within a scope, and it's easy to declare and assign variables too.
@(x := 0)
@(y := 0)
@(w := $.w)
@if true {
@(x = w)
@(y := w)
}
@(x), @(y)
Note that x = w assigns to the x declared outside the if-block, while y := w declares a new y that shadows the outer one, so only the change to x survives past the closing brace.
This article is not meant to be full documentation of ToO; in fact, I wouldn't recommend using it as it is. That might change as I slowly harden and expand the feature set.
The First Implementation
I'm going into more technical detail now, and I assume you already have baseline knowledge of interpreters. Concepts like abstract syntax trees and tokenizers (aka lexers) won't be explained here.
Sergei Kharni's article, Optimising interpreters: fusion was what spurred me on to this project. His article introduced me to the concept of supernodes, a way to optimize tree walk interpreters by combining node evaluations that often happen together. This process can even be automated through heuristics on code.
Let's say we're adding a constant number to a variable: x + 5. A standard AST might represent this with 3 nodes that need evaluation: a top-level node for an add operation, a node with a constant number, and a node that loads x. If we notice that constants get added to variables a lot, it might be worth it to combine these kinds of operations into one supernode. It's compression that reduces memory usage and the number of nodes that need walking.
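To make the idea concrete, here's a minimal sketch in Go (the real project is written in Odin) of fusing "variable plus constant" into one supernode. All type names here are illustrative, not taken from ToO:

```go
package main

import "fmt"

// A minimal AST sketch. Node names are illustrative, not from ToO.
type Node interface{ Eval(vars map[string]int) int }

type Const struct{ v int }
type Var struct{ name string }
type Add struct{ lhs, rhs Node }

// Supernode: fuses "variable + constant" into a single node,
// saving two child evaluations per visit.
type AddConstToVar struct {
	name string
	c    int
}

func (n Const) Eval(vars map[string]int) int { return n.v }
func (n Var) Eval(vars map[string]int) int   { return vars[n.name] }
func (n Add) Eval(vars map[string]int) int   { return n.lhs.Eval(vars) + n.rhs.Eval(vars) }

func (n AddConstToVar) Eval(vars map[string]int) int { return vars[n.name] + n.c }

func main() {
	vars := map[string]int{"x": 7}
	// x + 5 as three separate nodes…
	slow := Add{Var{"x"}, Const{5}}
	// …and as one fused supernode.
	fast := AddConstToVar{"x", 5}
	fmt.Println(slow.Eval(vars), fast.Eval(vars)) // 12 12
}
```

Both forms compute the same result; the fused node just does one map lookup and one addition with no interface dispatch into children.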
At the time I read that article, I'd been using Go templates. As convenient and simple as Go templates are, I had some frustrations. Setting up custom callbacks is required for a lot of basic functionality, such as some math operations. I also found that the error reporting left a lot to be desired. Making a mistake could mean part of my template not rendering without any details of what was wrong.
With all that in mind, I decided to roll my own template language using supernodes. I started by doing exactly what I'd done for previous interpreters. I defined a new type for all the different node possibilities, then I hooked them into a tree using pointers. I wrote a function called evaluate_node that recursively solved each node.
Node :: struct {
variant: union {
^Source_Node,
^Access_Node,
^Implicit_Selector_Node,
^If_Node,
^For_Loop_Node,
^Operation_Node,
^Unary_Node,
},
next: ^Node,
}
Source_Node :: struct {
using node: Node,
str: string,
}
Access_Node :: struct {
using node: Node,
name: string,
from: ^Access_Node,
}
If_Node :: struct {
using node: Node,
condition: ^Node,
if_true: ^Node,
else_if: ^Node,
}
For_Loop_Node :: struct {
using node: Node,
variables: [2]string,
iterator: ^Node,
block: ^Node,
}
Operation_Node :: struct {
using node: Node,
lhs: ^Node,
rhs: ^Node,
op: Token_Kind,
}
Unary_Node :: struct {
using node: Node,
rhs: ^Node,
op: Token_Kind,
}
evaluate_node :: proc(n: ^Node) -> Value {
switch nvar in n.variant {
...
}
}
The root nodes are structured in a linked list using the next variable. Each one gets evaluated in order, then concatenated onto the output text.
Hello, please call me @(nickname)!
Source_Node{str = "Hello, please call me "} -> Access_Node{name = "nickname"} -> Source_Node{str = "!"}
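Sketched in Go (ToO itself is written in Odin), the top-level render loop over that linked list might look like this. The collapsed Node type and field names are illustrative stand-ins for the real variants:

```go
package main

import (
	"fmt"
	"strings"
)

// Sketch of the first implementation's top level: root nodes form a
// linked list; each is evaluated in order and appended to the output.
// A single struct stands in for the Source_Node/Access_Node variants.
type Node struct {
	str  string // Source_Node-style literal text
	name string // Access_Node-style lookup, if non-empty
	next *Node
}

func render(n *Node, seed map[string]string) string {
	var out strings.Builder
	for ; n != nil; n = n.next {
		if n.name != "" {
			out.WriteString(seed[n.name]) // evaluate the access
		} else {
			out.WriteString(n.str) // emit the literal text
		}
	}
	return out.String()
}

func main() {
	tail := &Node{str: "!"}
	mid := &Node{name: "nickname", next: tail}
	head := &Node{str: "Hello, please call me ", next: mid}
	fmt.Println(render(head, map[string]string{"nickname": "Odin"}))
}
```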
The problem with all these types is that every one requires more code to handle it in the evaluation function. On top of that, the whole point of this supernode concept was that I would create more node types to merge the functionality of existing nodes. The code got more bloated with every type, and I was getting annoyed. It wasn't clear how this could go on while keeping complexity reasonable.
The New Implementation
I was very fortunate to be reading Ryan Fleury's user interface articles. His insights on code complexity were exactly what I needed to hear. This passage from The Widget Is A Lie hit me hard, "Attaching a too high-level or top-down name to nothing more than a useful data-organization mechanism—like 'widget'—can often lead us astray. And, indeed, using the 'widget' label did lead me astray for years."
That was exactly what I'd done. I saw a for-loop, and I created a struct called For_Loop_Node with fields for the declaration, iterator, etc. I didn't think seriously about what the for-loop needed to do until after I created that struct. It turns out that basing my interpreter's internal structure off the way the language looks in a text file is a bad idea.
I needed to find a better way to represent my instructions. The main codebase was too big for fast iteration, so I made prototypes to find what worked. It didn't take long for a much more elegant solution to emerge.
Instruction :: struct {
flags: bit_set[Instruction_Flag], // Behaviors of the node
value: Value, // General value to be pushed or used in an operation
op: Operator, // Specific operator actions
noderef: int, // Node index for jumps
}
// These flags are ordered by the sequence of their respective actions
Instruction_Flag :: enum {
Insert_Guard_RHS, // Insert a guard on the right side of stack
Load, // Load a value from the store using the node string. Replaces node value
Push_RHS, // Sets pushed values to right side of stack
Push_Value, // Push value to stack
Value_Is_LHS, // Set LHS to node value
Peek_LHS, // Peek value from left side of stack
Value_Is_RHS, // Set RHS to node value
Peek_RHS, // Peek value from right side of stack
Swap_Sides, // Switch RHS and LHS values
Enter_Scope, // Enter a new scope
Leave_Scope, // Leave last scope
Flip, // Flip latest bool value on stack
Jump, // Jump unconditionally
Jump_If_False_Or_Error, // Pop last value and jump if it's false or a runtime error
Write_Out, // Pop last value and write to output stream
Consume_Guard_RHS, // Pop from right side of stack until hitting a guard
}
I wasn't trying to make a bytecode interpreter, but that's what emerged after simplifying as much as I could. As far as I can tell, I've reached pretty much the same conclusion that other interpreter developers arrived at long ago: it's way easier to make a not-so-complex interpreter with bytecode. This isn't some IR that gets generated from an AST either; the parser outputs this directly.
Aside from doing jumps, a stack removes the need for nodes to hold references to each other. For many other languages, it can be important to control which sub-trees do and do not get evaluated. For example, func1() && func2() calls func1 first, but if func1 returns false, it won't call func2. This makes a difference if func2 has any side effects. In ToO, however, side effects are not possible at the expression level, so we can evaluate both func1 and func2 before handling the boolean && operation.
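A Go sketch of that point, using hypothetical opcodes: with no side effects, "a && b" can compile to a flat sequence that evaluates both operands first and then applies the operator, with no conditional jump:

```go
package main

import "fmt"

// Hypothetical opcodes for a flat, jump-free boolean expression.
type Op int

const (
	PushBool Op = iota // push a boolean operand
	And                // pop two booleans, push their conjunction
)

type Instr struct {
	op Op
	b  bool
}

func eval(code []Instr) bool {
	var stack []bool
	for _, in := range code {
		switch in.op {
		case PushBool:
			stack = append(stack, in.b)
		case And:
			// Both operands are already on the stack; no short-circuit
			// jump is needed because evaluating them cost nothing extra.
			x, y := stack[len(stack)-2], stack[len(stack)-1]
			stack = append(stack[:len(stack)-2], x && y)
		}
	}
	return stack[0]
}

func main() {
	// func1() && func2() with func1 -> false, func2 -> true:
	// both operands get evaluated, then && runs once.
	fmt.Println(eval([]Instr{{PushBool, false}, {PushBool, true}, {And, false}})) // false
}
```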
The comments in the code above reference a left and a right side of the stack, because the stack is actually double-sided. The right-hand side (RHS) and left-hand side (LHS) refer to the beginning and end of the stack respectively. In a statement like x, y := 1, 2, the interpreter can push the names x and y to the left, then 1 and 2 to the right. Now assignment operations can pop from both sides to create the variables.
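A small Go illustration of that pairing, where two slices stand in for the two ends of ToO's double-sided stack (the real thing is one buffer with two ends):

```go
package main

import "fmt"

// Sketch of the double-sided stack idea: for "x, y := 1, 2" the names
// are pushed to one side and the values to the other, then assignment
// pops one entry from each side to bind them.
func bind(names []string, values []int) map[string]int {
	vars := map[string]int{}
	for len(names) > 0 {
		n := len(names) - 1
		// Pop the newest name from one side and the newest value from
		// the other; pushed in the same order, they pair up correctly.
		vars[names[n]] = values[n]
		names, values = names[:n], values[:n]
	}
	return vars
}

func main() {
	vars := bind([]string{"x", "y"}, []int{1, 2})
	fmt.Println(vars["x"], vars["y"]) // 1 2
}
```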
The best part is how many possibilities are now open for combining nodes. An operator node could store a constant in its value field and use it directly instead of popping from the stack. The Flip flag toggles the last boolean on the stack, so the ! unary operator doesn't need to add any new nodes. All of this comes from trivial additions to the evaluation function.
Benchmarks
Let's see how ToO holds up (in speed) against some other template languages. I chose Go templates and Shopify Liquid (and not mustache) because they have control flow. I admit benchmarks are not perfect, and I don't have any experience with Ruby, which Shopify Liquid is implemented in. The code for these tests is published here. Please feel free to send me feedback if you find any issues with my methodology.
System76 Lemur Pro running Pop!_OS 22.04 (11th Gen Intel i5-1135G7 @ 2.40GHz x 8, 16 GiB of RAM):

| | Large Texts | Large Texts (Pre-Compiled) | Factorials | Factorials (Pre-Compiled) |
|---|---|---|---|---|
| Temple of Odin | 68.352µs | 57.519µs | 116.834µs | 111.403µs |
| Go Templates | 236.192µs | 204.151µs | 502.911µs | 472.606µs |
| Liquid | 740.4µs | 530.43µs | 3.06ms | 2.75ms |

Lenovo Thinkpad T430 running Windows 10 Pro (3rd Gen Intel i7-3632QM @ 2.20GHz x 4, 16 GiB of RAM):

| | Large Texts | Large Texts (Pre-Compiled) | Factorials | Factorials (Pre-Compiled) |
|---|---|---|---|---|
| Temple of Odin | 68.352µs | 57.519µs | 116.834µs | 111.403µs |
| Go Templates | 236.192µs | 204.151µs | 502.911µs | 472.606µs |
| Liquid | 740.4µs | 530.43µs | 3.06ms | 2.75ms |
Doing a speed comparison with Ruby code is a little unfair. I will say Liquid is still the most capable in terms of features right now (I'm assuming from the docs, because I've never used it seriously). I'm more surprised that Go templates didn't hold up better. Sure, Go has some GC overhead, but I figured its more minimal syntax and feature set would give it an edge.
Other Details
The Two Tokenizers
The language passes in and out of two states that I call `source` and `statement`. Take a look at this template:
1 + 1 = @(1 + 1)
Before the @ sign, '1 + 1 = ' is interpreted as a plain string by the `source` tokenizer. The @ sign then tells the parser to enter a statement. The `statement` tokenizer recognizes the ones as number tokens and the plus sign as its own token, and ignores the whitespace.
Why do we even need a tokenizer for plain text? Because you can use backslash escape sequences to skip whitespace and insert special characters.
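Here's a simplified Go sketch of the mode switch (illustrative only: it handles just `@(...)` statements, skips escape sequences entirely, and splits statement tokens on whitespace, unlike the real tokenizers):

```go
package main

import (
	"fmt"
	"strings"
)

// Sketch of the two-tokenizer idea: scan as raw source text until an
// '@' flips the lexer into statement mode, where whitespace is skipped
// and the contents are split into tokens.
func tokenize(input string) (tokens []string) {
	for i := 0; i < len(input); {
		if input[i] == '@' && i+1 < len(input) && input[i+1] == '(' {
			// Statement mode: lex everything up to the ')'.
			end := strings.IndexByte(input[i:], ')') + i
			tokens = append(tokens, strings.Fields(input[i+2:end])...)
			i = end + 1
		} else {
			// Source mode: everything up to the next '@' is one string.
			next := strings.IndexByte(input[i+1:], '@')
			if next < 0 {
				next = len(input) - i
			} else {
				next++
			}
			tokens = append(tokens, input[i:i+next])
			i += next
		}
	}
	return tokens
}

func main() {
	fmt.Printf("%q\n", tokenize("1 + 1 = @(1 + 1)"))
}
```

The first "1 + 1 = " comes out as a single source string, while the parenthesized "1 + 1" becomes three separate statement tokens.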
Type System
Normally I would use a very simplified type system for an interpreted language, but Temple of Odin is meant to interop with Odin types, so it inherits Odin's (kind of) strict type system. That being said, a lot of type checking still needs to happen during execution. ToO has no way to know if some expressions are valid until the template is supplied with a seed value.
I am @($.current_year - $.year_born) years old. I sure hope year_born and current_year are the same numeric type!
I employ some tricks to make sure I don't have too much repeated code for all the different types. For example, all integers are 64-bit and unsigned under the hood. That is, until they need to do something where being signed actually matters. For integers under 64 bits, there's some additional logic that properly masks off the upper bits.
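A small Go sketch of the masking trick, using a hypothetical 8-bit case: arithmetic happens in the wide unsigned type, and masking the result down to the declared width keeps wraparound behavior correct:

```go
package main

import "fmt"

// Do the arithmetic in the wide 64-bit unsigned type, then mask the
// result down to the declared 8-bit width so overflow wraps correctly.
func addU8AsU64(a, b uint64) uint64 {
	const mask = (1 << 8) - 1 // keep only the low 8 bits
	return (a + b) & mask
}

func main() {
	// 250 + 10 overflows an 8-bit integer: 260 & 0xFF == 4
	fmt.Println(addU8AsU64(250, 10)) // 4
}
```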