A Tiny Compiler in Zig

Oct 28, 2024

I love learning new stuff. This year, I picked up Kotlin to write a compiler in a university compiler design course because I didn't want to do another project in Java. Don't get me wrong, Java is great with its vast ecosystem and magical JVM, but I like to pick up new things where I can. Over the summer, I had also used Go quite a bit upon learning more about its design philosophies.

Recently, after following its developments for a while, I decided to try Zig in a moderately sized project. I found quite a lot of satisfaction during the compiler design course I had attended, so I made up a new tiny language and wrote a compiler in Zig. For code generation, I generate Python code just for simplicity.

Goals:

  • Learn Zig syntax and semantics.
  • What are Zig's strengths?
  • Build something fun.

In this piece, I'm going to use my tiny compiler to highlight some things I learnt that were interesting, as well as things I will miss deeply as I work in other languages.

Brief introduction to Zig

Zig iguana mascot.

"Zig is a general-purpose programming language and toolchain for maintaining robust, optimal, and reusable software."

I'm not a systems programmer by any means. I've used C/C++ before and I've also dabbled in Rust, but I find myself using Python or Go for most tasks due to simplicity. Zig is an interesting language though, as it has a rather simple language, while also providing control over memory and other powerful features.

One highlight feature would have to be comptime, which is Zig's approach to metaprogramming, offering compile time code execution and evaluation.

Another standout feature is its interop with C/C++, allowing Zig to use C libraries (and vice versa!) with ease. It can also be used as a replacement C/C++ compiler providing cross-compilation and a build system that replaces CMake.

All of this while providing granular control over memory allocation sounds too good to be true! After reading good things from a few projects developed in Zig such as the JavaScript runtime/toolkit Bun, or the terminal emulator Ghostty developed by Mitchell Hashimoto (Hashicorp), I knew I had to at least explore it briefly.

Syntax / Semantics

Memory control

Memory control is rather interesting in Zig. In C, you have allocations through malloc and free. Zig has explicit allocator structs that are passed to any functions that need to allocate memory.

The tiny compiler parser needs to create AST nodes as tokens are parsed from the lexer. These are allocations that need to be made during execution, so the parser accepts an allocator as argument and uses it throughout its parse method.

There are many different types of allocators depending on the context: FixedBufferAllocator, ThreadSafeAllocator, c_allocator, GeneralPurposeAllocator. In my case, the ArenaAllocator makes the most sense. The ArenaAllocator will free all allocations in the arena at once. If I encounter an error during creation of the AST, I don't need any part of it, so I can free all allocations made with it, report the error encountered, and abort execution without memory leaks.

Comptime

Comptime is Zig's implementation of metaprogramming, allowing developers to leverage its strengths without much of the complexity. For a more detailed overview of Comptime, check out this article by Loris Cro, member of the Zig Software Foundation. There are are really interesting things discussed such as compile-time code elision: flattening a for loop with an inner switch-case into a linear sequence of operations using compile-time known parameters.

In the tiny compiler, I used comptime to generate the string representation of the TokenType enum:

pub fn toString(self: TokenType) []const u8 {
    const tokenStrs = comptime blk: {
        const Type = @This();
        var result: [std.enums.values(Type).len][]const u8 = undefined;
        for (std.enums.values(Type), 0..) |tag, i| {
            result[i] = switch (tag) {
                .Plus => "+",
                .Minus => "-",
                .Star => "*",
                .Slash => "/",
                .GreaterThan => ">",
                .LessThan => "<",
                .GreaterThanEqual => ">=",
                .LessThanEqual => "<=",
                .Equal => "=",
                .EqualEqual => "==",
                .LeftParen => "(",
                .RightParen => ")",
                else => @tagName(tag),
            };
        }
        break :blk result;
    };
    return tokenStrs[@intFromEnum(self)];
}

Notice that tokenStrs is labelled using comptime on line 2, and the result is an array of enum string representations, either a word -> symbol conversion or just the enum tag name such as Identifier. Within the compiled binary, this function will only include the pre-computed array and a simple array lookup. Without using comptime, I would need to have included a switch-case in the binary and perform string conversion at runtime. This is a free performance boost and demonstrates a fraction of what comptime is capable of.

Error handling

Handling errors in Zig is pretty painless. Return types are specified after a functions arguments. In the comptime example, the return type is []const u8, an array of bytes, which is how Zig represents the String type.

To declare that a function might throw an error, a ! would be prepended to the return type. To call a function that might throw an error, you would prepend your function call with try.

For example, opening a file and determining its size:

pub fn main() !void {
    const file = try std.fs.cwd().openFile("fileName", .{});
    const file_size = try file.getEndPos();
}

Since I've used the try keyword to call a function that may return an error, my main function now returns !void.

By default, errors are inferred in the anyerror type which is the global error set. We can however explicitly define our own error set and return that as a subset of the global error set.

For example, in the compiler I've defined a CompilerError error set as follows:

pub const CompilerError = error{
    // Lexer
    Overflow, UnexpectedCharacter, InvalidCharacter,

    // Parser
    UnexpectedToken, ExpectedIdentifier, InvalidExpression,

    // Type checking
    UndefinedVariable,

    // General
    OutOfMemory, FileNotFound,
};

Using this error set, I've enforced that any method returning errors using CompilerError must be in this set. The lexer can return CompilerError.UnexpectedCharacter when scanning characters, the parser can return CompilerError.UnexpectedToken when parsing tokens, etc.

Using try defers the handling of the error to the caller. You can instead catch and handle it immediately.

const ast_root = parser.parse() catch |err| {
    std.debug.print("Compilation failed while parsing. Aborting.\n", .{});
    // 'err' is of type CompilerError.
    // A switch-case over the CompilerError type can provide targeted error messages.
    // Zig switch-cases are exhaustive. All options must be acknowledged
    // or handled with a default clause.
    return;
};

Compiler things learnt

Aside from learning a bit of Zig, I also learnt about alternative ways to handle parts of the compiler than what I had done in my university course.

One optimization implemented was to maintain two pointers in the source string for the start and end of the current token being parsed. Previously, I had built a string as valid characters were scanned. Looking back now, this leads to useless allocations as strings are immutable so every append creates a new string.

Another sanity decision made was using recursive descent rather than a table-driven parser. The table-driven parser requires a large table that acts similar to a finite-state-machine. At a state, it can accept a character and transition to a new set of states that must follow. The benefit of this approach is simpler parser code since it's just a loop reading tokens and making sure it matches whats on top of the stack of valid transitions. The downside is that little issues are hellish to debug since the transition logic is in the CSV table. With recursive descent, this process is instead done in the implementation language where each 'state' is a method which handles the specific type of statement such as a print statement.

// Example: print x+1;
fn parsePrintStatement(self: *Parser) CompilerError!ASTNode {
    // Move to next token
    try self.advance();
    // Allocation with member ArenaAllocator. Can return CompilerError.OutOfMemory.
    const expr = try self.allocator.create(ASTNode);
    // Destroy allocation if any errors occur in this method after its execution.
    errdefer self.allocator.destroy(expr);
    // Try parsing the expression being printed, can return a CompilerError.
    expr.* = try self.parseExpression();

    // Semicolon is expected to end this statement.
    if (self.current_token.type != .Semicolon) {
        return self.reportError(CompilerError.UnexpectedToken, "Expected ';'.", "Semicolon");
    }
    // Move past the semicolon
    try self.advance();

    return ASTNode{ .PrintStmt = expr };
}

This made debugging a lot easier and made me wish I had used recursive descent during my university course to avoid headaches.

Zig iguana mascot.

Final thoughts

Zig is one of the more interesting languages I've come across, and I think it's primarily due to its emphasis on being a small, simple, yet powerful language. This is the same reason I appreciate Go so much. It gives me a lot of confidence to be able to open up source files and easily be able to get an idea of what's going on. Try doing that with C++ templates or async Rust.

Unfortunately, it hasn't reached version 1.0 yet and there are quite a few issues to iron out. There is a lot of work going on to move away from LLVM which would provide the opportunity for incremental compilation, and async/await has been removed until the compiler has advanced far enough. Zig has a bright future if it can reach its potential in the near future.

All in all, it was a fun project. I gained increased interest in compilers and a lot of respect for Zig as a language and community.