One thing I’ve never really messed with before is the [[clang::musttail]] compiler attribute. A tail call is when you exit a function by calling another function as the very last thing the function does. For example:

int foo(float x);

int bar(float x) {
  x += 42.0f;
  return foo(x);
}

In the above example, the call to foo is a tail call within the function bar. Compilers can take advantage of this by doing something called tail-call optimization: instead of pushing a new stack frame, the compiler turns the call into a jump. Among other things, this means you can have arbitrarily deep call chains that won’t cause stack overflows.

Clang has a cool compiler attribute [[clang::musttail]] that you can place on a return statement to force a tail call - that is, it becomes a compiler error if the compiler could not perform a tail call. For example, calling a function whose signature differs from the caller’s would not work.

int foo(float x);

int bar(float x) {
  x += 42.0f;
  [[clang::musttail]] return foo(x);
}

Another fun little thing I’ve been toying with is using lookup tables - in particular jump tables. Jump tables are arrays of function pointers, where indexing into the array results in you jumping to some other function to execute.

So an idea came to me - could I use tail calls in combination with jump tables to create a tokenizer for a toy language? A tokenizer with zero branches?

Turns out - yes you can! I’ve put up a gist of it here.

An example program in the toy language would be:

# Some cool comment!
some_func := (a : i32, b : f32) : f64 {
  x := a + 42;
  x += 13;
  r := 0;
  for i in range(0, x) {
    if b > 0.0e0 {
      r *= b;
    }
  }
  return r;
}

The tokenizer is basically a number of different jump tables, each 256 entries in size (the number of unique values an 8-bit byte can hold), and works by using each 8-bit input value as the index into a jump table and then doing a tail call through that function pointer - thus not creating a new call frame. Execution enters through the main jump table, and from there filters into sub jump tables depending on whether the character is a digit, a symbol, a comment, etc. Let’s look at parsing the comment from the example program above:

# Some cool comment!

Firstly - the main_jump_table_s class contains, well, the main jump table. And the # entry in that jump table takes you to this function:

static int comment(tokenizer_s *const tokenizer, token_t *out_token) {
  *out_token = {token_type_comment, tokenizer->current, 1, tokenizer->line,
                tokenizer->column};
  tokenizer->skip_current();
  [[clang::musttail]] return comment_jump_table_s::singleton()
      .jt[tokenizer->data[tokenizer->current]](tokenizer, out_token);
}

This function:

  • Sets the token type to be a comment.
  • Stores the locations of the start of the token.
  • Bumps the tokenizer beyond the # character.
  • And then does a tail call into the comment_jump_table_s, using the next character in the input data stream for the index.

struct comment_jump_table_s final
    : public tokenizer_jump_table_s<comment_jump_table_s> {
  static int newline(tokenizer_s *const tokenizer, token_t *out_token) {
    tokenizer->current += 1;
    tokenizer->line += 1;
    tokenizer->column = 0;
    return 0;
  }

  static int keep_going(tokenizer_s *const tokenizer, token_t *out_token) {
    tokenizer->skip_current();
    out_token->length += 1;
    [[clang::musttail]] return comment_jump_table_s::singleton()
        .jt[tokenizer->data[tokenizer->current]](tokenizer, out_token);
  }

  constexpr comment_jump_table_s() : Super(&keep_going) {
    jt['\n'] = &newline;
    jt['\0'] = &return_zero;
  }
};

Then in the comment_jump_table_s we initialize the jump table by default to keep_going (see the Super constructor), and specify that for the two exit conditions we should call their respective functions (either newline, or end of string).

Through these various jump tables we can build up complex tokenization patterns without a complex soup of branches, switches, and loops. I’m actually quite impressed at how elegant this tokenizer turned out to be - just 742 lines of code.

And what does the codegen look like?

<_ZN17main_jump_table_s7commentEP11tokenizer_sP7token_s>:
;     *out_token = {token_type_comment, tokenizer->current, 1, tokenizer->line,
mov eax, dword ptr [rdi + 0x8]
mov dword ptr [rsi], 0x1
mov rcx, qword ptr [rdi + 0xc]
mov dword ptr [rsi + 0x4], eax
mov dword ptr [rsi + 0x8], 0x1
mov qword ptr [rsi + 0xc], rcx
;     current += 1;
mov eax, dword ptr [rdi + 0x8]
inc eax
mov dword ptr [rdi + 0x8], eax
;     column += 1;
inc dword ptr [rdi + 0x10]
;         .jt[tokenizer->data[tokenizer->current]](tokenizer, out_token);
mov rcx, qword ptr [rdi]
movzx eax, byte ptr [rcx + rax]
;     [[clang::musttail]] return comment_jump_table_s::singleton()
lea rcx, [rip]              # 0x34 <_ZN17main_jump_table_s7commentEP11tokenizer_sP7token_s+0x34>
jmp qword ptr [rcx + 8*rax]

And that’s not too shabby. I could probably change the function signature to pass the pieces of the tokenizer and the output token down individually rather than as pointer arguments, thus avoiding the indirect loads and stores through those pointers. But overall I’m quite happy that the whole thing worked out as nicely as it did.