Cleaning up the parser (MPC -> LLVM for the Neil Language #2)
In my previous post Hooking up MPC & LLVM I had started hooking up MPC to LLVM, to start developing a custom language I’m calling Neil - Not Exactly an Intermediate Language.
Since then, I’ve tried to rationalise the various hacks I had to do on the original version just to get something working, and actually try and have a clean codebase for generating LLVM from our MPC grammar. Check out the GitHub repository here if you want to follow the progress.
I decided to wrap the logic into a C++ class called ASTLowering - this will encapsulate an entire invocation of MPC -> LLVM, and hold some structures for the symbol table, type table, the LLVM IRBuilder we are using, and the current function we are generating a body for. I did this just to try and keep down the number of parameters I’m passing down between the recursive function calls we’ll need to use to walk the AST and output our LLVM IR.
So next up - let us cover a bunch of the core concepts we need to parse our language to LLVM IR.
LLVM Values & Types⌗
Everything (near enough) in LLVM IR derives from the LLVM Value base class - in the C API this becomes LLVMValueRef. Functions, function Arguments, Instructions, Constants - they all derive from the LLVM Value class. It’s important to note that most of the things we’ll be dealing with are these values.
And everything that is an LLVM Value has an corresponding LLVM Type - Function Types, Integer Types, Pointer Types, etc, etc - we can query any value in LLVM to get its corresponding type.
Symbol Table⌗
First up is the symbol table, specifically what is a symbol table in the context of our parser?
The symbol table is just a map of an identifier (the name of the variable/function) to the LLVM Value that represents the variable/function in the IR. For instance, lets say we had this function:
As we are parsing, when we encounter the function ‘foo’ in our AST, we add an entry in our symbol table to record that there is an LLVM Value (in this case, the LLVM Function) whose name is ‘foo’. We also add in, only within the scope of the body of the function, that there is a symbol called ‘a’, whose LLVM Value is the first LLVM Argument of the function ‘foo'.
And that is it!
Type Table⌗
We keep a type table - a table of all known types that are being used and are valid in the program. At present, we just hardcode all the supported types we have - i1, i8, i16, i32, i64, u8, u16, u32, u64, f16, f32 and f64. The identifiers of each of these types in the Neil language (for example ‘u8’) is mapped to the corresponding LLVM Type - in the C API called LLVMTypeRef. Then, when we are parsing a type identifier in the AST (for example, the return type of a function definition), we simply look up the string name of the type in the type table, and use the corresponding LLVM Type when creating the value.
How our Parsing Works⌗
Let’s take a simple example of our Neil language:
We feed this to our mpc parser for the Neil language, which produces us the following AST:
Our parser starts at the root MPC node (always called ‘>’), and then looks for any functions defined in the source file. If we find a function, we’ll have a corresponding ‘procedure’ node in the AST.
We find the first procedure, which has the declaration:
The steps we follow to parse this are:
- A procedure always starts with a type. We lookup the contents of the first child of the procedure in our type table, which returns us the LLVM type for a 32 bit integer.
- The function has a identifier - the name of the function. We remember the second child’s contents is the name of the function.
- The third child is the opening ‘(‘ for any arguments to the function.
- While the next child is not the closing ‘)’ for the arguments to the function, we look for a type-identifier.
- The first child of a type-identifier is the type, which we again lookup in our type table, and store this into a vector.
- The second child of a type-identifier is the name for the argument of the type. We store this name into a vector.
- Once we reach the closing ‘)’ for the arguments, we can now create the LLVM Function to populate the symbol table with.
To create an LLVM Function we first need the type of the function - we need the return type, and an array of the argument types. With these (which we just parsed above), we can create the function type. Once we have the function type, we combine this with the name of the function we discovered in the second child of our procedure, and create the new LLVM Function. The last thing we need to do to finalise the declaration of our function is name each of the arguments to the function. We run through the arguments of the function calling LLVMSetValueName() using the vector of names we remembered when parsing our type-identifiers earlier.
And huzzah! We have our function declaration.
Next up, we parse the body of the function:
The function body consists of:
- The first child is the opening ‘{‘ for the body of the function.
- While the next child is not the closing ‘}’ for the body of the function, we loop through all of the statement AST nodes for the body of the function.
- Within the statement AST node, we support two kinds of statements at present - return statements, and function call statements. In the example showing we find that:
- We have a ‘return’ as the first child to the statement (thus it is a return statement type).
- If the second child is not ‘;’, we are returning a value from the function, in the form of an lexp AST node.
- The first child of an lexp AST node is always a term AST node.
- The first child of a term AST node is always a factor. In the example above we can see we have a ‘factor|ident|regex’ named node - this highlights a cool feature of MPC I’ll talk about later*.
- We parse the factor-that-is-an-identifier, which is the identifier ‘x’. We look this up in the symbol table, finding the argument to our function called ‘x’, and use that LLVM Value as the value for the factor.
- If there is a second child of a term AST node, it is either multiply ‘*’, divide ‘/’ or remainder ‘%’, in our example it is a multiply.
- We then parse the third child (if there was a second, there must be a third node according to the rules of our grammar), which is also a factor AST node.
- We parse this factor-that-is-a-literal, which is the constant ‘5’. Now, in our Neil language, constants don’t inherently have a type - we infer the type from the closest typed expression. So in this instance, we know that the type of the left-hand-side of our multiplication operation is the type ‘i32’, so our constant gets parsed into an LLVM Value that is the constant ‘5’ that is a 32 bit integer.
- The first child of a term AST node is always a factor. In the example above we can see we have a ‘factor|ident|regex’ named node - this highlights a cool feature of MPC I’ll talk about later*.
- if there is a second child of a lexp AST node, it is either addition ‘+’ or subtraction ‘-‘, in our example it is an addition.
- We then parse the third child of the lexp AST node (if there was a second child, there must be a third child), which is a term AST node:
- The term in this case is a term-that-is-a-factor-that-is-a-literal, which is the constant ‘42’. Again, our constant is un-typed, so we infer that the result of the left-hand-side of our addition has the type ‘i32’, and use that as the type for this constant.
- The first child of an lexp AST node is always a term AST node.
- The last child of the statement must be a ‘;’ to close the statement.
- And the last child of the procedure must be a ‘}’ to close the function.
And that is it! We have parsed our function, and we get the result in LLVM IR of:
Cool MPC Feature*⌗
When going through how we parse, I eluded to a cool MPC feature that we’d have to come back to - I like to call them ‘multi-type-AST-nodes’. If a node in the AST has exactly one child, MPC will fold the nodes together. A great example in the above AST is:
Which is a node that is a term -> factor -> literal -> regex! This means that we have to do a little more complex parsing when we look at the ‘tag’ of each AST node (see ‘lower_ast_node’ in the GitHub for Neil) - we can’t just say ‘this node is a term’, we have to look at the entire tag, and work backwards to work out how we should be parsing the value. While it increases the complexity of the parsing marginally, you can see why this would be a huge win if we had a super large input source file being parsed by MPC - we are using one node in place of four, a 25% saving in memory usage for the MPC format!
What is Next?⌗
So we’ve got basic parsing support into our language, and we can make simple programs. But we don’t have some really basic features that make languages actually usable yet. In the next post, we’ll add type identifiers, so we can declare variables within our function bodies. Stay tuned!