How to read & write LLVM bitcode

I’ve now read multiple posts on social media complaining about how scary LLVM is to get to grips with. It doesn’t help that the repository is ginormous, there are frequently hundreds of commits a day, the mailing list is nearly impossible to keep track of, and the resulting executables are topping 40MB now…

Those tidbits aside, LLVM is super easy to work with once you get to grips with the beast that it is. To help people get started, I thought I’d put together the most trivial no-op example you can do with LLVM: parsing one of LLVM’s intermediate representation files (known as bitcode, file extension .bc) and then writing it back out.

Firstly, let’s go through some high-level LLVM terms:

LLVM’s main abstraction for user code is the Module. It’s a class that contains all the functions, global variables, and instructions for the code you or other users write.
Bitcode files are effectively a serialization of an LLVM Module such that it can be reconstructed in a different program later.
LLVM uses MemoryBuffer objects to handle data that comes from files, stdin, or arrays.

For my example, we’ll use the LLVM C API, a more stable abstraction on top of LLVM’s core C++ headers. The C API is really useful if you’ve got code that needs to work with multiple versions of LLVM, because it’s significantly more stable than the LLVM C++ headers. (As an aside, I use LLVM extensively for my job and nearly every week some LLVM C++ header change will break our code. I’ve never had the C API break my code.)

First off, I’m going to assume you’ve pulled LLVM, built and installed it. Some simple steps to do this:

After doing the above, you’ll have an LLVM install in /build/install!

So for our little executable I’ve used CMake. CMake is by far the easiest way to integrate with LLVM as it is the build system LLVM also uses.

So now that we’ve got our CMake setup and can use our existing LLVM install, we can get working on our actual C code!

So to use the LLVM C API there is one header you basically always need, llvm-c/Core.h:

And the two extra headers we need for our executable are the bitcode reader and writer, llvm-c/BitReader.h and llvm-c/BitWriter.h:

Now we create our main function. I’m assuming here that we always take exactly 2 command line arguments, the first being the input file, the second being the output file. LLVM has a convention whereby a file named ‘-’ means read from stdin or write to stdout, so I decided to support that too:
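Here is a minimal sketch of that command line handling in plain C, with no LLVM calls yet. The helper names is_stdio_name and parse_args are made up for illustration, as is the usage message; a real main could just as easily do these checks inline.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 when `name` is exactly "-", which we treat
   as stdin (for input) or stdout (for output), and 0 otherwise. */
int is_stdio_name(const char *name) {
  return strcmp(name, "-") == 0;
}

/* Hypothetical helper: expects the program name plus exactly two arguments.
   On success stores the two filenames and returns 0; otherwise prints a
   usage line to stderr and returns 1. */
int parse_args(int argc, const char *const argv[],
               const char **input, const char **output) {
  if (argc != 3) {
    fprintf(stderr, "usage: %s <input.bc|-> <output.bc|->\n",
            argc > 0 ? argv[0] : "program");
    return 1;
  }
  *input = argv[1];
  *output = argv[2];
  return 0;
}
```

Comparing against the whole string "-" (rather than just the first character) means a file that merely starts with a dash is still treated as a regular path.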

So first we parse the input file. We’ll create an LLVM memory buffer object from either stdin, or a filename:

So after this code, memoryBuffer will be usable to read our bitcode file into an LLVM module. So let’s create the module!

Once we’ve got our module, we no longer need the memory buffer, so we can free up the memory straight away.

And that’s it! We’ve managed to take an LLVM bitcode file and deserialize it into an LLVM module, which we could fiddle with (though I’m not going to in this blog post at least!). So let’s assume you’ve done all you wanted with the LLVM module, and want to write the sucker back out to a bitcode file again.

The approach mirrors the one we used for reading. We look for the special filename ‘-’ and handle it accordingly:

Lastly, we should be good citizens and clean up our garbage, so we delete the module too:

And that’s it! You are now able to parse and then write an LLVM bitcode file. I’ve put the full example up here on GitHub - https://github.com/sheredom/llvm_bc_parsing_example.

Maybe I’ll do a follow-up post at some point taking you through the basics of how to do things to an LLVM module, but for now, adieu!