Thank you to the uploader ZOMI-chan: https://space.bilibili.com/517221395
Comparison of LLVM IR and GCC IR#
Feature | LLVM IR | GCC IR (GIMPLE) |
---|---|---|
Independence and Library Architecture | Highly modular, front-end and back-end separation, easy to add new languages and target platforms | Traditional GCC architecture, tighter coupling between front-end and back-end |
Representation | Human-readable assembly form, C++ object form, serialized bitcode form | GIMPLE representation, three-address code, SSA form |
Design and Application | More independent, reusable in tools outside the compiler, formally defined with a good C++ API, closer to hardware behavior | Reduces control flow complexity, optimization is relatively easy |
Applicable Scenarios | Suitable for academic applications, as it has been significantly simplified, allowing for faster results | Suitable for industrial applications, can generate a unified AST for data flow analysis, or generate three-address code similar to GIMPLE for analysis |
Advantages of LLVM IR#
- More independent: LLVM IR is designed to be reusable in any tool outside the compiler, making it easy to integrate other types of tools, such as static analyzers and instrumentation tools.
- More formal definition and better C++ API: This makes handling, transforming, and analyzing much easier.
- Closer to hardware behavior: LLVM IR provides a simulated instruction set similar to RISC-V and a strong type system, achieving its goal of being a "universal representation."
Advantages of GIMPLE#
- Reduces control flow complexity: GIMPLE makes optimization relatively easy by reducing control flow complexity, using three-address representation, and limiting syntax.
LLVM Architecture Design#
LLVM Architecture Diagram:
Analysis of LLVM Core Process#
The compiler front-end workflow includes lexical, syntactic, and semantic analysis; the intermediate optimization layer runs a large number of optimization Passes over the IR; the compiler back-end workflow includes machine instruction selection, register allocation, and instruction scheduling.
Practice: Clang Compilation Process#
- Generate .i file
clang -E .\hello.c -o .\hello.i
- Convert the preprocessed .i file to .bc file
clang -emit-llvm .\hello.i -c -o .\hello.bc
clang -emit-llvm .\hello.c -S -o .\hello.ll
- Use llc and lld linker
llc .\hello.ll -o .\hello.s
llc .\hello.bc -o .\hello2.s
- Convert to executable binary file
clang .\hello.s -o hello
- View compilation process
clang -ccc-print-phases .\hello.c
Summary#
Interaction between LLVM components occurs at a high level of abstraction, with different components isolated as separate program libraries, making it easy to integrate transformations and optimization Passes throughout the compilation pipeline. It is now used as a universal infrastructure for implementing various static and runtime compiled languages.
Detailed Explanation of LLVM IR#
Design Philosophy of LLVM IR#
LLVM IR adopts a Static Single Assignment (SSA) form, which has two important characteristics: each variable is defined before it is used, and each variable is assigned exactly once.
SSA Static Single Assignment#
In LLVM IR, each variable must be defined before use, and each variable can only be assigned once. For example, the expression 1 * 2 + 3 is split into single-assignment steps: one temporary receives the product 1 * 2, and a second temporary receives that product plus 3.
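The expression above can be sketched in LLVM IR's SSA form as follows (illustrative only; a real compiler would constant-fold this expression to 5 before emitting code):

```llvm
%1 = mul i32 1, 2   ; first temporary: assigned exactly once
%2 = add i32 %1, 3  ; second temporary holds the final result
```

Each temporary register (`%1`, `%2`) receives exactly one assignment, which is what makes the form "static single assignment."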
Basic Syntax of LLVM IR#
- LLVM IR is a low-level virtual instruction set similar to RISC;
- Like real RISC, it supports a linear sequence of simple instructions, such as addition, subtraction, comparison, and branching;
- Instructions are all in three-address form, accepting a certain number of inputs and storing computation results in different registers;
- Unlike most RISC instruction sets, LLVM IR uses a simple, strongly typed type system and abstracts away machine differences;
- LLVM IR does not use fixed-named registers; it uses temporary registers named with the % character;
Each three-address code instruction can be decomposed into a four-tuple: (operator, operand1, operand2, result). Because each instruction involves at most three addresses (two operands and one result), this representation is called three-address code.
Instruction Type | Instruction Form | Four-Tuple Representation |
---|---|---|
Assignment Instruction | z = x op y (z = x + y) | (op, x, y, z) |
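As a sketch of this decomposition, the snippet below represents `a = b + c * d` as four-tuples and interprets them (the helper `run_quads` and the operand names are hypothetical, for illustration only):

```python
def run_quads(quads, env):
    """Interpret (operator, operand1, operand2, result) four-tuples."""
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    for op, a1, a2, res in quads:
        env[res] = ops[op](env[a1], env[a2])
    return env

# a = b + c * d decomposed into three-address code; t1 is a
# compiler-generated temporary.
quads = [
    ("mul", "c", "d", "t1"),  # t1 = c * d
    ("add", "b", "t1", "a"),  # a  = b + t1
]
env = run_quads(quads, {"b": 1, "c": 2, "d": 3})
print(env["a"])  # 7
```

Note that the temporary `t1` is exactly the kind of single-assignment intermediate that SSA form makes explicit.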
LLVM IR Memory Model#
The basic unit of an LLVM IR file is called a module. A module can contain multiple top-level entities, such as functions and global variables. A function definition must have at least one basic block. Each basic block contains several instructions and ends with a terminator instruction.
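A minimal sketch of this layout, using a hypothetical `@abs` function: the module contains one function definition, the function contains three basic blocks, and every basic block ends with a terminator instruction (`br` or `ret`):

```llvm
define i32 @abs(i32 %x) {
entry:
  %isneg = icmp slt i32 %x, 0
  br i1 %isneg, label %neg, label %done   ; terminator of entry
neg:
  %0 = sub i32 0, %x
  ret i32 %0                              ; terminator of neg
done:
  ret i32 %x                              ; terminator of done
}
```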
Class Name | Description |
---|---|
Module | The Module class aggregates all data used in the entire translation unit; it is synonymous with "module" in LLVM terminology. It declares the Module::iterator typedef as a convenient way to iterate over the functions in this module. You can obtain these iterators using the begin() and end() methods. |
Function | The Function class contains all objects related to function definitions and declarations. For declarations (checked with isDeclaration() to see if it is a declaration), it only contains the function prototype. Regardless of whether it is a definition or declaration, it contains a list of function parameters, which can be accessed via the getArgumentList() method or the arg_begin() and arg_end() methods. You can iterate over them using the Function::arg_iterator typedef. If the Function object represents a function definition, you can traverse its contents with a statement like: for (Function::iterator i = function.begin(), e = function.end(); i != e; ++i), which will iterate over its basic blocks. |
BasicBlock | The BasicBlock class encapsulates a sequence of LLVM instructions, which can be accessed via begin()/end(). You can directly access its last instruction using the getTerminator() method, and you can use some helper functions to traverse the CFG, such as accessing the predecessor basic block with getSinglePredecessor() when a basic block has a single predecessor. However, if it has multiple predecessor basic blocks, you need to traverse the predecessor list yourself, which is not difficult; just iterate through the basic blocks and check their terminator instruction's target basic block. |
Instruction | The Instruction class represents an atomic operation in LLVM IR, a single instruction. High-level assertions can be obtained using methods like isAssociative(), isCommutative(), isIdempotent(), and isTerminator(), but its precise functionality can be known through getOpcode(), which returns a member of the llvm::Instruction enumeration representing the LLVM IR opcode. Its operands can be accessed using the op_begin() and op_end() methods, inherited from the User superclass. |
The most important concepts in the LLVM IR memory model: Value, Use, User
In the LLVM IR memory model, Value, Use, and User are three core concepts, and their relationships define the data flow and control flow in LLVM.
Concept | Description |
---|---|
Value | In LLVM, Value is a very basic concept that represents any entity with a value, such as constants, variables, functions, etc. Each Value has a unique number used to identify itself within LLVM. A Value can also have users (User), meaning it can be an operand of other instructions. |
Use | Use is an instance of a Value being used. In LLVM, each Value has one or more Uses, indicating which instructions use this Value. A Use contains a pointer to the User that uses the Value and the operand index within that User. |
User | User refers to instructions or constants that use a Value. For example, an instruction may have multiple operands, each of which is a Value, making that instruction a User. Users reference their operand Values through Use objects. |
These three concepts together form the memory model of LLVM IR, and their relationships reflect the data dependencies between instructions. In LLVM's optimization process, these concepts are crucial for analyzing and managing dependencies between instructions.
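The relationship can be sketched with a toy Python model (a conceptual illustration only; the real classes are C++ and considerably richer than this):

```python
# Toy model of LLVM's Value / Use / User relationship.
class Value:
    def __init__(self, name):
        self.name = name
        self.uses = []            # every Use records who consumes this Value

class Use:
    def __init__(self, value, user, operand_index):
        self.value, self.user, self.operand_index = value, user, operand_index
        value.uses.append(self)   # register this use on the used Value

class User(Value):                # in LLVM, a User is itself a Value
    def __init__(self, name, operands):
        super().__init__(name)
        self.operands = [Use(v, self, i) for i, v in enumerate(operands)]

x = Value("%x")
y = Value("%y")
add = User("%add", [x, y])        # models: %add = add %x, %y
print([u.user.name for u in x.uses])  # ['%add']
```

Walking `value.uses` gives the def-use chain, while walking `user.operands` gives the use-def chain; this is the bidirectional bookkeeping the optimizer relies on.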
LLVM Frontend and Optimization Layer#
LLVM Frontend#
The compiler front-end transforms source code into the compiler's intermediate representation, LLVM IR.
- Lexical analysis
The first step of the front-end processes the textual input of the source code, breaking down the language structure into a set of words and tokens, removing comments, whitespace, tabs, etc. Each word or token must belong to a subset of the language, and reserved words of the language are transformed into internal representations of the compiler.
$ clang -cc1 -dump-tokens hello.c
- Syntactic analysis
Groups tokens into expressions, statements, and function bodies, checking whether the token sequence is meaningful given the physical layout of the code, without analyzing its semantics. It is similar to grammatical analysis in English: it does not care about what you said, only whether the sentence is well-formed. The output is an abstract syntax tree (AST).
$ clang -fsyntax-only -Xclang -ast-dump hello.c
- Semantic analysis
Uses a symbol table to check that the code does not violate the language's type system. The symbol table stores the mapping between identifiers and their respective types, as well as other content. An intuitive way to check types is to traverse the AST while collecting type information from the symbol table after parsing.
$ clang -c hello.c
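The type-checking idea described above can be sketched as a toy semantic analysis (illustrative only; the symbol table, AST shape, and rules here are hypothetical, not Clang's):

```python
# Walk a tiny AST and verify operand types against a symbol table.
symtab = {"x": "int", "msg": "char*"}   # identifier -> type mapping

def type_of(node):
    if isinstance(node, str):           # leaf: an identifier
        return symtab[node]
    op, lhs, rhs = node                 # interior: a binary expression
    lt, rt = type_of(lhs), type_of(rhs)
    if lt != rt:
        raise TypeError(f"cannot apply {op} to {lt} and {rt}")
    return lt

print(type_of(("+", "x", "x")))         # int
```

A mismatched expression such as `("+", "x", "msg")` raises a TypeError, mirroring how semantic analysis rejects code that violates the type system.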
LLVM Optimization Layer#
Optimization typically consists of analysis Passes and transformation Passes.
- Analysis Pass: Responsible for uncovering properties and optimization opportunities, and for generating the data structures that subsequent Passes consume;
- Transformation Pass: Responsible for actually mutating the IR to apply the optimizations;
opt hello.bc -instcount -time-passes -domtree -o hello-tmp.bc -stats
LLVM Backend Code Generation#
Backend Architecture#
The backend consists of a set of analysis and transformation Passes, whose task is code generation.
Instruction Selection#
In memory, LLVM IR is transformed into target-specific SelectionDAG nodes.
Instruction Scheduling#
The first instruction scheduling pass, also known as pre-register-allocation (pre-RA) scheduling, orders instructions while trying to expose as much instruction-level parallelism as possible; the instructions are then lowered to the MachineInstr three-address representation.
Register Allocation#
Register allocation converts infinite virtual register references into a finite set of target-specific registers. When there are not enough registers, they are spilled to memory.
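The core idea can be sketched as a deliberately naive allocator (a hypothetical greedy scheme; real allocators work on live ranges, e.g. linear scan or graph coloring):

```python
# Map an unbounded set of virtual registers onto k physical registers,
# spilling to memory when none are free.
def allocate(virtual_regs, k):
    phys, spills = {}, []
    free = [f"r{i}" for i in range(k)]
    for v in virtual_regs:
        if free:
            phys[v] = free.pop(0)   # assign the next free physical register
        else:
            spills.append(v)        # no register left: spill to memory
    return phys, spills

phys, spills = allocate(["%a", "%b", "%c"], 2)
print(phys)    # {'%a': 'r0', '%b': 'r1'}
print(spills)  # ['%c']
```

A real allocator would also free registers when a value's live range ends, which is exactly why liveness analysis precedes allocation.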
Instruction Scheduling#
The second instruction scheduling pass, also known as post-register-allocation (post-RA) scheduling, runs once real register information is available; the latencies associated with certain kinds of registers can then be used to improve the instruction order.
Code Emission#
The code emission phase transforms instructions from MachineInstr representation into MCInst instances.