Thanks to the uploader ZOMI-chan: https://space.bilibili.com/517221395
Why Do We Need AI Compilers#
Challenges Faced#
Challenge Category | Description | Example |
---|---|---|
Operator Challenge | An increasing number of new operators are proposed, leading to an exponential rise in the workload for developing, maintaining, optimizing, and testing operator libraries. | 1. Hardware must not only implement new operators but also optimize and test them against the hardware to fully exploit its performance. For example, Convolution operations are typically lowered to GEMM matrix multiplication, and for a newly proposed operator such as Swish, the hardware stack must provide a corresponding Swish implementation (see the sketch below this table). 2. Hardware vendors may also release optimized libraries, but developing and wrapping similar optimized libraries increases the operator-optimization workload and encourages over-reliance on libraries, leaving the capabilities of dedicated hardware chips underutilized. |
Optimization Challenge | The explosion of dedicated acceleration chips has made performance portability a necessity. | 1. Most NPUs use ASICs, which have special instruction optimizations for computation, storage, and data movement in neural network scenarios to enhance the performance of AI-related computations. 2. Different vendors provide various ISAs for XPUs, lacking compilation toolchains like GCC and LLVM, making it difficult to port existing optimized operator libraries and optimization passes for CPU and GPU to NPUs in the short term. |
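To make the Conv-to-GEMM and Swish examples in the table above concrete, here is a minimal NumPy sketch of my own (the helper names `swish`, `im2col`, and `conv2d_as_gemm`, and all shapes, are illustrative assumptions, not anything from the original): the Swish activation as a "new operator" a backend must support, and a 2D convolution lowered to a plain matrix multiplication via im2col.

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish activation: x * sigmoid(beta * x) -- a newer op that hardware/libraries must add.
    return x / (1.0 + np.exp(-beta * x))

def im2col(x, kh, kw):
    # x: (C, H, W) -> matrix of all kh*kw patches, shape (C*kh*kw, out_h*out_w).
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[c, i:i + out_h, j:j + out_w].reshape(-1)
                idx += 1
    return cols

def conv2d_as_gemm(x, w):
    # x: (C, H, W), w: (K, C, kh, kw) -> (K, out_h, out_w); the conv becomes one GEMM.
    K, C, kh, kw = w.shape
    cols = im2col(x, kh, kw)            # (C*kh*kw, out_h*out_w)
    out = w.reshape(K, -1) @ cols       # plain matrix multiplication
    return out.reshape(K, x.shape[1] - kh + 1, x.shape[2] - kw + 1)

# Example: a 3-channel 8x8 input and four 3x3 filters.
x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
y = conv2d_as_gemm(x, w)
print(y.shape, swish(y).shape)          # (4, 6, 6) (4, 6, 6)
```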
Traditional Compilers vs. AI Compilers#
The main differences between traditional compilers and AI compilers are:
- IR Differences: The IR of AI compilers abstracts different concepts and meanings compared to traditional compilers.
  - AI compilers generally have a high-level IR to abstractly describe operations in deep learning models, such as Convolution, Matmul, etc., and some may even carry graph structures associated with models such as Transformers.
  - Traditional compilers have a relatively low-level IR used to describe basic instruction operations, such as load and store. With a high-level IR, AI compilers can more conveniently describe the DSLs used for deep learning models.
- Optimization Strategies: AI compilers focus on the AI domain and bring more domain-specific knowledge into optimization, enabling higher-level and more aggressive optimizations. For example:
  - AI compilers perform operator fusion on the high-level IR, whereas traditional compilers tend to be more conservative when performing similar loop fusion. The downside is that this can make it harder to trace and debug execution information.
  - AI compilers can reduce computation precision, for example to int8, fp16, or bf16, because deep learning is less sensitive to numerical precision, whereas traditional compilers generally do not perform optimizations that change variable types and precision.
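As a tiny illustration of the precision-reduction point above (a sketch of my own, not from the original): casting a matrix multiplication to fp16 changes the result only slightly, which is why AI compilers can often trade precision for speed, while a traditional compiler would never silently change a variable's type.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)

ref = a @ b                                                       # fp32 reference
low = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

rel_err = np.abs(ref - low).max() / np.abs(ref).max()
print(f"max relative error of fp16 matmul: {rel_err:.2e}")        # roughly 1e-3 to 1e-2
```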
Development Stages of AI Compilers#
Inference Scenario: Input AI framework-trained model files, output programs that can be efficiently executed on different hardware;
Training Scenario: Input neural network code represented in high-level languages, output programs that can be efficiently executed on different hardware;
- What is a training scenario? What is an inference scenario?
- Why is it necessary to understand algorithms when working on AI compilers?
- Why is it necessary to understand compilers when working on AI operators?
What is an AI Compiler#
- A dynamic interpreted language frontend primarily based on Python
- Multi-layer IR design, including graph compilation, operator compilation, and code generation
- Specific optimizations for neural networks and deep learning
- Support for DSA chip architectures
Development Stages of AI Compilers#
Stage I: Naive AI Compiler#
Early versions of TensorFlow, built on a neural-network programming model, mainly provide two layers of abstraction: the graph layer and the ops layer.
- Graph Layer: Executes in a declarative programming manner using static graphs, performing hardware-independent and hardware-dependent compilation optimizations before execution. Hardware-independent optimizations include expression simplification, constant folding, automatic differentiation, etc.; hardware-dependent optimizations include operator fusion, memory allocation, etc.
- Operator Layer: Typically implemented with handwritten kernels, such as a large number of .cu operators written as CUDA kernels on NVIDIA GPUs, or by relying on optimized libraries such as cuDNN.
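A toy sketch (entirely my own; the names OP_LIBRARY, graph, and run are illustrative) of this two-level abstraction: the "operator layer" is a library of handwritten kernels, here just NumPy lambdas standing in for CUDA/cuDNN kernels, and the "graph layer" is a declarative static graph that is built first and executed afterwards.

```python
import numpy as np

# Operator layer: a toy library of "handwritten kernels".
OP_LIBRARY = {
    "matmul": lambda a, b: a @ b,
    "add":    lambda a, b: a + b,
    "relu":   lambda a: np.maximum(a, 0),
}

# Graph layer: a declarative static graph, each node = (output_name, op_name, input_names).
graph = [
    ("z", "matmul", ["x", "w"]),
    ("h", "add",    ["z", "b"]),
    ("y", "relu",   ["h"]),
]

def run(graph, feeds):
    # Execute the static graph in (already topological) order, dispatching to the op library.
    env = dict(feeds)
    for out, op, ins in graph:
        env[out] = OP_LIBRARY[op](*[env[i] for i in ins])
    return env

feeds = {"x": np.ones((2, 3)), "w": np.ones((3, 4)), "b": np.zeros(4)}
print(run(graph, feeds)["y"])
```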
In terms of expression:
- The expression of static graphs is not native Python; developers must use the Python APIs provided by the framework to explicitly construct the graph, which is not user-friendly.
In terms of performance:
- The emergence of DSA dedicated acceleration chips has intensified performance challenges.
- Once the operator granularity and boundaries provided by the operator layer are determined, it is difficult to fully leverage hardware performance.
- The operator optimization libraries provided by hardware vendors may not be optimal.
  - When the model and its shapes are fixed, better operator implementations may exist.
  - Under SIMT and SIMD architectures, there is significant room for scheduling and tiling optimization.
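A minimal sketch of my own of why scheduling and tiling leave so much performance on the table: the same matrix multiplication can be executed with different tile sizes, and the best tile depends on the cache and register structure of the target hardware (the function name `tiled_matmul` and the sizes are assumptions for illustration).

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    # Same result as a @ b, but iterated in (tile x tile) blocks so each block
    # can stay resident in cache/registers; the best `tile` is hardware-dependent.
    n, k = a.shape
    _, m = b.shape
    c = np.zeros((n, m), dtype=a.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                c[i0:i0+tile, j0:j0+tile] += (
                    a[i0:i0+tile, k0:k0+tile] @ b[k0:k0+tile, j0:j0+tile]
                )
    return c

a = np.random.rand(128, 128).astype(np.float32)
b = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(a, b, tile=32), a @ b, atol=1e-3)
```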
Stage II: Specialized AI Compiler#
In terms of expression:
- The flexible expression API of PyTorch has become a reference benchmark for AI frameworks, and the neural network compiler at the graph layer mainly considers how to convert PyTorch-like expressions into graph layer IR for optimization.
- PyTorch-style native Python expressions are converted into static (graph) form.
- AI-specific compiler architecture opens the boundaries of graphs and operators for fusion optimization.
In terms of performance:
- Opening the boundaries between computation graphs and operators for recombination optimization to leverage chip computing power: operators inside a subgraph of the computation graph are broken into smaller operators, and the subgraph composed of these smaller operators is then compiled and optimized, including buffer fusion, horizontal fusion, etc. The key questions are how to break up large operators and how to recombine small operators (see the fusion sketch after this list).
- Expression Separation: The computation graph layer and operator layer remain separate, with algorithm engineers primarily focusing on the expression of the graph layer, while operator expression and implementation are mainly provided by framework developers and chip vendors.
- Functional Generalization: It is challenging to meet complex requirements for flexible expression, dynamic-static graph conversion, dynamic shape, sparse computation, and distributed parallel optimization.
- Balancing Efficiency and Performance: The implementation of operators lacks automation in scheduling, tiling, and code generation, creating a high barrier, requiring developers to understand operator computation logic and be familiar with hardware architecture.
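A minimal, self-contained sketch of the buffer-fusion idea referenced in the first bullet above (my own illustration, not the original's): three elementwise operators executed separately materialize two intermediate buffers, while the fused expression produces the same result in a single pass. In a real compiler the fused form becomes one generated kernel; here NumPy is only a conceptual stand-in and still allocates temporaries internally.

```python
import numpy as np

def unfused(x, w, b):
    t1 = x * w                     # intermediate buffer 1
    t2 = t1 + b                    # intermediate buffer 2
    return np.maximum(t2, 0)

def fused(x, w, b):
    # Conceptually one fused kernel: the compiler emits a single loop with no
    # intermediate tensors written back to memory.
    return np.maximum(x * w + b, 0)

x, w, b = (np.random.randn(1 << 20) for _ in range(3))
assert np.allclose(unfused(x, w, b), fused(x, w, b))
```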
Stage III: General AI Compiler#
- Unified expression of graphs and operators, achieving fusion optimization.
- Automatic scheduling, tiling, and code generation for operator implementation, lowering development barriers.
- More generalized optimization capabilities, achieving dynamic-static unification, dynamic shape, sparsity, higher-order differentiation, and automatic parallelism.
- Including compilers and runtime, modular representation and combination of heterogeneous computing from edge to data center, focusing on usability.
General Architecture of AI Compilers#
IR Intermediate Representation#
Compilers are mainly divided into frontend and backend, targeting hardware-independent and hardware-dependent processing, respectively. Each part has its own IR and will perform optimizations:
- High-level IR: Used to represent computation graphs, primarily to address the difficulty of expressing complex operations in deep learning models in traditional compilers, and a new set of IR is designed to achieve more efficient optimizations.
- Low-level IR: Can represent models at a finer granularity, allowing for hardware-specific optimizations; such low-level IRs are commonly categorized into three types.
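A rough sketch, with names of my own invention (`TensorType`, `GraphNode`), of what a high-level graph IR node might look like: it carries operator semantics, tensor shapes, and attributes, whereas a low-level IR would describe the same operation as explicit loops with loads, multiply-accumulates, and stores.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TensorType:
    shape: List[int]               # e.g. [32, 3, 224, 224]
    dtype: str = "float32"

@dataclass
class GraphNode:
    """High-level IR: one node per deep-learning operator."""
    op: str                        # e.g. "Conv2D", "MatMul", "Relu"
    inputs: List[str]              # names of producer nodes or graph inputs
    output_type: TensorType
    attrs: dict = field(default_factory=dict)   # e.g. {"strides": [1, 1]}

# A two-node graph: y = Relu(Conv2D(x, w)).
conv = GraphNode("Conv2D", ["x", "w"], TensorType([32, 64, 222, 222]),
                 attrs={"strides": [1, 1], "padding": "VALID"})
relu = GraphNode("Relu", ["conv_out"], TensorType([32, 64, 222, 222]))
print(conv)
print(relu)
```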
Frontend Optimization#
After constructing the computation graph, the frontend will apply graph-level optimizations. Since the graph provides a global overview of the computation, it is easier to discover and execute many optimizations at the graph level. Frontend optimizations are hardware-independent, meaning that computation graph optimizations can be applied to various backend targets. Frontend optimizations are divided into three categories:
- Node-level optimizations, such as Zero-dim-tensor elimination, Nop Elimination
- Block-level optimizations, such as algebraic simplification, constant folding, operator fusion
- Data flow-level optimizations, such as Common sub-expression elimination, DCE
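A small, self-contained sketch of two of the frontend passes named above, constant folding and dead code elimination, on a toy graph representation of my own devising (real compilers run these on their graph IR; the dict encoding and op names here are illustrative assumptions).

```python
# Toy graph: {node_name: (op, inputs)}; "const" nodes carry a Python value.
# Assumes nodes appear in topological order.
graph = {
    "a": ("const", 2),
    "b": ("const", 3),
    "c": ("add", ["a", "b"]),      # foldable: both inputs are constants
    "d": ("mul", ["c", "x"]),      # depends on the graph input "x"
    "e": ("add", ["a", "a"]),      # dead: nothing uses "e"
}
outputs = ["d"]

OPS = {"add": lambda u, v: u + v, "mul": lambda u, v: u * v}

def constant_fold(graph):
    for name, (op, args) in list(graph.items()):
        if op in OPS and all(graph.get(a, ("", 0))[0] == "const" for a in args):
            value = OPS[op](*[graph[a][1] for a in args])
            graph[name] = ("const", value)      # replace the op with its result
    return graph

def dead_code_elim(graph, outputs):
    live, stack = set(), list(outputs)
    while stack:                                # walk backwards from the outputs
        n = stack.pop()
        if n in graph and n not in live:
            live.add(n)
            op, args = graph[n]
            if op != "const":
                stack.extend(args)
    return {n: v for n, v in graph.items() if n in live}

g = dead_code_elim(constant_fold(dict(graph)), outputs)
print(g)   # {'c': ('const', 5), 'd': ('mul', ['c', 'x'])} -- 'a', 'b', 'e' removed
```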
Backend Optimization#
- Optimizations for Specific Hardware
- The goal is to obtain high-performance code for specific hardware architectures. 1) Low-level IR is converted to LLVM IR, utilizing the LLVM infrastructure to generate optimized CPU/GPU code. 2) Using domain knowledge for custom optimizations, which can more effectively utilize the target hardware.
- Automatic Tuning
- Because the search space of tuning parameters in hardware-specific optimizations is vast, automatic tuning is needed to determine the best parameter settings. 1) Halide/TVM separate the schedule from the compute expression and use auto-tuning to derive optimal configurations. 2) The polyhedral model can be applied for parameter tuning. (A minimal search sketch follows this list.)
- Optimized Kernel Libraries
- Vendor-specific optimized kernel libraries are widely used to accelerate DL training and inference on various hardware. When the optimized primitives they provide match the computational requirements, using such libraries yields significant performance improvements; otherwise, the fixed primitives may constrain further optimization.
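A minimal auto-tuning sketch of my own, referenced above: exhaustively timing a few candidate tile sizes for a blocked matrix multiplication and keeping the fastest. This is the same search-and-measure loop that Halide/TVM-style auto-tuners automate at far larger scale (the search space, sizes, and `blocked_matmul` helper are assumptions for illustration).

```python
import time
import numpy as np

def blocked_matmul(a, b, tile):
    n, k = a.shape
    _, m = b.shape
    c = np.zeros((n, m), dtype=a.dtype)
    for i0 in range(0, n, tile):
        for k0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                c[i0:i0+tile, j0:j0+tile] += (
                    a[i0:i0+tile, k0:k0+tile] @ b[k0:k0+tile, j0:j0+tile]
                )
    return c

a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)

best = None
for tile in (16, 32, 64, 128, 256):            # the (tiny) search space
    start = time.perf_counter()
    blocked_matmul(a, b, tile)
    elapsed = time.perf_counter() - start      # the measured "cost"
    if best is None or elapsed < best[1]:
        best = (tile, elapsed)
    print(f"tile={tile:<4d} {elapsed * 1e3:7.2f} ms")

print("best tile:", best[0])
```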
Challenges and Reflections on AI Compilers#
XLA: Optimizing Machine Learning Compiler#
XLA (Accelerated Linear Algebra) is a domain-specific linear algebra compiler that can speed up TensorFlow model execution, potentially without requiring any changes to the source code.
XLA: breaks the operators in a computation-graph subgraph into smaller operators, then performs compilation optimizations on the subgraphs composed of these smaller operators, including buffer fusion, horizontal fusion, etc. The key questions are how to break up large operators, how to recombine small operators, and how to generate new large operators. The overall design is implemented mainly through HLO/LLO/LLVM IR, with all pass rules specified manually in advance.
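The text above describes XLA in its TensorFlow setting; as a hands-on illustration I use JAX here as a stand-in (an assumption of mine, justified by the fact that jax.jit also lowers to XLA). The snippet jit-compiles a small function so XLA can fuse its matmul and elementwise operations, and prints the high-level program handed to the compiler.

```python
import jax
import jax.numpy as jnp

def f(x, w, b):
    # A matmul followed by elementwise ops that XLA is free to fuse into fewer kernels.
    return jax.nn.relu(x @ w + b)

x = jnp.ones((8, 16))
w = jnp.ones((16, 4))
b = jnp.zeros((4,))

fast_f = jax.jit(f)                   # compile the whole function with XLA
print(fast_f(x, w, b).shape)          # (8, 4)
print(jax.make_jaxpr(f)(x, w, b))     # the high-level program that gets lowered to XLA HLO
```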
TVM: End-to-End Deep Learning Compiler#
To enable optimization at both the computation graph and operator levels for various hardware backends, TVM obtains high-level representations of DL programs from existing frameworks and generates low-level optimized code for multiple hardware platforms, aiming to demonstrate competitiveness with manual tuning.
TVM: Divided into Relay and TVM layers, Relay focuses on the graph layer, while TVM focuses on the operator layer, optimizing the frontend subgraph. Relay focuses on operator fusion, while TVM focuses on generating new operators and kernels. The distinction is that TVM has an open architecture, and Relay aims to integrate various frontends. TVM is also a tool for independent operator development and compilation, employing a separation scheme for Compute (designing computation logic) and Schedule (specifying scheduling optimization logic).
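As a concrete view of the Compute/Schedule separation described above, here is a sketch using TVM's classic tensor-expression (te) API. Note the assumption on my part that the older te/schedule API is available; recent TVM releases are migrating to TensorIR/Relax and the exact entry points differ there.

```python
import tvm
from tvm import te

n = 1024
A = te.placeholder((n, n), name="A")
B = te.placeholder((n, n), name="B")
k = te.reduce_axis((0, n), name="k")

# Compute: *what* to calculate (a plain matmul), with no performance decisions.
C = te.compute((n, n), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Schedule: *how* to calculate it -- tiling and loop-order decisions live here.
s = te.create_schedule(C.op)
io, ii = s[C].split(C.op.axis[0], factor=32)
jo, ji = s[C].split(C.op.axis[1], factor=32)
s[C].reorder(io, jo, ii, ji)

print(tvm.lower(s, [A, B, C], simple_mode=True))   # inspect the generated loop nest
```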
Tensor Comprehensions: Neural Network Language Compiler#
Tensor Comprehensions is a language and just-in-time (JIT) compilation system that lets programmers express computations in a high-level notation and have them compiled into efficient low-level code, such as GPU kernels.
TC: Operator computation logic is relatively easy to implement, but scheduling development is challenging, requiring familiarity with both algorithm logic and hardware architecture. Additionally, once the boundaries of the computation graph are opened and small operators are fused, new operators and kernels will be generated, making it difficult to generate schedules for new operators. Traditional methods define scheduling templates; TC aims to achieve auto-scheduling through the polyhedral model.
nGraph: Deep Learning System Compiler Compatible with All Frameworks#
nGraph serves as the foundation on which deep learning frameworks build more complex DNN operations, aiming for an ideal balance of efficiency between training and inference computation.
Challenges#
- Dynamic Shape and Dynamic Computation Graph:
  - Current Situation: Mainstream AI compilers primarily target specific static shape inputs for compilation optimization, with limited support for dynamic computation graphs that contain control flow semantics.
  - Problem: There is significant demand for dynamic computation graphs in AI application scenarios. Although the frontend can attempt to rewrite computation graphs as static computation graphs or expand suitable subgraphs for optimization, these methods do not solve all problems.
  - Example: Certain AI tasks (such as pyramid-structure detection models) cannot be statically rewritten through manual intervention, making it difficult for compilers to optimize effectively in these cases.
- Python Compilation Staticization:
Method | Description |
---|---|
Python JIT VM | Such as PyPy, or attempts to add JIT compilation acceleration on top of CPython's interpreted execution; compatibility issues remain. |
Decorator-based methods | Such as torch.jit.script, torch.jit.trace, torch.fx, LazyTensor, TorchDynamo. These are widely used, but a complete solution is still lacking. |
- **Challenges for AI Compilers in Python Staticization**:
- **Type Inference**: From Python's dynamic types to static types in compiler IR.
- **Control Flow Expression**: Expression of control statements such as if, else, while, for, etc.
- **Flexible Syntax and Data Type Conversion**: Handling operations such as Slice, Dict, Index, etc.
- **JIT Compilation Performance**: Whether tracing-based or AST transform, additional compilation overhead is required.
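To make the type-inference and control-flow points above concrete, a small hedged example (assuming PyTorch is installed; the function is my own): torch.jit.script must infer static types from the annotations and capture the Python for/if into compiler IR, which is exactly where the listed difficulties arise.

```python
import torch

@torch.jit.script
def clipped_cumsum(x: torch.Tensor, limit: float) -> torch.Tensor:
    # Python control flow (for / if) that the script compiler must turn into IR.
    total = torch.zeros_like(x[0])
    for i in range(x.shape[0]):
        total = total + x[i]
        if bool(total.abs().max() > limit):
            total = torch.clamp(total, -limit, limit)
    return total

x = torch.randn(10, 4)
print(clipped_cumsum(x, 2.0))
print(clipped_cumsum.code)   # the statically-typed TorchScript produced from the Python source
```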
- Leveraging Hardware Performance, Especially DSA-type Chips:
  - Current Situation: DSA (Domain-Specific Architecture) hardware is increasingly important for AI training and inference, such as SIMD units in CPUs, the SIMT architecture of GPUs, and the Cube cores in Huawei Ascend chips.
  - Challenges:
    - Performance optimization depends on graph-operator fusion: optimizing the graph layer and the operator layer independently cannot fully exploit chip performance, so graph-operator fusion optimizations are needed, such as subgraph partitioning, vertical fusion within subgraphs, and horizontal parallel optimization.
    - Increased optimization complexity: multi-level storage structures and acceleration instructions for scalars, vectors, and tensors complicate the scheduling, tiling, vectorization, and tensorization of kernel implementations.
  - Solutions:
    - Open the boundaries between graphs and operators for recombination optimization to fully exploit chip performance.
    - Combine multiple optimization techniques: vertical fusion optimization (such as buffer fusion) and horizontal parallel optimization (such as data parallelism).
    - Automatically generate kernel code, including scheduling, tiling, and vectorization, for the recombined and optimized subgraphs.
- Handling Neural Network Characteristics: Automatic Differentiation, Automatic Parallelism, etc.:
Task | Current Situation | Challenges |
---|---|---|
Automatic Parallelism | Current large model training faces memory and performance walls, requiring complex parallel strategies. Scale out: Multi-dimensional mixed parallel capabilities, including data parallelism, tensor parallelism, pipeline parallelism, etc. Scale up: Re-computation, mixed precision, heterogeneous parallelism, etc. | Relies on manually configured partitioning strategies, creating high barriers and low efficiency. Semi-automatic parallelism can solve some efficiency issues, but achieving automatic identification of optimal parallel strategies depends on solving compilation and convex optimization problems. |
Automatic Differentiation | Control Flow: Dynamic graphs execute control flow on the Python side, leading to performance degradation with many loop iterations; Static graphs solve automatic differentiation through logical stitching or computation graph expansion, which can somewhat alleviate performance issues, but the solution still needs improvement. Higher-order Differentiation: Effectively expands higher-order forward and backward differentiation through the Jacobian matrix; simulates rapid computation of higher-order differentiation through the Hessian matrix. | Control Flow: Dynamic graphs experience performance drops when handling complex control flows, especially with many loop iterations. Static graph solutions need further optimization to improve performance and stability. Higher-order Differentiation: How to effectively expand and compute higher-order forward and backward differentiation, ensuring accuracy and efficiency in computations. Developers need to flexibly control higher-order differentiation graphs. |
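A short hedged example (assuming PyTorch; the function y = x^3 is my own) of the higher-order differentiation discussed in the table above: keeping the first-derivative graph alive with create_graph=True is what lets the framework differentiate a second time.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3                                        # y = x^3

# First derivative: dy/dx = 3x^2 = 12; create_graph=True keeps it differentiable.
(g1,) = torch.autograd.grad(y, x, create_graph=True)
# Second derivative: d2y/dx2 = 6x = 12.
(g2,) = torch.autograd.grad(g1, x)

print(g1.item(), g2.item())   # 12.0 12.0
```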
- Balancing Usability and Performance
Aspect | Description |
---|---|
Boundaries and Integration with AI Frameworks | Different AI frameworks have varying abstract descriptions and API interfaces for deep learning tasks, each with its characteristics. It is necessary to consider how to transparently support user computation graph descriptions without fully supporting all operators. |
User Transparency Issues | Some AI compilers are not fully automated compilation tools, and performance depends on high-level abstract implementation templates provided by users (such as TVM). This reduces the labor cost of manually tuning operator implementations, but existing abstractions may not adequately describe the operator implementations required for innovative hardware architectures, necessitating familiarity with compiler architecture for secondary development or architectural reconstruction, creating high barriers and heavy development burdens. |
Compilation Overhead | As performance-optimization tools, AI compilers only have practical value when the compilation overhead is small enough relative to the performance gains they deliver. In certain application scenarios, the requirements on compilation overhead are strict, and using AI compilers may hinder rapid model debugging and validation, increasing the difficulty and burden of development and deployment. |
Performance Issues | The essence of compiler optimization is to replace manual optimization methods or difficult-to-explore optimization methods through generalization and abstraction, substituting the labor costs of manual optimization with limited compilation overhead. Deep learning compilers can only realize their value when they can genuinely replace or exceed manual optimization in performance. |
Robustness | Most AI compilers are still in the research stage, with significant gaps in product maturity compared to industrial applications. It is necessary to consider whether computation graphs can be compiled smoothly, ensuring the correctness of computation results, and the ability to track and debug errors. |
Other Issues#
- Can Graphs and Operators Be Unified for Expression and Compilation Optimization, Forming a General AI Compiler?
  Under current AI frameworks, the graph layer and the operator layer are expressed and optimized separately: algorithm engineers mainly work with graph-layer expressions, while AI framework, chip, and kernel development engineers are responsible for operator expressions. In the future, will there be an IR that bridges graphs and operators? And under new scenarios driven by AI + scientific computing, AI + big data, and so on, will the boundary between the computation-graph layer and the operator layer blur, allowing unified AI compilation optimization?
- Is Complete Automatic Parallelism Feasible?
  Automatic parallelism performs distributed training automatically, based on the user's serial network model and the provided cluster resource information. Through a unified distributed computation graph and a unified resource graph, it can support distributed training over arbitrary parallelism schemes and various hardware cluster resources, and a global cost-model planner can adaptively select hardware-aware parallel strategies for a training task. In reality, automatic parallelism is a strategy-search problem: strategy search can find a suboptimal answer within a limited search space, but whether truly automatic parallelism is achievable still requires further study and validation.
- Do AI Chips Need Compilers? Do AI Chips Need AI Compilers?
  How much an AI chip depends on a compiler is determined by the design of the chip itself: the more flexible the chip, the more it depends on the compiler. Early AI chip designs followed a CISC style, resolving optimizations inside the chip. However, as dedicated domains evolve toward more flexible demands, AI chips themselves will become increasingly flexible while retaining tensor instruction sets and special memory structures. Future architects will need to co-design chips and systems, with automation increasingly applied to dedicated chips.
Future#
- Compiler Forms: Separate inference and training, coexistence of AOT and JIT compilation methods.
- IR Forms: Need a unified IR representation for AI similar to MLIR.
- Automatic Parallelism: Provide compilation optimization capabilities for automatic parallelism across machines and nodes.
- Automatic Differentiation: Provide methods for computing higher-order differentiation and facilitate operations on graphs.
- Automatic Kernel Generation: Lower development barriers and quickly implement efficient and highly generalized operators.