2019-03-27

Creating a Working Compiler with LLVM

LLVM Pass

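LLVM is best known for the optimizations it provides, and those optimizations are implemented as passes. What matters here is that LLVM also lets you write utility passes with very little code. For example, if you do not want any function name to begin with "hello", a utility pass can check for exactly that.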

Understanding the LLVM opt tool

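According to its man page, opt is the modular LLVM optimizer and analyzer. Once the code for your custom pass is ready, you compile it into a shared library and load that library with opt. If your LLVM installation went well, opt should already be on your system. The opt command accepts both LLVM IR (.ll extension) and LLVM bitcode (.bc extension), and it can emit either LLVM IR or bitcode. The following command shows how to load a custom shared library with opt: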

$ opt -load=mycustom_pass.so -help -S

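Note also that running opt -help from the command line prints a detailed list of the passes LLVM can run. Combining the -load option with -help produces a help message that also includes information about your custom pass.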

Creating a custom LLVM pass

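LLVM passes are declared in the Pass.h file, which on my system is installed under /usr/include/llvm. That file defines the interface of an individual pass as part of the Pass class. The specific pass types, each derived from Pass, are declared in the same file. The pass types include: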

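BasicBlockPass class: used for local optimizations, which typically operate on one basic block or instruction at a time
FunctionPass class: used for global optimizations, one function at a time
ModulePass class: used for arbitrary, unstructured interprocedural optimizations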

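Because you intend to create a pass that objects to any function name beginning with "hello", you need to create your own pass by deriving from FunctionPass. Listing 1 shows the relevant code, copied from Pass.h.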

Listing 1. Overriding the runOnFunction method in FunctionPass

class FunctionPass : public Pass {
public:
  explicit FunctionPass(char &pid) : Pass(PT_Function, pid) {}
  /// runOnFunction - Virtual method overridden by subclasses to do the
  /// per-function processing of the pass.
  ///
  virtual bool runOnFunction(Function &F) = 0;
  // …
};

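Similarly, the BasicBlockPass class declares runOnBasicBlock and the ModulePass class declares runOnModule, both as pure virtual methods. A subclass must provide a definition for the pure virtual method it inherits.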
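The relationship between a pass base class and its per-function hook can be sketched without LLVM at all. The following is a minimal, LLVM-free model, not LLVM's implementation: ToyFunction, ToyFunctionPass, HelloChecker, and runPass are hypothetical names invented for this illustration. It only shows the shape of the contract: the base class declares a pure virtual method, the subclass defines it, and a driver calls it once per function.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for llvm::Function: only the name matters in this sketch.
struct ToyFunction {
  std::string name;
};

// Stand-in for llvm::FunctionPass: one pure virtual hook per function.
class ToyFunctionPass {
public:
  virtual ~ToyFunctionPass() {}
  // Returns true if the pass modified the function, false otherwise.
  virtual bool runOnFunction(ToyFunction &F) = 0;
};

// A concrete pass that records every function whose name starts with "hello".
class HelloChecker : public ToyFunctionPass {
public:
  std::vector<std::string> flagged;

  bool runOnFunction(ToyFunction &F) override {
    if (F.name.compare(0, 5, "hello") == 0)
      flagged.push_back(F.name);
    return false; // analysis only: nothing was changed
  }
};

// Hypothetical driver: calls the pass once per function, the way a pass
// manager would, and reports whether any call changed anything.
bool runPass(ToyFunctionPass *pass, std::vector<ToyFunction> &module) {
  assert(pass && "a pass is required");
  bool changed = false;
  for (ToyFunction &F : module)
    changed |= pass->runOnFunction(F);
  return changed;
}
```

Calling runPass(&checker, module) on a HelloChecker leaves the offending names in checker.flagged; the return value stays false because the sketch never modifies a function.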

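Returning to the runOnFunction method in Listing 1, you can see that its input is an object of type Function. Dig into /usr/include/llvm/Function.h and it is easy to see that LLVM uses the Function class to encapsulate a C/C++ function. Function in turn derives from the Value class defined in Value.h and supports a getName method. Listing 2 shows the code.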

Listing 2. Creating a custom LLVM pass

#include "llvm/Pass.h"
#include "llvm/Function.h"
#include <iostream>
class TestClass : public llvm::FunctionPass {
public:
  virtual bool runOnFunction(llvm::Function &F)
  {
    // getName() returns a StringRef, which provides startswith()
    if (F.getName().startswith("hello"))
    {
      std::cout << "Function name starts with hello\n";
    }
    return false; // the function was not modified
  }
};

The code in Listing 2 leaves out two important details:

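The FunctionPass constructor needs a char, which LLVM uses internally. LLVM uses the address of that char, so you can initialize it with any value.
You need some way to tell the LLVM system that the class you created is a new pass. This is where the RegisterPass LLVM template comes in. The RegisterPass template is declared in the PassSupport.h header; that file is included by Pass.h, so no additional header is needed.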

Listing 3. Registering the LLVM function pass

class TestClass : public llvm::FunctionPass
{
public:
  TestClass() : llvm::FunctionPass(TestClass::ID) { }
  virtual bool runOnFunction(llvm::Function &F) {
    if (F.getName().startswith("hello")) {
      std::cout << "Function name starts with hello\n";
    }
    return false;
  }
  static char ID; // could be a global too
};
char TestClass::ID = 'a';
static llvm::RegisterPass<TestClass> global_("test_llvm", "test llvm", false, false);

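The first argument passed to the RegisterPass object is the name of the pass as it will be used on the command line with opt. That is all there is to it: build a shared library from the code in Listing 3, then run opt to load the library, followed by the name of the pass registered with RegisterPass (test_llvm in this case), and finally a bitcode file on which your custom pass will run along with the other passes. Listing 4 outlines these steps.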

Listing 4. Running the custom pass

bash$ g++ -c pass.cpp -I/usr/local/include `llvm-config --cxxflags`
bash$ g++ -shared -o pass.so pass.o -L/usr/local/lib `llvm-config --ldflags --libs`
bash$ opt -load=./pass.so -test_llvm < test.bc
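The two details from Listing 3, identity by the address of a static char and registration as a side effect of a static object, can be modelled in a few lines of standard C++. This is a hedged sketch of the general pattern only: ToyPass, ToyRegisterPass, registry, and createPass are hypothetical names, and LLVM's real RegisterPass does considerably more.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Minimal stand-in for a pass; `id` mimics the address-of-a-static-char idiom.
struct ToyPass {
  explicit ToyPass(const char &idRef) : id(&idRef) {}
  virtual ~ToyPass() {}
  const void *id; // identity is the address, not the stored character
};

// Hypothetical global registry: command-line name -> factory.
typedef std::function<std::unique_ptr<ToyPass>()> Factory;
std::map<std::string, Factory> &registry() {
  static std::map<std::string, Factory> r; // built on first use
  return r;
}

// Constructing a ToyRegisterPass<T> at namespace scope registers T as a
// side effect of static initialization.
template <typename T> struct ToyRegisterPass {
  explicit ToyRegisterPass(const std::string &name) {
    bool inserted = registry().emplace(name, [] {
      return std::unique_ptr<ToyPass>(new T());
    }).second;
    assert(inserted && "pass name registered twice");
    (void)inserted; // silence the unused warning when NDEBUG is defined
  }
};

struct TestPass : ToyPass {
  static char ID;
  TestPass() : ToyPass(ID) {}
};
char TestPass::ID = 'a'; // the value is irrelevant; only &ID is used

static ToyRegisterPass<TestPass> registerTestPass("test_llvm");

// Looks a pass up by name, as a driver would after seeing -test_llvm.
std::unique_ptr<ToyPass> createPass(const std::string &name) {
  std::map<std::string, Factory>::iterator it = registry().find(name);
  if (it == registry().end())
    return nullptr;
  return it->second();
}
```

Keeping the registry behind a function-local static avoids depending on the initialization order of globals across translation units, which is one reason this shape is common for plugin-style registration.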

Now let's turn to another tool: clang.

Introducing clang

A note before you start

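LLVM has its own front end: a tool called, appropriately enough, clang. Clang is a powerful C/C++/Objective-C compiler whose compile times match or beat those of the GNU Compiler Collection (GCC) tools. More importantly, clang has a hackable code base that makes custom extensions easy to build. This article uses the clang API as the LLVM front end and develops a few small applications for preprocessing and parsing.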

Common clang classes

You need to be familiar with some of the most common clang classes:

CompilerInstance
Preprocessor
FileManager
SourceManager
DiagnosticsEngine
LangOptions
TargetInfo
ASTConsumer
Sema

ParseAST is perhaps the most important clang method; it is covered in more detail later.

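For all practical purposes, consider CompilerInstance the compiler proper. It provides the interfaces, manages access to the AST, preprocesses the input sources, and maintains the target information. A typical application needs to create a CompilerInstance object to do anything useful. Listing 5 shows the rough contents of the CompilerInstance.h header.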

Listing 5. The CompilerInstance class

class CompilerInstance : public ModuleLoader {
  /// The options used in this compiler instance.
  llvm::IntrusiveRefCntPtr<CompilerInvocation> Invocation;
  /// The diagnostics engine instance.
  llvm::IntrusiveRefCntPtr<DiagnosticsEngine> Diagnostics;
  /// The target being compiled for.
  llvm::IntrusiveRefCntPtr<TargetInfo> Target;
  /// The file manager.
  llvm::IntrusiveRefCntPtr<FileManager> FileMgr;
  /// The source manager.
  llvm::IntrusiveRefCntPtr<SourceManager> SourceMgr;
  /// The preprocessor.
  llvm::IntrusiveRefCntPtr<Preprocessor> PP;
  /// The AST context.
  llvm::IntrusiveRefCntPtr<ASTContext> Context;
  /// The AST consumer.
  OwningPtr<ASTConsumer> Consumer;
  /// \brief The semantic analysis object.
  OwningPtr<Sema> TheSema;
  // … the list continues
};

Preprocessing a C file

In clang, there are at least two ways to create a preprocessor object:

Directly instantiate a Preprocessor object
Use the CompilerInstance class to create a Preprocessor object

Let's start with the latter approach.

Using helper and utility classes for preprocessing

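The Preprocessor alone is not much help: you need the FileManager and SourceManager classes to read files and to track source locations for diagnostics. The FileManager class implements support for file system lookup, file system caching, and directory search. Also look at the FileEntry class, which defines clang's abstraction for a source file. Listing 6 provides an excerpt from the FileManager.h header.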

Listing 6. The clang FileManager class

class FileManager : public llvm::RefCountedBase<FileManager> {
  FileSystemOptions FileSystemOpts;
  /// \brief The virtual directories that we have allocated.  For each
  /// virtual file (e.g. foo/bar/baz.cpp), we add all of its parent
  /// directories (foo/ and foo/bar/) here.
  SmallVector<DirectoryEntry*, 4> VirtualDirectoryEntries;
  /// \brief The virtual files that we have allocated.
  SmallVector<FileEntry*, 4> VirtualFileEntries;
  /// NextFileUID - Each FileEntry we create is assigned a unique ID #.
  unsigned NextFileUID;
  // Statistics.
  unsigned NumDirLookups, NumFileLookups;
  unsigned NumDirCacheMisses, NumFileCacheMisses;
  // …
  // Caching.
  OwningPtr<FileSystemStatCache> StatCache;
  // …
};

The SourceManager class is typically queried for SourceLocation objects. Listing 7, taken from the SourceManager.h header, describes the kinds of location a SourceLocation can represent.

Listing 7. Understanding SourceLocation

/// There are three different types of locations in a file: a spelling
/// location, an expansion location, and a presumed location.
///
/// Given an example of:
/// #define min(x, y) x < y ? x : y
///
/// and then later on a use of min:
/// #line 17
/// return min(a, b);
///
/// The expansion location is the line in the source code where the macro
/// was expanded (the return statement), the spelling location is the
/// location in the source where the macro was originally defined,
/// and the presumed location is where the line directive states that
/// the line is 17, or any other line.
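Of the three kinds of location, the presumed location is the easiest to model in isolation, because it depends only on physical line numbers and #line directives. The sketch below is an illustration under that narrow assumption: LineDirective and presumedLine are hypothetical names, file names in #line are ignored, and this is not how SourceManager stores or computes locations. Spelling and expansion locations are not modelled at all.

```cpp
#include <cassert>
#include <vector>

// One "#line N" directive, recorded by the physical line it appears on.
struct LineDirective {
  unsigned physicalLine; // where the directive itself sits in the file
  unsigned newLine;      // the number it assigns to the *following* line
};

// Returns the presumed line number for `physicalLine`.
// `directives` must be sorted by physicalLine in ascending order.
// Lines before the first directive keep their physical numbering.
unsigned presumedLine(unsigned physicalLine,
                      const std::vector<LineDirective> &directives) {
  unsigned result = physicalLine;
  for (const LineDirective &d : directives) {
    assert(d.physicalLine >= 1 && "lines are numbered from 1");
    if (d.physicalLine >= physicalLine)
      break; // this and later directives do not affect the queried line
    // The line right after the directive is numbered d.newLine, the one
    // after that d.newLine + 1, and so on.
    result = d.newLine + (physicalLine - d.physicalLine - 1);
  }
  return result;
}
```

With a #line 17 directive on physical line 3, the return statement on physical line 4 is presumed to be on line 17, which matches the situation described in the comment in Listing 7.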

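Clearly, SourceManager depends on the underlying FileManager; in fact, the SourceManager constructor takes a FileManager as an input argument. Finally, you need to track and report the errors that may occur while processing the source. The DiagnosticsEngine class does that job. As with Preprocessor, you have two choices: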

Create all the necessary objects yourself
Let CompilerInstance do all the work

Let's use the latter approach. Listing 8 shows the code for the preprocessor; everything else has already been explained.

Listing 8. Creating a preprocessor with the clang API

#include "clang/Basic/Diagnostic.h"
#include "clang/Basic/FileManager.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Lex/Preprocessor.h"
#include <iostream>
using namespace clang;

int main()
{
    CompilerInstance ci;
    ci.createDiagnostics(0,NULL); // create DiagnosticsEngine
    ci.createFileManager();  // create FileManager
    ci.createSourceManager(ci.getFileManager()); // create SourceManager
    ci.createPreprocessor();  // create Preprocessor
    const FileEntry *pFile = ci.getFileManager().getFile("hello.c");
    ci.getSourceManager().createMainFileID(pFile);
    ci.getPreprocessor().EnterMainSourceFile();
    ci.getDiagnosticClient().BeginSourceFile(ci.getLangOpts(), &ci.getPreprocessor());
    Token tok;
    do {
        ci.getPreprocessor().Lex(tok); // fetch the next token
        if( ci.getDiagnostics().hasErrorOccurred())
            break;
        ci.getPreprocessor().DumpToken(tok);
        std::cerr << std::endl;
    } while ( tok.isNot(clang::tok::eof)); // stop after the EOF token
    ci.getDiagnosticClient().EndSourceFile();
    return 0;
}

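Listing 8 uses the CompilerInstance class to create, in order, the DiagnosticsEngine (the ci.createDiagnostics call), the FileManager (ci.createFileManager), and the SourceManager (ci.createSourceManager), followed by the Preprocessor itself. Once the file has been associated through a FileEntry, the loop processes each token in the source file until it reaches the end of file (EOF). The preprocessor's DumpToken method dumps each token to the screen.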
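The lex-until-EOF loop has a shape that is independent of clang, and a toy version may make it easier to see. The sketch below is a stand-in only: ToyToken, ToyLexer, and lexAll are hypothetical names, the kind strings are borrowed from the dump format shown in this article, keywords are not distinguished from identifiers, and nothing here performs real preprocessing such as #include or macro expansion.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct ToyToken {
  std::string kind;     // e.g. "identifier", "semi", "eof"
  std::string spelling; // the exact source text of the token
};

// Hypothetical lexer over an in-memory string.
class ToyLexer {
public:
  explicit ToyLexer(const std::string &src) : src_(src), pos_(0) {}

  // Produces the next token; the kind is "eof" once the input is exhausted.
  void lex(ToyToken &tok) {
    assert(pos_ <= src_.size() && "cursor ran past the buffer");
    while (pos_ < src_.size() && std::isspace(uc(src_[pos_])))
      ++pos_; // skip whitespace between tokens
    if (pos_ == src_.size()) {
      tok.kind = "eof";
      tok.spelling.clear();
      return;
    }
    const std::size_t start = pos_;
    const char c = src_[pos_];
    if (std::isalpha(uc(c)) || c == '_') {
      while (pos_ < src_.size() &&
             (std::isalnum(uc(src_[pos_])) || src_[pos_] == '_'))
        ++pos_;
      tok.kind = "identifier";
    } else if (std::isdigit(uc(c))) {
      while (pos_ < src_.size() && std::isdigit(uc(src_[pos_])))
        ++pos_;
      tok.kind = "numeric_constant";
    } else {
      ++pos_; // every other character is a one-character token
      tok.kind = punctKind(c);
    }
    tok.spelling = src_.substr(start, pos_ - start);
  }

private:
  static unsigned char uc(char c) { return static_cast<unsigned char>(c); }

  static std::string punctKind(char c) {
    switch (c) {
    case '{': return "l_brace";
    case '}': return "r_brace";
    case '[': return "l_square";
    case ']': return "r_square";
    case '(': return "l_paren";
    case ')': return "r_paren";
    case ';': return "semi";
    case ',': return "comma";
    case '*': return "star";
    default:  return "unknown";
    }
  }

  std::string src_;
  std::size_t pos_;
};

// Same shape as the loop in Listing 8: lex, record, stop after eof.
std::vector<ToyToken> lexAll(const std::string &src) {
  ToyLexer lexer(src);
  std::vector<ToyToken> out;
  ToyToken tok;
  do {
    lexer.lex(tok);
    out.push_back(tok);
  } while (tok.kind != "eof");
  return out;
}
```

Because the loop is a do/while, the eof token itself is recorded before the loop exits, just as the final token is dumped in Listing 8.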

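To compile and run the code in Listing 8, use the makefile in Listing 9 (adjusted for your clang and LLVM installation folders). The main idea is to let the llvm-config tool supply whatever LLVM include paths and libraries are required: you should never try to work those out by hand and pass them on the g++ command line.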

Listing 9. Makefile for building the preprocessor code

CXX := g++
RTTIFLAG := -fno-rtti
CXXFLAGS := $(shell llvm-config --cxxflags) $(RTTIFLAG)
LLVMLDFLAGS := $(shell llvm-config --ldflags --libs)
DDD := $(shell echo $(LLVMLDFLAGS))
SOURCES = main.cpp
OBJECTS = $(SOURCES:.cpp=.o)
EXES = $(OBJECTS:.o=)
CLANGLIBS = \
    -L /usr/local/lib \
    -lclangFrontend \
    -lclangParse \
    -lclangSema \
    -lclangAnalysis \
    -lclangAST \
    -lclangLex \
    -lclangBasic \
    -lclangDriver \
    -lclangSerialization \
    -lLLVMMC \
    -lLLVMSupport
all: $(OBJECTS) $(EXES)
%: %.o
	$(CXX) -o $@ $< $(CLANGLIBS) $(LLVMLDFLAGS)

After compiling and running the code above, you should get the output shown in Listing 10.

Listing 10. Crash when running the code from Listing 8

Assertion failed: (Target && "Compiler instance has no target!"),
   function getTarget, file
   /Users/Arpan/llvm/tools/clang/lib/Frontend/../..
   /include/clang/Frontend/CompilerInstance.h,
   line 294.
Abort trap: 6

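What you missed here is the last piece of the CompilerInstance setup: the target platform that the code is being compiled for. This is where the TargetInfo and TargetOptions classes come in. According to the clang header TargetInfo.h, the TargetInfo class stores the required information about the target system for code generation and must be created before compilation or preprocessing. As you would expect, TargetInfo holds information such as integer and floating-point widths and alignments. Listing 11 provides an excerpt from the TargetInfo.h header.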

Listing 11. The clang TargetInfo class

class TargetInfo : public llvm::RefCountedBase<TargetInfo> {
  llvm::Triple Triple;
protected:
  bool BigEndian;
  unsigned char PointerWidth, PointerAlign;
  unsigned char IntWidth, IntAlign;
  unsigned char HalfWidth, HalfAlign;
  unsigned char FloatWidth, FloatAlign;
  unsigned char DoubleWidth, DoubleAlign;
  unsigned char LongDoubleWidth, LongDoubleAlign;
  // …
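The llvm::Triple member at the top of Listing 11 holds the target triple, a string of the general form arch-vendor-os with an optional fourth environment component, for example x86_64-unknown-linux-gnu. To make that structure concrete, here is a small sketch that splits such a string on dashes. ToyTriple and parseTriple are hypothetical names; the real llvm::Triple class also normalizes and validates its input, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct ToyTriple {
  std::string arch, vendor, os, environment; // environment may be empty
};

// Splits "arch-vendor-os[-environment]" on '-'. Components that are missing
// stay empty; no validation of the individual names is performed.
ToyTriple parseTriple(const std::string &text) {
  assert(!text.empty() && "a triple needs at least an architecture");
  ToyTriple t;
  std::string *fields[4] = {&t.arch, &t.vendor, &t.os, &t.environment};
  std::size_t begin = 0;
  for (int i = 0; i < 4; ++i) {
    // The last field takes everything that remains, dashes included.
    const std::size_t end =
        (i == 3) ? std::string::npos : text.find('-', begin);
    if (end == std::string::npos) {
      *fields[i] = text.substr(begin);
      break;
    }
    *fields[i] = text.substr(begin, end - begin);
    begin = end + 1;
  }
  return t;
}
```

For instance, parseTriple("x86_64-apple-darwin") yields the architecture x86_64, the vendor apple, the operating system darwin, and an empty environment.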

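The TargetInfo class is initialized with two arguments: a DiagnosticsEngine and a TargetOptions. Of these, the latter must have its Triple string set to the appropriate value for the current platform. This is where LLVM comes in handy. Listing 12 shows the addition to Listing 8 that gets the preprocessor working.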

Listing 12. Setting the target options for the compiler

int main()
{
    CompilerInstance ci;
    ci.createDiagnostics(0,NULL);
    // create TargetOptions
    TargetOptions to;
    to.Triple = llvm::sys::getDefaultTargetTriple();
    // create TargetInfo
    TargetInfo *pti = TargetInfo::CreateTargetInfo(ci.getDiagnostics(), to);
    ci.setTarget(pti);
    // rest of the code same as in Listing 8…
    ci.createFileManager();
    // …

It is that simple. Run the code and look at the output for this simple hello.c test:

#include <stdio.h>
int main() {  printf("hello world!\n"); }

Listing 13 shows part of the preprocessor output.

Listing 13. Preprocessor output (partial)

typedef 'typedef'
struct 'struct'
identifier '__va_list_tag'
l_brace '{'
unsigned 'unsigned'
identifier 'gp_offset'
semi ';'
unsigned 'unsigned'
identifier 'fp_offset'
semi ';'
void 'void'
star '*'
identifier 'overflow_arg_area'
semi ';'
void 'void'
star '*'
identifier 'reg_save_area'
semi ';'
r_brace '}'
identifier '__va_list_tag'
semi ';'

identifier '__va_list_tag'
identifier '__builtin_va_list'
l_square '['
numeric_constant '1'
r_square ']'
semi ';'

Creating a Preprocessor object manually

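One of the nice things about the clang libraries is that you can achieve the same result in several ways. In this section, you create a Preprocessor object without asking CompilerInstance for it. Listing 14 shows the Preprocessor constructor, taken from the Preprocessor.h header.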

Listing 14. Constructing a Preprocessor object

Preprocessor(DiagnosticsEngine &diags, LangOptions &opts,
               const TargetInfo *target,
               SourceManager &SM, HeaderSearch &Headers,
               ModuleLoader &TheModuleLoader,
               IdentifierInfoLookup *IILookup = 0,
               bool OwnsHeaderSearch = false,
               bool DelayInitialization = false);

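Looking at the constructor, it is clear that you need to create six different objects to get this preprocessor working. You already know about DiagnosticsEngine, TargetInfo, and SourceManager, and CompilerInstance derives from ModuleLoader. That leaves two new objects to create: one for LangOptions and one for HeaderSearch. The LangOptions class lets you compile a range of C/C++ dialects, including C99, C11, and C++0x; refer to the LangOptions.h and LangOptions.def headers for more information. Finally, the HeaderSearch class stores a std::vector of directories to search, among other things. Listing 15 shows the code for the preprocessor.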

Listing 15. Manually created preprocessor

using namespace clang;
int main()  {
    DiagnosticOptions diagnosticOptions;
    TextDiagnosticPrinter *printer =
      new TextDiagnosticPrinter(llvm::outs(), diagnosticOptions);
    llvm::IntrusiveRefCntPtr<clang::DiagnosticIDs> diagIDs;
    DiagnosticsEngine diagnostics(diagIDs, printer);
    LangOptions langOpts;
    clang::TargetOptions to;
    to.Triple = llvm::sys::getDefaultTargetTriple();
    TargetInfo *pti = TargetInfo::CreateTargetInfo(diagnostics, to);
    FileSystemOptions fsopts;
    FileManager fileManager(fsopts);
    SourceManager sourceManager(diagnostics, fileManager);
    HeaderSearch headerSearch(fileManager, diagnostics, langOpts, pti);
    CompilerInstance ci;
    Preprocessor preprocessor(diagnostics, langOpts, pti,
      sourceManager, headerSearch, ci);
    const FileEntry *pFile = fileManager.getFile("test.c");
    sourceManager.createMainFileID(pFile);
    preprocessor.EnterMainSourceFile();
    printer->BeginSourceFile(langOpts, &preprocessor);
    // … similar to Listing 8 here on
}

Note the following points about the code in Listing 15:

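You did not initialize HeaderSearch to point to any specific directories, although you should.
The clang API requires the TextDiagnosticPrinter to be allocated on the heap. Allocating it on the stack causes a crash.
You still cannot get rid of CompilerInstance. And since you are using CompilerInstance anyway, why bother creating the preprocessor by hand instead of using the more comfortable clang API?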

Language of choice: C++

So far you have been working with C test code: how about some C++? Add langOpts.CPlusPlus = 1; to the code in Listing 15, then try it with the test code in Listing 16.

Listing 16. Using C++ test code with the preprocessor

template <typename T, int n>
struct s {
  T array[n];
};
int main() {
  s<int, 20> var;
}

Listing 17. Partial preprocessor output for the code in Listing 16

identifier 'template'
less '<'
identifier 'typename'
identifier 'T'
comma ','
int 'int'
identifier 'n'
greater '>'
struct 'struct'
identifier 's'
l_brace '{'
identifier 'T'
identifier 'array'
l_square '['
identifier 'n'
r_square ']'
semi ';'
r_brace '}'
semi ';'
int 'int'
identifier 'main'
l_paren '('
r_paren ')'

Creating a parse tree

The ParseAST method, defined in clang/Parse/ParseAST.h, is one of the most important methods clang provides. Here is the declaration of the routine, copied from ParseAST.h:

void ParseAST(Preprocessor &pp, ASTConsumer *C,
       ASTContext &Ctx, bool PrintStats = false,
       TranslationUnitKind TUKind = TU_Complete,
       CodeCompleteConsumer *CompletionConsumer = 0);

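ASTConsumer gives you an abstract interface to derive from. That is the right approach, because different clients are likely to dump or process the AST in different ways. Your client code will derive from ASTConsumer. The ASTContext class stores, among other things, information about type declarations. The simplest thing to try is to print the list of global variables in your code using the clang ASTConsumer API. Many technology companies have very strict rules about the use of global variables in C++ code, so this makes a good starting point for a custom lint tool. Listing 18 provides the code for the custom consumer.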

Listing 18. Custom AST consumer class

class CustomASTConsumer : public ASTConsumer {
public:
    CustomASTConsumer() : ASTConsumer() { }
    virtual ~CustomASTConsumer() { }
    virtual bool HandleTopLevelDecl(DeclGroupRef decls)
    {
        clang::DeclGroupRef::iterator it;
        for (it = decls.begin(); it != decls.end(); it++)
        {
            // dyn_cast returns null when the declaration is not a VarDecl
            clang::VarDecl *vd = llvm::dyn_cast<clang::VarDecl>(*it);
            if (vd)
                std::cout << vd->getDeclName().getAsString() << std::endl;
        }
        return true;
    }
};

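You override the HandleTopLevelDecl method (originally provided in ASTConsumer) with your own version. Clang passes each group of top-level declarations to you; you iterate over the group and print the names of the variables in it. Listing 19, excerpted from ASTConsumer.h, shows some of the other methods that client consumer code can override.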

Listing 19. Other methods that client code can override

/// HandleInterestingDecl - Handle the specified interesting declaration. This
/// is called by the AST reader when deserializing things that might interest
/// the consumer. The default implementation forwards to HandleTopLevelDecl.
virtual void HandleInterestingDecl(DeclGroupRef D);

/// HandleTranslationUnit - This method is called when the ASTs for entire
/// translation unit have been parsed.
virtual void HandleTranslationUnit(ASTContext &Ctx) {}

/// HandleTagDeclDefinition - This callback is invoked each time a TagDecl
/// (e.g. struct, union, enum, class) is completed.  This allows the client to
/// hack on the type, which can occur at any point in the file (because these
/// can be defined in declspecs).
virtual void HandleTagDeclDefinition(TagDecl *D) {}

/// Note that at this point it does not have a body, its body is
/// instantiated at the end of the translation unit and passed to
/// HandleTopLevelDecl.
virtual void HandleCXXImplicitFunctionInstantiation(FunctionDecl *D) {}
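The callback arrangement behind Listing 18 (a parser that pushes groups of top-level declarations into a client-supplied consumer) can be sketched in plain C++ without clang. Everything in the block below is a hypothetical stand-in invented for illustration: ToyDecl, ToyConsumer, GlobalCollector, and toyParse are not clang types, and the kind check merely plays the role that dyn_cast<VarDecl> plays in the real code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for clang::Decl and its subclasses; only what the sketch needs.
struct ToyDecl {
  enum Kind { Var, Function, Typedef };
  Kind kind;
  std::string name;
};

// Stand-in for clang::ASTConsumer: the parser calls back once per group of
// top-level declarations.
class ToyConsumer {
public:
  virtual ~ToyConsumer() {}
  // In this sketch, returning false asks the driver to stop early.
  virtual bool handleTopLevelDecl(const std::vector<ToyDecl> &) { return true; }
};

// Collects the names of global variables instead of printing them, which
// keeps the result easy to inspect.
class GlobalCollector : public ToyConsumer {
public:
  std::vector<std::string> globals;

  bool handleTopLevelDecl(const std::vector<ToyDecl> &group) override {
    for (const ToyDecl &d : group)
      if (d.kind == ToyDecl::Var) // plays the role of dyn_cast<VarDecl>
        globals.push_back(d.name);
    return true;
  }
};

// Hypothetical driver in the role of ParseAST: feeds each group to the
// consumer until the groups run out or the consumer returns false.
void toyParse(const std::vector<std::vector<ToyDecl> > &groups,
              ToyConsumer *consumer) {
  assert(consumer && "a consumer is required");
  for (const std::vector<ToyDecl> &g : groups)
    if (!consumer->handleTopLevelDecl(g))
      break;
}
```

A lint-style client would pass a GlobalCollector to toyParse and then inspect its globals member, which is the same division of labour as between ParseAST and CustomASTConsumer.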

Finally, Listing 20 shows the actual client code that uses the custom AST consumer class you developed.

Listing 20. Client code using the custom AST consumer

int main() {
    CompilerInstance ci;
    ci.createDiagnostics(0,NULL);
    TargetOptions to;
    to.Triple = llvm::sys::getDefaultTargetTriple();
    TargetInfo *tin = TargetInfo::CreateTargetInfo(ci.getDiagnostics(), to);
    ci.setTarget(tin);
    ci.createFileManager();
    ci.createSourceManager(ci.getFileManager());
    ci.createPreprocessor();
    ci.createASTContext();
    CustomASTConsumer *astConsumer = new CustomASTConsumer ();
    ci.setASTConsumer(astConsumer);
    const FileEntry *file = ci.getFileManager().getFile("hello.c");
    ci.getSourceManager().createMainFileID(file);
    ci.getDiagnosticClient().BeginSourceFile(
       ci.getLangOpts(), &ci.getPreprocessor());
    clang::ParseAST(ci.getPreprocessor(), astConsumer, ci.getASTContext());
    ci.getDiagnosticClient().EndSourceFile();
    return 0;
}