Spark SQL Code Generation

Volcano Model

Pros

Simple and clean: operators run as a pipeline, are independent of one another, and are easy to extend

Cons

Too many virtual function calls

Poor code locality & complex book-keeping

Not SIMD-friendly

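Virtual functions are a C/C++ concept, used mainly to implement polymorphism. In Java, every ordinary method is virtual unless it is declared final or private. As for why virtual calls are so costly, there are roughly two reasons: first, a virtual call takes more machine instructions; second, it is unfriendly to the CPU cache.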
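To make the per-row call overhead concrete, here is a minimal sketch of a Volcano-style pipeline next to the hand-written loop that answers the same query. The `Operator` interface and the `Scan`, `Filter`, and `VolcanoDemo` classes are hypothetical names chosen for illustration; they are not Spark's classes.

```java
import java.util.function.IntPredicate;

class VolcanoDemo {
    // SELECT count(*) FROM data WHERE value > threshold, Volcano style:
    // the root pulls rows one at a time through a chain of next() calls.
    static long countVolcano(int[] data, int threshold) {
        Operator plan = new Filter(new Scan(data), v -> v > threshold);
        long count = 0;
        while (plan.next() != null) {
            count++;
        }
        return count;
    }

    // The same query as a single hand-written loop: no interface calls,
    // no boxing, and the current value can stay in a register.
    static long countFused(int[] data, int threshold) {
        long count = 0;
        for (int v : data) {
            if (v > threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] data = {1, 5, 8, 12, 3, 20};
        System.out.println(countVolcano(data, 4));
        System.out.println(countFused(data, 4));
    }
}

// Volcano-style iterator: every operator exposes next().
interface Operator {
    /** Returns the next row, or null when the input is exhausted. */
    Integer next();
}

// Leaf operator: scans an in-memory array.
class Scan implements Operator {
    private final int[] data;
    private int pos = 0;

    Scan(int[] data) {
        this.data = data;
    }

    @Override
    public Integer next() {
        return pos < data.length ? data[pos++] : null;
    }
}

// Filter: forwards only the rows that satisfy the predicate.
class Filter implements Operator {
    private final Operator child;
    private final IntPredicate predicate;

    Filter(Operator child, IntPredicate predicate) {
        this.child = child;
        this.predicate = predicate;
    }

    @Override
    public Integer next() {
        Integer row;
        while ((row = child.next()) != null) {   // one interface call per row
            if (predicate.test(row)) {           // and another one per row
                return row;
            }
        }
        return null;
    }
}
```

Both methods return the same count; the difference is only in how many dynamically dispatched calls sit on the per-row path, which is the cost the cons above describe.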

Optimize Volcano Model

(http://www.vldb.org/pvldb/vol11/p2209-kersten.pdf)

Vectorized tuple processing -> (DB2 BLU, columnar SQL Server, QuickStep)

Query compilation (data-centric code generation) -> (Apache Spark, Peloton)

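Vectorized execution has the edge on memory-bound queries, while code generation has the edge on calculation-heavy queries. Overall, though, in OLAP scenarios the two approaches deliver similar performance.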
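The two styles can be contrasted on one small query, roughly `SELECT sum(price * qty) WHERE qty > minQty`. This is a minimal sketch under assumptions: the class name, the primitive names, and the batch size of 1024 are all illustrative and not taken from any of the systems listed above.

```java
class VectorizedVsCompiled {
    static final int BATCH = 1024; // assumed vector size, for illustration only

    // Primitive 1: record the positions of qualifying rows in sel, return how many.
    static int selectGreaterThan(int[] col, int from, int n, int bound, int[] sel) {
        int k = 0;
        for (int i = 0; i < n; i++) {
            if (col[from + i] > bound) {
                sel[k++] = from + i;
            }
        }
        return k;
    }

    // Primitive 2: multiply two columns at the selected positions into out.
    static void multiply(int[] a, int[] b, int[] sel, int k, long[] out) {
        for (int i = 0; i < k; i++) {
            out[i] = (long) a[sel[i]] * b[sel[i]];
        }
    }

    // Primitive 3: sum the first k entries of a vector.
    static long sum(long[] v, int k) {
        long total = 0;
        for (int i = 0; i < k; i++) {
            total += v[i];
        }
        return total;
    }

    // Vectorized style: each operator is a tight loop over a batch, and
    // intermediate results are materialized in small arrays between operators.
    static long vectorized(int[] price, int[] qty, int minQty) {
        int[] sel = new int[BATCH];
        long[] tmp = new long[BATCH];
        long total = 0;
        for (int from = 0; from < qty.length; from += BATCH) {
            int n = Math.min(BATCH, qty.length - from);
            int k = selectGreaterThan(qty, from, n, minQty, sel);
            multiply(price, qty, sel, k, tmp);
            total += sum(tmp, k);
        }
        return total;
    }

    // Compiled (data-centric) style: all operators are fused into one loop,
    // so a row never leaves the loop body between filter, multiply, and sum.
    static long compiled(int[] price, int[] qty, int minQty) {
        long total = 0;
        for (int i = 0; i < qty.length; i++) {
            if (qty[i] > minQty) {
                total += (long) price[i] * qty[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] price = {10, 20, 30, 40};
        int[] qty = {1, 2, 3, 4};
        System.out.println(vectorized(price, qty, 2));
        System.out.println(compiled(price, qty, 2));
    }
}
```

The vectorized version pays for materializing `sel` and `tmp` but keeps each loop simple; the fused version avoids materialization but puts all the work in one loop body.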

Vectorized vs query compilation

Vectorized vs compiled

Query compilation evolution (https://zhuanlan.zhihu.com/p/60965109)

Compared with the original Volcano model, the generated code is closer to machine code and friendlier to the hardware

Relaxed Operator Fusion for In-Memory Databases

(http://www.vldb.org/pvldb/vol11/p1-menon.pdf)

How to compile

System R (machine code) -> hard-coded

HyPer (LLVM IR) -> the LLVM compiler infrastructure project is a set of compiler and toolchain technologies

Spark SQL (JVM bytecode) -> generates Java source code -> compiles it to bytecode with Janino
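The generate-source-then-compile flow described for Spark SQL can be sketched without third-party dependencies. Note the assumptions: this sketch uses the JDK's own `javax.tools.JavaCompiler` as a stand-in for Janino, and it compiles through a temporary directory rather than in memory. The names `SourceToBytecodeSketch`, `GeneratedPredicate`, and `eval` are hypothetical.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class SourceToBytecodeSketch {

    // Step 1: generate Java source text for the predicate "value > threshold".
    static String generateSource(String className, int threshold) {
        return "public class " + className + " {\n"
             + "  public static boolean eval(int value) {\n"
             + "    return value > " + threshold + ";\n"
             + "  }\n"
             + "}\n";
    }

    // Step 2: compile the source to bytecode and load the class into this JVM.
    static Method compileAndLoad(String className, String source) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler; run on a JDK");
        }
        Path dir = Files.createTempDirectory("codegen");
        Path file = dir.resolve(className + ".java");
        Files.write(file, source.getBytes(StandardCharsets.UTF_8));
        int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
        if (status != 0) {
            throw new IllegalStateException("Compilation of generated code failed");
        }
        // The loader is left open on purpose so the loaded class stays usable.
        URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() });
        return loader.loadClass(className).getMethod("eval", int.class);
    }

    // Step 3: execute the freshly compiled code in the same JVM.
    static boolean evaluate(int threshold, int value) {
        try {
            String className = "GeneratedPredicate";
            Method eval = compileAndLoad(className, generateSource(className, threshold));
            return (Boolean) eval.invoke(null, value);
        } catch (Exception e) {
            throw new IllegalStateException("Codegen pipeline failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(10, 42));
        System.out.println(evaluate(10, 7));
    }
}
```

The threshold is baked into the generated source as a literal, which is the point of the approach: the compiled method contains only the work this particular query needs.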

Efficiently Compiling Efficient Query Plans for Modern Hardware

https://www.vldb.org/pvldb/vol4/p539-neuman.pdf

Spark SQL code generation

https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

Project Tungsten

Eliminate CPU and memory bottlenecks

1. Memory management and binary processing (to tackle both object overhead and GC inefficiency)

2. Cache-aware computation (designing cache-friendly algorithms and data structures so that Spark applications spend less time waiting to fetch data from memory and more time doing useful work)

3. Code generation
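The first two points can be illustrated with a small sketch that stores rows as raw bytes instead of Java objects. This is an illustration under assumptions, not Tungsten's implementation: it uses a direct `ByteBuffer` and a made-up fixed-width layout (an 8-byte id followed by an 8-byte amount), and the class name `BinaryRowsSketch` is hypothetical.

```java
import java.nio.ByteBuffer;

class BinaryRowsSketch {
    // Assumed layout per row: 8-byte long id, then 8-byte double amount.
    static final int ROW_SIZE = Long.BYTES + Double.BYTES;

    private final ByteBuffer buf;
    private int rows = 0;

    BinaryRowsSketch(int capacity) {
        // A direct buffer keeps the row bytes outside the Java heap, so there
        // are no per-row objects for the garbage collector to track.
        buf = ByteBuffer.allocateDirect(capacity * ROW_SIZE);
    }

    // Appends one row by writing its fields at fixed offsets.
    void append(long id, double amount) {
        int base = rows * ROW_SIZE;
        buf.putLong(base, id);
        buf.putDouble(base + Long.BYTES, amount);
        rows++;
    }

    long id(int row) {
        return buf.getLong(row * ROW_SIZE);
    }

    double amount(int row) {
        return buf.getDouble(row * ROW_SIZE + Long.BYTES);
    }

    // Rows are contiguous, so this scan walks memory sequentially,
    // which is the access pattern CPU caches handle best.
    double totalAmount() {
        double total = 0;
        for (int r = 0; r < rows; r++) {
            total += amount(r);
        }
        return total;
    }

    public static void main(String[] args) {
        BinaryRowsSketch table = new BinaryRowsSketch(4);
        table.append(1L, 2.5);
        table.append(2L, 4.0);
        System.out.println(table.id(1) + " " + table.totalAmount());
    }
}
```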

Janino compiler

Like javac, Janino can compile a set of source files into a set of class files. It can also compile a Java expression, a block, a class body, a single .java file, or a set of .java files in memory, load the resulting bytecode, and execute it directly in the same JVM

Expression code generation

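CodegenContext: records the elements of the code being generated, such as variables and functions

CodeGenerator: a base class that exposes the code-generation entry point generate. It has seven related implementations; GeneratePredicate, for example, implements codegen for predicates

Taking FilterExec as an example: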

FilterExec.doExecute -> GeneratePredicate.generate -> GeneratePredicate.create
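The idea behind this call chain can be sketched in miniature: each expression node emits a fragment of Java source, a context object hands out unique variable names, and a generator wraps the fragments into a predicate method. Everything below (`Ctx`, `Expr`, `Column`, `Literal`, `GreaterThan`, `generatePredicate`, and the `Row` type named in the emitted text) is a hypothetical simplification, not Spark's actual API.

```java
import java.util.ArrayList;
import java.util.List;

class ExprCodegenSketch {

    // Hypothetical stand-in for a codegen context: hands out unique variable
    // names and collects the statements emitted so far.
    static class Ctx {
        private int counter = 0;
        final List<String> statements = new ArrayList<>();

        String freshName(String prefix) {
            return prefix + "_" + counter++;
        }
    }

    // Each node emits code computing its value and returns the Java
    // expression (a variable name or a literal) that holds the result.
    interface Expr {
        String genCode(Ctx ctx);
    }

    static class Column implements Expr {
        final int ordinal;

        Column(int ordinal) {
            this.ordinal = ordinal;
        }

        @Override
        public String genCode(Ctx ctx) {
            String v = ctx.freshName("value");
            ctx.statements.add("int " + v + " = row.getInt(" + ordinal + ");");
            return v;
        }
    }

    static class Literal implements Expr {
        final int value;

        Literal(int value) {
            this.value = value;
        }

        @Override
        public String genCode(Ctx ctx) {
            return Integer.toString(value); // constants are inlined directly
        }
    }

    static class GreaterThan implements Expr {
        final Expr left;
        final Expr right;

        GreaterThan(Expr left, Expr right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public String genCode(Ctx ctx) {
            String l = left.genCode(ctx);
            String r = right.genCode(ctx);
            String v = ctx.freshName("isGreater");
            ctx.statements.add("boolean " + v + " = " + l + " > " + r + ";");
            return v;
        }
    }

    // Wraps the emitted statements into the source of a predicate method.
    static String generatePredicate(Expr condition) {
        Ctx ctx = new Ctx();
        String result = condition.genCode(ctx);
        StringBuilder sb = new StringBuilder("boolean eval(Row row) {\n");
        for (String s : ctx.statements) {
            sb.append("  ").append(s).append('\n');
        }
        sb.append("  return ").append(result).append(";\n}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(generatePredicate(new GreaterThan(new Column(0), new Literal(10))));
    }
}
```

The tree walk happens once, at generation time; the emitted method is straight-line code with no expression-tree traversal left in it.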

Whole-stage code generation

collapseCodegenStages

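1. Operators that support codegen are pipelined together, and a WholeStageCodegenExec is added around them as the outer node

2. An InputAdapter is added on top of each operator that does not support codegen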
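The two rules above amount to a pair of mutually recursive tree rewrites. The following is a minimal sketch on a toy plan tree; the `Plan` class, its `supportsCodegen` flag, and the operator names in `main` are assumptions made for illustration, and the real rule works on Spark's physical plan nodes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

class CollapseStagesSketch {

    // Toy plan node; the flag is ignored for the wrapper nodes added below.
    static class Plan {
        final String name;
        final boolean supportsCodegen;
        final List<Plan> children;

        Plan(String name, boolean supportsCodegen, Plan... children) {
            this.name = name;
            this.supportsCodegen = supportsCodegen;
            this.children = Arrays.asList(children);
        }
    }

    // Outside a stage: the first codegen-capable operator starts a new stage.
    static Plan insertWholeStageCodegen(Plan plan) {
        if (plan.supportsCodegen) {
            return new Plan("WholeStageCodegen", false, insertInputAdapter(plan));
        }
        return mapChildren(plan, CollapseStagesSketch::insertWholeStageCodegen);
    }

    // Inside a stage: keep fusing while operators support codegen; otherwise
    // cut the stage with an InputAdapter and look for new stages underneath.
    static Plan insertInputAdapter(Plan plan) {
        if (!plan.supportsCodegen) {
            return new Plan("InputAdapter", false, insertWholeStageCodegen(plan));
        }
        return mapChildren(plan, CollapseStagesSketch::insertInputAdapter);
    }

    // Rebuilds a node with each child rewritten by f.
    static Plan mapChildren(Plan plan, UnaryOperator<Plan> f) {
        Plan[] mapped = new Plan[plan.children.size()];
        for (int i = 0; i < mapped.length; i++) {
            mapped[i] = f.apply(plan.children.get(i));
        }
        return new Plan(plan.name, plan.supportsCodegen, mapped);
    }

    // Renders a plan as Name(child, child, ...).
    static String render(Plan p) {
        if (p.children.isEmpty()) {
            return p.name;
        }
        StringBuilder sb = new StringBuilder(p.name).append('(');
        for (int i = 0; i < p.children.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(render(p.children.get(i)));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        Plan plan = new Plan("Project", true,
            new Plan("Filter", true,
                new Plan("Exchange", false,
                    new Plan("Scan", true))));
        System.out.println(render(insertWholeStageCodegen(plan)));
    }
}
```

In this toy plan, Project and Filter end up in one stage, the Exchange cuts it, and the Scan below starts a stage of its own.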

CodegenSupport

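1. Operators that support whole-stage codegen must implement this interface

2. CodegenSupport mainly consists of two pairs of methods: consume/doConsume and produce/doProduce.

Both consume and produce are final. The difference is that produce calls the operator's own doProduce method, while consume calls the parent node's doConsume method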
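The produce/consume handshake can be sketched with a toy operator hierarchy that emits source text. This is a simplification under assumptions: the `Op`, `ScanOp`, `FilterOp`, and `CountOp` classes are hypothetical, and the method signatures are reduced to the essentials rather than copied from Spark.

```java
class ProduceConsumeSketch {

    static abstract class Op {
        private Op parent;

        // final: remembers who asked, then runs this operator's own doProduce.
        final String produce(Op parent) {
            this.parent = parent;
            return doProduce();
        }

        // final: hands the current row to the PARENT's doConsume.
        final String consume(String rowVar) {
            return parent.doConsume(rowVar);
        }

        abstract String doProduce();

        abstract String doConsume(String rowVar);
    }

    // Leaf: owns the loop and pushes each row up through consume.
    static class ScanOp extends Op {
        @Override
        String doProduce() {
            return "for (int row : input) {\n" + indent(consume("row")) + "}\n";
        }

        @Override
        String doConsume(String rowVar) {
            throw new UnsupportedOperationException("a leaf has no child to consume from");
        }
    }

    // Filter: emits no loop of its own, only an if around the parent's code.
    static class FilterOp extends Op {
        private final Op child;
        private final String condition;

        FilterOp(Op child, String condition) {
            this.child = child;
            this.condition = condition;
        }

        @Override
        String doProduce() {
            return child.produce(this);
        }

        @Override
        String doConsume(String rowVar) {
            return "if (" + rowVar + " " + condition + ") {\n" + indent(consume(rowVar)) + "}\n";
        }
    }

    // Root of the stage: declares state, then counts every row it receives.
    static class CountOp extends Op {
        private final Op child;

        CountOp(Op child) {
            this.child = child;
        }

        @Override
        String doProduce() {
            return "long count = 0;\n" + child.produce(this);
        }

        @Override
        String doConsume(String rowVar) {
            return "count++;\n";
        }
    }

    // Indents every line of a code fragment by two spaces.
    static String indent(String code) {
        StringBuilder sb = new StringBuilder();
        for (String line : code.split("\n")) {
            sb.append("  ").append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Op plan = new CountOp(new FilterOp(new ScanOp(), "> 10"));
        System.out.print(plan.produce(null));
    }
}
```

Calls to produce travel down the tree and calls to consume travel back up, so the three operators collapse into one loop with the filter and the counter inlined in its body.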

Document information

Author: Jessica
Link: https://jessica0530.github.io/2021/03/08/Spark-SQL-CodeGeneration/
License: free to repost, non-commercial, no derivatives, attribution required (Creative Commons 3.0 license)
