Calcite 组件和关键词介绍

2020/08/18 Calcite 共 4122 字,约 12 分钟

calcite-stages-of-query-execution

COMPONENTS

Catalog

包括 schema,Table,RelDataType,Statistic

TableStatistic :rowCount行数,和ColumnStat

ColumnStatistic : distinct values ndv,nullCount,avgLen,maxLen,maxValue,minValue

calcite-catalog

SQL parser

1.将有效的SQL查询解析为抽象语法树(AST)

2.用JavaCC编写的LL(k)解析器

3.Input queries被解析为抽象语法树(AST)

4.Tokens在Calcite中由SqlNode表示

5.SqlNode也可以通过unparse方法转换回SQL字符串

JavaCC

1.Java编译器

2.创建于1996年的Sun Microsystem

3.从特定域的语言生成Java代码

4.ANTLR是用于Hive和Drill等项目的替代品

calcite-JavaCC

SQLNode

SqlNode 表示抽象语法树中的元素

calcite-sqlnode

SQL validator

根据Catalog提供的Metadata信息验证抽象语法树

Query optimizer

将AST转换为logical plans,优化 logical plans,并将 logical expressions 转换为 physical plans

Query Plan

Query plans represent the steps necessary to execute a query

calcite-query-plan

Query Optimization

Optimize logical plan Goal is typically to try to reduce the amount of data that must be processed early in the plan Convert logical plan into a physical plan Physical plan is engine specific and represents the physical execution stages

Prune unused fields Merge projections Convert subqueries to joins Reorder joins Push down projections Push down filters

calcite-query-opt

SQL generator

Converts physical plans to SQL

关键词

Relational algebra (RelNode)

RelNode 表示 a relational expression 是处理 data的行为

比如 Aggregate,Sort,Join,Project,Filter 等

RelNode 根据上游的Input数量,抽象出 3类,SingleRel,BiRel,SetOp

RelNode类型具体RelNode
SingleRel (input只有一个)Aggregate,Calc,Scan
BiRel (input 有2个 left,right)Join
SetOp (input有多个)Union, Minus(except) ,Intersect

Logical algebra

比如LogicalAggregate

Physical algebra

Row expressions (RexNode)

RexNode 表示 row级别的 表达式 ,在 validate之后生成 所以有该row的type类型

 {@link RexLiteral} (constant value),
* {@link RexVariable} (variable), {@link RexCall} (call to operator with
* operands). Expressions are generally created using a {@link RexBuilder}
* factory.</p>

RexNode通过 RexBuilder Factory生成

RexInputRefInput column Reference
RexLiteral常量
RexFieldAccessSELECT emp.empno FROM emp
RexCallFunction call ,binary, unary, functions, special syntactic constructs * like CASE … WHEN … END
RexVariablevariable

Traits (RelTrait)

表示 RelNode的 Trait,并不会 改变执行,主要用于 验证 plan output,Flink Retract用到 此Trait 3个重要的 trait types: Convention RelCollation 排序 ASC DESC RelDistribution hash,range,random,boardcast

举例 FlinkRelDistributionTraitDef SINGLETON(“single”), HASH_DISTRIBUTED(“hash”), RANGE_DISTRIBUTED(“range”), RANDOM_DISTRIBUTED(“random”), ROUND_ROBIN_DISTRIBUTED(“rr”), BROADCAST_DISTRIBUTED(“broadcast”), ANY(“any”);

比如在优化算子里面要满足是按key 进行hash Distribution

private def isInputSatisfyRequiredDistribution(input: RelNode, keys: Array[Int]): Boolean = {
    val requiredDistribution = createDistribution(keys)
    val inputDistribution = input.getTraitSet.getTrait(FlinkRelDistributionTraitDef.INSTANCE)
    inputDistribution.satisfies(requiredDistribution)
  }

Conventions (Convention)

Convention is a type of RelTrait

A Convention is associated with a RelNode interface

SparkConvention, JdbcConvention, EnumerableConvention, etc

Conventions are used to represent a single data source

Inputs to a relational expression must be in the same convention

METADATA

   
Selectivity  
UniqueKeys  
RowCount  
DistinctRowCount  
PercentageOriginalRows  
ColumnUniqueness  
ColumnOrigin  
Predicates  
Collation  
Distribution  
Size  
Parallelism  
Memory  
AllPredicates  
ExpressionLineage  
TableReferences  

Rules (RelOptRule)

Rules 用于修改 query plans

一共有2种类型:converters and transformers Rules

Converter rules implement Converter and convert from one convention to another

Rules are matched to elements of a query plan using pattern matching

onMatch is called for matched rules

Converter rules applied via convert

Converter Rule

calcite-converter-rule

Pattern Match

calcite-pattern-match

Planners (RelOptPlanner)

Planners implement the RelOptPlanner interface

Two types of planners: HepPlanner VolcanoPlanner

Heuristic Optimization

HepPlanner is a heuristic optimizer similar to Spark’s optimizer Applies all matching rules until none can be applied Heuristic optimization is faster than cost- based optimization Risk of infinite recursion if rules make opposing changes to the plan

Cost-based Optimization

VolcanoPlanner is a cost-based optimizer Applies matching rules iteratively, selecting the plan with the cheapest cost on each iteration Costs are provided by relational expressions Not all possible plans can be computed Stops optimization when the cost does not significantly improve through a determinable number of iterations

Cost is provided by each RelNode Cost is represented by RelOptCost Cost typically includes row count, I/O, and CPU cost Cost estimates are relative Statistics are used to improve accuracy of cost estimations Calcite provides utilities for computing various resource-related statistics for use in cost estimations

calcite-cbo

Programs (Program)

文档信息

Search

    Table of Contents