This article discusses one of the core issues during the SQL compilation phase in ClickHouse, specifically how expression evaluation is performed.
1. What Problem Does ActionsChain Solve?
The execution of SQL is carried out in an orderly manner according to the steps in the execution plan. For example, the execution order of the following SQL steps is: FROM -> WHERE -> SELECT -> ORDER BY
. Each step involves some expressions, and the data corresponding to the expressions generated by each downstream step will serve as the input for the upstream steps. So, which expressions does the downstream step need to output to the upstream step? And how can this process be optimized?
ActionsChain is designed to address this issue.
In an SQL query, expressions are generated not only in the SELECT
clause but also in other steps. For example, in the following SQL query, the WHERE
clause generates four expressions: b > 1
, b + c
, b + c > 2
, and b > 1 AND b + c > 2
.
query_1: select a + 1, a + 2, b > 1, b + c from t where b > 1 and b + c > 2 order by a + 1;
Issues in the Execution Process of the Above SQL:
- When the
WHERE
clause outputs data to theSELECT
clause, should it output the expressionsa
,b
, andc
, or the four expressions it generated, or all of them? - How can the
WHERE
clause minimize the amount of data it outputs to theSELECT
clause? For example, if there are two expressions related to columna
in theSELECT
clause, does theWHERE
clause need to output columna
twice? - Does the
WHERE
clause need to output columnc
to theSELECT
clause?
ActionsChain is designed to manage the computation of expressions across all steps of the query plan and to optimize the expression output at each step. The goal and principle of this optimization are to minimize the amount of data output upwards. In the following analysis of ActionsChain, these questions will be answered one by one.
P.S.: You can use the command EXPLAIN actions=1 SQL
to view the actions between each step in the execution plan.
2. ActionsChain Structure Overview
ActionsChain is a Directed Acyclic Graph (DAG) structure composed of ActionsDAGs. The output of each step serves as the input for the next step. Each ActionsDAG is a DAG formed from a list of expressions.
For the FROM
step, since it is the initial step, there is no input; its output is a list of all the original columns used in the entire SQL query. In practice, we often ignore the FROM
step and start calculations directly from the WHERE
step. The ActionsChain for a simple query is shown below.
query_2: select a, b + 1 from t wher a > 0
3. ActionsChain Construction Process
The diagram above shows the final state of the ActionsChain structure. Here, we introduce the ActionsChain construction process, which mainly includes two steps: construction and pruning. Construction refers to generating the ActionsChain structure, during which all expressions are retained and output to downstream. This means that the generated ActionsChain contains many unnecessary steps. Therefore, pruning is needed. Pruning essentially involves removing expressions that the downstream does not require from the output list, thereby minimizing the data output to downstream as much as possible.
query_3: select a+1, b from t where a > 1;
We can see that all nodes in the WHERE
clause are output to the SELECT
clause. Specifically, there are three columns of data that need to be output to the SELECT
clause: a
, b
, and greater(a, 1)
. However, greater(a, 1)
is not needed, so it needs to be pruned. The ActionsChain after pruning is as follows:
4. Expression Reuse
1.If an expression has already been computed downstream, the upstream can directly reuse the result of that expression. The corresponding ActionsChain is as follows:
query_4: select a+1, b from t where a + 1 > 1
For the expression “a + 1”, once it is computed in the WHERE
clause, it will not be recomputed in the later projection stage. Instead, it can be directly obtained from the downstream, thereby improving query efficiency.
2.If the upstream requires multiple instances of the same expression, the downstream only needs to output it once.
query_5: select a+1, a+1 from t where a+1 > 1;
ActionsChain coordinates the expression calculations at each step throughout the entire query. Following the principles of outputting the minimal amount of data upstream and prioritizing the output of highly computed expressions, it significantly reduces data transmission and computation. This greatly enhances the performance of ClickHouse.
This work is licensed under a Creative Commons Attribution 4.0 International License. When redistributing, please include the original link.