分类
Articles

The Inside Story of CickHouse (2) Expression Evaluation: ActionsChain

This article discusses one of the core issues during the SQL compilation phase in ClickHouse, specifically how expression evaluation is performed.

1. What Problem Does ActionsChain Solve?

The execution of SQL is carried out in an orderly manner according to the steps in the execution plan. For example, the execution order of the following SQL steps is: FROM -> WHERE -> SELECT -> ORDER BY. Each step involves some expressions, and the data corresponding to the expressions generated by each downstream step will serve as the input for the upstream steps. So, which expressions does the downstream step need to output to the upstream step? And how can this process be optimized?

ActionsChain is designed to address this issue.

In an SQL query, expressions are generated not only in the SELECT clause but also in other steps. For example, in the following SQL query, the WHERE clause generates four expressions: b > 1b + cb + c > 2, and b > 1 AND b + c > 2.

query_1: select a + 1, a + 2, b > 1, b + c from t where b > 1 and b + c > 2 order by a + 1;

Issues in the Execution Process of the Above SQL:

  1. When the WHERE clause outputs data to the SELECT clause, should it output the expressions ab, and c, or the four expressions it generated, or all of them?
  2. How can the WHERE clause minimize the amount of data it outputs to the SELECT clause? For example, if there are two expressions related to column a in the SELECT clause, does the WHERE clause need to output column a twice?
  3. Does the WHERE clause need to output column c to the SELECT clause?

ActionsChain is designed to manage the computation of expressions across all steps of the query plan and to optimize the expression output at each step. The goal and principle of this optimization are to minimize the amount of data output upwards. In the following analysis of ActionsChain, these questions will be answered one by one.

P.S.: You can use the command EXPLAIN actions=1 SQL to view the actions between each step in the execution plan.

2. ActionsChain Structure Overview

ActionsChain is a Directed Acyclic Graph (DAG) structure composed of ActionsDAGs. The output of each step serves as the input for the next step. Each ActionsDAG is a DAG formed from a list of expressions.

For the FROM step, since it is the initial step, there is no input; its output is a list of all the original columns used in the entire SQL query. In practice, we often ignore the FROM step and start calculations directly from the WHERE step. The ActionsChain for a simple query is shown below.

query_2: select a, b + 1 from t wher a > 0

3. ActionsChain Construction Process

The diagram above shows the final state of the ActionsChain structure. Here, we introduce the ActionsChain construction process, which mainly includes two steps: construction and pruning. Construction refers to generating the ActionsChain structure, during which all expressions are retained and output to downstream. This means that the generated ActionsChain contains many unnecessary steps. Therefore, pruning is needed. Pruning essentially involves removing expressions that the downstream does not require from the output list, thereby minimizing the data output to downstream as much as possible.

query_3: select a+1, b from t where a > 1;

We can see that all nodes in the WHERE clause are output to the SELECT clause. Specifically, there are three columns of data that need to be output to the SELECT clause: ab, and greater(a, 1). However, greater(a, 1) is not needed, so it needs to be pruned. The ActionsChain after pruning is as follows:

4. Expression Reuse

1.If an expression has already been computed downstream, the upstream can directly reuse the result of that expression. The corresponding ActionsChain is as follows:

query_4: select a+1, b from t where a + 1 > 1

For the expression “a + 1”, once it is computed in the WHERE clause, it will not be recomputed in the later projection stage. Instead, it can be directly obtained from the downstream, thereby improving query efficiency.

2.If the upstream requires multiple instances of the same expression, the downstream only needs to output it once.

query_5: select a+1, a+1 from t where a+1 > 1;

ActionsChain coordinates the expression calculations at each step throughout the entire query. Following the principles of outputting the minimal amount of data upstream and prioritizing the output of highly computed expressions, it significantly reduces data transmission and computation. This greatly enhances the performance of ClickHouse.

This work is licensed under a Creative Commons Attribution 4.0 International License. When redistributing, please include the original link.

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注