Aggregation
count
count(*)
to get the row number, or count(col)
to get the number of rows when col
is not NULL
count_distinct
count_distinct(col)
to get the number of unique values for the col
column. Same as count(distinct col)
count_if
count_if(condition)
to apply a filter with condition
and get the number of records. e.g. count_if(speed_kmh>80)
distinct
distinct(col)
to get the distinct value for the col
column.
unique
unique(<column_name1>[, <column_name2>, ...])
: Calculates the approximate number of different values of the columns.
unique_exact
unique_exact(<column_name1>[, <column_name2>, ...])
Calculates the exact number of different values of the columns.
unique_exact_if
unique_exact_if(col,condition)
to apply a filter with condition
and get the distinct count of col
, e.g. to get the cars with high speed unique_exact_if(cid,speed_kmh>80)
min
min(<column_name>)
: minimum value of a column. For String column, the comparison is lexicographic order.
max
max(<column_name>)
: maximum value of a column. For String column, the comparison is lexicographic order.
sum
sum(<column_name>)
: sum of the columns. Only works for numbers.
avg
avg(<column_name>)
: average value of a column (sum(column) / count(column)). Only works for numeric columns.
median
median(<column_name>)
Calculate median of a numeric data sample.
quantile
quantile(column,level)
Calculate an approximate quantile of a numeric data sequence. e.g. quantile(a,0.9)
to get the P90 for the column and quantile(a,0.5)
to get the median number
p90
short for quantile(a,0.9)
p95
short for quantile(a,0.95)
p99
short for quantile(a,0.99)
top_k
top_k(<column_name>,K)
: Top frequent K items in column_name. Return an array.
e.g. top_k(cid, 3)
may get ["c01","c02","c03"]
If you need to get the event count, you can set true
as the 3rd parameter, e.g. top_k(cid, 3, true)
may get [("c01",1200,0),("c02",800,0),("c03",700,0)]
. The 3rd element in the tuple is the probability of the error of the calculation, since it is an approximate calculation. 0 means no error. If you want to get the exact top K without approximate calculation, you can use top_k_exact
function, e.g. top_k_exact(cid,3)
, which will be slower than top_k
function.
Read more on Top-N Query Pattern page.
min_k
min_k(<column_name>,K [,context_column])
: The least K items in column_name. Return an array. You can also add a list of columns to get more context of the values in same row, such as min_k(price,3,product_id,last_updated)
This will return an array with each element as a tuple, such as [(5.12,'c42664'),(5.12,'c42664'),(15.36,'c84068')]
Read more on Top-N Query Pattern page.
max_k
max_k(<column_name>,K[,context_column])
: The greatest K items in column_name. You can also add a list of columns to get more context of the values in same row, such as max_k(price,3,product_id,last_updated)
Read more on Top-N Query Pattern page.
arg_min
arg_min(argument, value_column)
Gets the value in the argument
column for a minimal value in the value_column
. If there are several different values of argument
for minimal values of value_column
, it returns the first of these values encountered. You can achieve the same query with min_k(value_column,1, argument)[1].2
. But this is much easier.
arg_max
arg_max(argument, value_column)
Gets the value in the argument
column for a maximum value in the value_column
. If there are several different values of argument
for maximum values of value_column
, it returns the first of these values encountered. You can achieve the same query with max_k(value_column,1, argument)[1].2
. But this is much easier.
group_array
group_array(<column_name>)
to combine the values of the specific column as an array. For example, if there are 3 rows and the values for these columns are "a","b","c". This function will generate a single row and single column with value ['a','b','c']
group_uniq_array
group_uniq_array(<column_name>)
to combine the values of the specific column as an array, making sure only unique values in it. For example, if there are 3 rows and the values for these columns are "a","a","c". This function will generate a single row and single column with value ['a','c']
moving_sum
moving_sum(column)
returns an array with the moving sum of the specified column. For example, select moving_sum(a) from(select 1 as a union select 2 as a union select 3 as a)
will return [1,3,6]
any
any(column)
Selects the first encountered (non-NULL) value, unless all rows have NULL values in that column. The query can be executed in any order and even in a different order each time, so the result of this function is indeterminate. To get a determinate result, you can use the min
or max
function instead of any
.
first_value
first_value(column)
Selects the first encountered value.
last_value
last_value(column)
Selects the last encountered value.
stochastic_linear_regression_state
stochastic_linear_regression_state(num, target, param1, param2)
This function implements stochastic linear regression. It supports custom parameters for learning rate, L2 regularization coefficient, mini-batch size and has few methods for updating weights (Adam (used by default), simple SGD, Momentum, Nesterov). Learn more at ClickHouse docs.
stochastic_logistic_regression
stochastic_logistic_regression(num, num, num, string)
This function implements stochastic logistic regression. It can be used for binary classification problem, supports the same custom parameters as stochasticLinearRegression and works the same way. Learn more at ClickHouse docs.
largest_triangle_three_buckets
largest_triangle_three_buckets(x, y, n)
or lttb(x, y, n)
. x
is the x coordinate. y
is the y coordinate. n
is the number of points in the resulting series.
Applies the Largest-Triangle-Three-Buckets algorithm to the input data. The algorithm is used for downsampling time series data for visualization. It is designed to operate on series sorted by x coordinate. It works by dividing the sorted series into buckets and then finding the largest triangle in each bucket. The number of buckets is equal to the number of points in the resulting series. The function will sort data by x
and then apply the downsampling algorithm to the sorted data.
For example:
CREATE STREAM test
(
x float64,
y float64
) ENGINE = MergeTree order by (y,x);
INSERT INTO test
VALUES (1.0, 10.0),(2.0, 20.0),(3.0, 15.0),(8.0, 60.0),(9.0, 55.0),(10.0, 70.0),(4.0, 30.0),(5.0, 40.0),(6.0, 35.0),(7.0, 50.0);
select largest_triangle_three_buckets(x, y, 0) FROM test;
select largest_triangle_three_buckets(x, y, 1) FROM test;
select largest_triangle_three_buckets(x, y, 2) FROM test;
SELECT largest_triangle_three_buckets(x, y, 4) FROM test;
lttb
Alias for largest_triangle_three_buckets
.
avg_time_weighted
avg_time_weighted(column, time_column)
to calculate the time-weighted average of the column. The time column should be in the format of datetime
,datetime64
or date
.
Optionally, you can add a third parameter to specify an end time for your analysis period. When you omit the third parameter, the calculation excludes the last value. If provided, the end time must match the timestamp column's data type, and the function uses the difference between the last time point and this end time as the weight for the final value.
median_time_weighted
median_time_weighted(column, time_column)
to calculate the time-weighted median of the column. The time column should be in the format of datetime
,datetime64
or date
.
This function also takes an optional third parameter to specify an end time for your analysis period.