CREATE DICTIONARY
Starting from Timeplus Enterprise 2.7, you can create dictionaries to store key-value pairs in memory, with data from various sources, such as files, MySQL/ClickHouse databases, or streams in Timeplus.
When you create a dictionary in a cluster, the dictionary is automatically replicated to all the nodes in the cluster.
Syntax
You can create a dictionary with the following SQL:
CREATE [OR REPLACE] DICTIONARY [IF NOT EXISTS] [db.]dictionary_name(
key1 type1 [DEFAULT|EXPRESSION expr1],
key2 type2 [DEFAULT|EXPRESSION expr2],
attr1 type2 [DEFAULT|EXPRESSION expr3],
attr2 type2 [DEFAULT|EXPRESSION expr4])
PRIMARY KEY key1, key2
SOURCE(SOURCE_NAME([param1 value1 ... paramN valueN]))
LAYOUT(LAYOUT_NAME([param_name param_value]))
LIFETIME({MIN min_val MAX max_val | max_val})
SETTINGS(setting_name = setting_value, setting_name = setting_value, ...)
COMMENT 'Comment'
PRIMARY KEY
The `PRIMARY KEY` clause specifies the key columns of the dictionary. You can specify multiple columns as the primary key.
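As a minimal sketch of a composite primary key (the dictionary, stream, and column names below are hypothetical), note that multi-column keys require one of the `COMPLEX_KEY_*` layouts:

```sql
-- Sketch: a dictionary keyed on two columns. Names are placeholders;
-- composite keys need a COMPLEX_KEY_* layout such as COMPLEX_KEY_HASHED.
CREATE DICTIONARY user_region_dict(
  tenant_id uint32,
  user_id uint64,
  region string DEFAULT 'unknown'
)
PRIMARY KEY tenant_id, user_id
SOURCE(TIMEPLUS(STREAM 'user_events'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(60);
```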
SOURCE
The source for a dictionary can be a:
- stream in the current Timeplus service
- stream in a remote Timeplus service
- file available by HTTP(S)
- another database, such as MySQL or PostgreSQL
Local Timeplus Stream
You can create a dictionary from a stream in the current Timeplus service. Mutable streams are recommended for dictionary sources, as they can be updated in real-time.
Syntax:
SOURCE(TIMEPLUS(STREAM 'stream_name' USER 'user' PASSWORD 'password'))
Note:
- You need to specify either `STREAM` or `QUERY`. A sample `QUERY` is: `SELECT key_column, second_column, third_column FROM table(stream1)`
- The `USER` and `PASSWORD` are optional. If you don't specify them, Timeplus will use the `default` user and an empty password, which only works if you have set up the default user in the Timeplus configuration file. You can also use the `current_user()` function to get the current user.
Please check the example in the Create an ip_trie dictionary from a Timeplus stream section.
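As a sketch of the `QUERY` variant (the stream name and columns below are hypothetical), the source can project or filter the stream before loading it into the dictionary:

```sql
-- Sketch: use QUERY instead of STREAM to shape the source data.
-- 'stream1' and its columns are placeholders.
CREATE DICTIONARY stream1_dict(
  key_column uint64,
  second_column string,
  third_column string
)
PRIMARY KEY key_column
SOURCE(TIMEPLUS(QUERY 'SELECT key_column, second_column, third_column FROM table(stream1)'))
LAYOUT(HASHED())
LIFETIME(10);
```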
Remote Timeplus Stream
Similar to the local stream, you can create a dictionary from a stream in a remote Timeplus service.
Syntax:
SOURCE(TIMEPLUS(HOST 'remotehost' PORT tcp_port() STREAM 'stream_name' USER 'user' PASSWORD 'password'))
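Putting it together, a full statement could look like the following sketch; the host, port, credentials, and stream schema are all placeholders you would replace with your own:

```sql
-- Sketch: dictionary backed by a stream on a remote Timeplus node.
-- Host, port, credentials, and the schema are hypothetical.
CREATE DICTIONARY remote_products_dict(
  product_id uint64,
  product_name string
)
PRIMARY KEY product_id
SOURCE(TIMEPLUS(HOST 'timeplus2.example.com' PORT 8463 STREAM 'products' USER 'reader' PASSWORD 'secret'))
LAYOUT(HASHED())
LIFETIME(MIN 5 MAX 30);
```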
ClickHouse Table
You can create a dictionary from a table in a remote ClickHouse service.
Syntax:
SOURCE(CLICKHOUSE(HOST 'remotehost' PORT 9000 SECURE 0|1 USER 'user' PASSWORD 'password' TABLE 'table_name' DB 'database_name' ))
Either the `TABLE` or the `QUERY` field must be declared.
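A full statement could look like the following sketch; the connection details and table schema are placeholders. Since the key here is a string, a `COMPLEX_KEY_*` layout is used:

```sql
-- Sketch: dictionary sourced from a ClickHouse table.
-- Host, credentials, database, and schema are hypothetical.
CREATE DICTIONARY ch_countries_dict(
  country_code string,
  country_name string DEFAULT 'unknown'
)
PRIMARY KEY country_code
SOURCE(CLICKHOUSE(HOST 'ch.example.com' PORT 9000 SECURE 0 USER 'default' PASSWORD '' TABLE 'countries' DB 'reference'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(300);
```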
MySQL Table
You can create a dictionary from a table in a remote MySQL database.
Syntax:
SOURCE(MYSQL(HOST 'remotehost' PORT 3306 USER 'user' PASSWORD 'password' TABLE 'table_name' DB 'database_name' ))
Note:
- Either the `TABLE` or the `QUERY` field must be declared.
- You can optionally specify `BG_RECONNECT true` to enable background reconnection to MySQL.
Please check the example in the Create a dictionary from a MySQL table section. You can optionally set up a mutable stream as the cache for the MySQL table; please check the Create a dictionary from a MySQL table with a mutable stream as the cache section.
Remote File
You can create a dictionary from a file available by HTTP(S).
Syntax:
SOURCE(HTTP(URL 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv' FORMAT 'CSVWithNames'))
Please check the example in the Create a dictionary from a CSV file section.
Local File
You can create a dictionary from a local file.
Syntax:
SOURCE(FILE(path '/var/lib/timeplusd/user_files/dict.tsv' format 'CSV'))
Please note the file path must be located in the `user_files` directory to prevent unauthorized access to the file system. Make sure each node in the cluster has the same file in the same location.
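As a sketch, assuming a two-column CSV file already exists at the path below (file name, path, and schema are placeholders), a matching dictionary could be declared like this:

```sql
-- Sketch: dictionary loaded from a local CSV file.
-- Assumed contents of /var/lib/timeplusd/user_files/dict.csv:
--   1,p111
--   2,p222
CREATE DICTIONARY file_dict(
  id uint64,
  name string
)
PRIMARY KEY id
SOURCE(FILE(path '/var/lib/timeplusd/user_files/dict.csv' format 'CSV'))
LAYOUT(FLAT())
LIFETIME(0);
```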
Local Executable
You can create a dictionary from a local executable.
Syntax:
SOURCE(EXECUTABLE(command '/var/lib/timeplusd/user_files/dict.sh' format 'TabSeparated'))
The executable should output data in the specified format, in this case TabSeparated. The executable should be located in the `user_files` directory. For example:
#!/bin/bash
echo -e "1\tp111"
echo -e "2\tp222"
echo -e "3\tp333"
Executable Pool
Executable pool allows loading data from a pool of processes. This source does not work with dictionary layouts that need to load all data from the source. Executable pool works if the dictionary is stored using the cache, complex_key_cache, ssd_cache, complex_key_ssd_cache, direct, or complex_key_direct layouts.
SOURCE(EXECUTABLE_POOL(command '/var/lib/timeplusd/user_files/dict.sh' format 'TabSeparated'))
LAYOUT
The `LAYOUT` clause specifies how the dictionary data is stored in memory. The available layout options are listed below. Please note the layout options and their settings are case-insensitive.
FLAT
The `FLAT` layout stores the dictionary data in a flat array. This layout provides the best performance, with the drawback of higher memory usage.
The default `INITIAL_ARRAY_SIZE` is 1,024 and the default `MAX_ARRAY_SIZE` is 500,000.
LAYOUT(FLAT(INITIAL_ARRAY_SIZE 50000 MAX_ARRAY_SIZE 5000000))
HASHED
The `HASHED` layout stores the dictionary data in a hash table. This layout provides better memory usage than the `FLAT` layout, with slightly lower performance. The dictionary can contain any number of elements with any identifiers.
-- default values
LAYOUT(HASHED())
-- custom values
LAYOUT(HASHED([SHARDS 1] [SHARD_LOAD_QUEUE_BACKLOG 10000] [MAX_LOAD_FACTOR 0.5]))
SPARSE_HASHED
The `SPARSE_HASHED` layout is similar to the `HASHED` layout, but it is optimized for sparse data. It uses less memory than the `HASHED` layout.
LAYOUT(SPARSE_HASHED([SHARDS 1] [SHARD_LOAD_QUEUE_BACKLOG 10000] [MAX_LOAD_FACTOR 0.5]))
COMPLEX_KEY_HASHED
The `COMPLEX_KEY_HASHED` layout is similar to the `HASHED` layout, but it supports composite keys.
LAYOUT(COMPLEX_KEY_HASHED([SHARDS 1] [SHARD_LOAD_QUEUE_BACKLOG 10000] [MAX_LOAD_FACTOR 0.5]))
COMPLEX_KEY_SPARSE_HASHED
The `COMPLEX_KEY_SPARSE_HASHED` layout is similar to the `SPARSE_HASHED` layout, but it supports composite keys.
LAYOUT(COMPLEX_KEY_SPARSE_HASHED([SHARDS 1] [SHARD_LOAD_QUEUE_BACKLOG 10000] [MAX_LOAD_FACTOR 0.5]))
HASHED_ARRAY
The `HASHED_ARRAY` layout stores the dictionary data in memory. Each attribute is stored in an array. The key attribute is stored in the form of a hash table, where the value is an index into the attribute arrays. The dictionary can contain any number of elements with any identifiers.
-- default values
LAYOUT(HASHED_ARRAY())
-- custom values
LAYOUT(HASHED_ARRAY([SHARDS 1]))
COMPLEX_KEY_HASHED_ARRAY
The `COMPLEX_KEY_HASHED_ARRAY` layout is similar to the `HASHED_ARRAY` layout, but it supports composite keys.
LAYOUT(COMPLEX_KEY_HASHED_ARRAY([SHARDS 1]))
RANGE_HASHED
The dictionary is stored in memory in the form of a hash table with an ordered array of ranges and their corresponding values.
LAYOUT(RANGE_HASHED(range_lookup_strategy 'max'))
RANGE(MIN StartTimeStamp MAX EndTimeStamp)
COMPLEX_KEY_RANGE_HASHED
The `COMPLEX_KEY_RANGE_HASHED` layout is similar to the `RANGE_HASHED` layout, but it supports composite keys.
LAYOUT(COMPLEX_KEY_RANGE_HASHED())
RANGE(MIN StartDate MAX EndDate);
CACHE
The `CACHE` layout stores the dictionary in a cache with a fixed number of cells. These cells contain frequently used elements.
LAYOUT(CACHE(SIZE_IN_CELLS 1000000000))
COMPLEX_KEY_CACHE
The `COMPLEX_KEY_CACHE` layout is similar to the `CACHE` layout, but it supports composite keys.
LAYOUT(COMPLEX_KEY_CACHE(SIZE_IN_CELLS 1000000000 ALLOW_READ_EXPIRED_KEYS 1 MAX_UPDATE_QUEUE_SIZE 100000 UPDATE_QUEUE_PUSH_TIMEOUT_MILLISECONDS 100 QUERY_WAIT_TIMEOUT_MILLISECONDS 60000 MAX_THREADS_FOR_UPDATES 4));
SSD_CACHE
The `SSD_CACHE` layout is similar to the `CACHE` layout, but it is optimized for SSD storage.
LAYOUT(SSD_CACHE(BLOCK_SIZE 4096 FILE_SIZE 16777216 READ_BUFFER_SIZE 1048576
PATH '/var/lib/timeplusd/user_files/test_dict'))
COMPLEX_KEY_SSD_CACHE
The `COMPLEX_KEY_SSD_CACHE` layout is similar to the `SSD_CACHE` layout, but it supports composite keys.
LAYOUT(COMPLEX_KEY_SSD_CACHE(BLOCK_SIZE 4096 FILE_SIZE 1073741824 READ_BUFFER_SIZE 131072 WRITE_BUFFER_SIZE 1048576 PATH '/var/lib/timeplusd/user_files/products_dict'));
MUTABLE_CACHE
The `MUTABLE_CACHE` layout is used to cache frequently accessed keys in a mutable stream. The dictionary will first look up the keys in the mutable stream, and if not found, it will fetch the data from the source.
Syntax:
LAYOUT(MUTABLE_CACHE(DB 'default' STREAM 'mysql_mutable_cache' UPDATE_FROM_SOURCE false|true));
The default value for `UPDATE_FROM_SOURCE` is `false`. If set to `true`, when there is a lookup miss, the dictionary will update the mutable stream with the data from the source. If set to `false`, the dictionary will only fetch the data from the source without updating the mutable stream.
COMPLEX_KEY_MUTABLE_CACHE
The `COMPLEX_KEY_MUTABLE_CACHE` layout is similar to the `MUTABLE_CACHE` layout, but it supports composite keys.
LAYOUT(COMPLEX_KEY_MUTABLE_CACHE(DB 'default' STREAM 'mutable_stream' UPDATE_FROM_SOURCE false));
HYBRID_HASH_CACHE
The `HYBRID_HASH_CACHE` layout leverages the Timeplus hybrid hash table to store the dictionary in both memory and disk. You need to specify both the `TTL` and `PATH` parameters.
LAYOUT(HYBRID_HASH_CACHE(TTL 3600 PATH 'path/to/folder'));
DIRECT
The dictionary with the `DIRECT` layout is not stored in memory and directly goes to the source during the processing of a request.
LAYOUT(DIRECT())
COMPLEX_KEY_DIRECT
The `COMPLEX_KEY_DIRECT` layout is similar to the `DIRECT` layout, but it supports composite keys.
LAYOUT(COMPLEX_KEY_DIRECT());
IP_TRIE
The `IP_TRIE` layout is for mapping network prefixes (IP addresses) to metadata such as ASN.
LAYOUT(IP_TRIE)
Please check the example in the Create an ip_trie dictionary from a Timeplus stream section.
LIFETIME
Timeplus can update the dictionary data automatically. You can specify the update interval with the `LIFETIME` clause. The `MIN` and `MAX` values are in seconds. The `MIN` value is the minimum interval between updates, and the `MAX` value is the maximum interval between updates.
For example:
LIFETIME(MIN 1 MAX 10)
This specifies that the dictionary updates after a random interval between 1 and 10 seconds.
This can be simplified to:
LIFETIME(10)
You can disable automatic updates by setting both `MIN` and `MAX` to 0, i.e.
LIFETIME(MIN 0 MAX 0)
Or simply:
LIFETIME(0)
Examples
Create a dictionary from a CSV file
For example, the following SQL creates a dictionary named `taxi_zone_dictionary` with data from a CSV file in a public S3 bucket:
CREATE DICTIONARY taxi_zone_dictionary
(
`LocationID` uint16 DEFAULT 0,
`Borough` string,
`Zone` string,
`service_zone` string
)
PRIMARY KEY LocationID
SOURCE(HTTP(URL 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv' FORMAT 'CSVWithNames'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(HASHED_ARRAY());
You can list the content of the dictionary with the following SQL:
SELECT * FROM taxi_zone_dictionary;
Or check the number of rows in the dictionary with the following SQL:
SELECT COUNT() FROM taxi_zone_dictionary;
Although you can query the dictionary with the regular `SELECT .. FROM dict WHERE ..` syntax, it is recommended to use the `dict_get*` functions for better performance.
-- not recommended
SELECT Borough FROM taxi_zone_dictionary WHERE LocationID=132;
-- 1 row in set. Elapsed: 0.002 sec.
-- recommended
SELECT dict_get('taxi_zone_dictionary', 'Borough', 132);
-- 1 row in set. Elapsed: 0.001 sec.
A common use case for dictionaries is to join them with other data streams. For example, you can join the `taxi_zone_dictionary` with a stream `taxi_trips` to get the borough of the pickup location:
SELECT *
FROM taxi_trips
JOIN taxi_zone_dictionary ON taxi_trips.pickup_nyct2010_gid = taxi_zone_dictionary.LocationID
WHERE dropoff_nyct2010_gid = 132 OR dropoff_nyct2010_gid = 138
Under the hood, Timeplus will use the dict_get function to look up the value from the dictionary. Depending on the LIFETIME settings, the dictionary will be updated automatically during the JOIN query execution.
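Conceptually, the JOIN above behaves like the following explicit `dict_get` lookup (a rough sketch using the column names from the example schema, not the exact plan Timeplus generates):

```sql
-- Roughly equivalent lookup written explicitly with dict_get:
SELECT
  *,
  dict_get('taxi_zone_dictionary', 'Borough', pickup_nyct2010_gid) AS pickup_borough
FROM taxi_trips
WHERE dropoff_nyct2010_gid = 132 OR dropoff_nyct2010_gid = 138;
```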
Create an ip_trie dictionary from a Timeplus stream
For example, you can create a mutable stream in Timeplus, and add some sample records to it:
CREATE MUTABLE STREAM my_ip_addresses (
prefix string,
asn uint32,
cca2 string
)
PRIMARY KEY prefix;
INSERT INTO my_ip_addresses(prefix,asn,cca2) VALUES
('202.79.32.0/20', 17501, 'NP'),
('2620:0:870::/48', 3856, 'US'),
('2a02:6b8:1::/48', 13238, 'RU'),
('2001:db8::/32', 65536, 'ZZ');
Then you can create a dictionary, referencing the stream:
CREATE DICTIONARY my_ip_trie_dictionary (
prefix string,
asn uint32,
cca2 string DEFAULT '??'
)
PRIMARY KEY prefix
SOURCE(TIMEPLUS(STREAM 'my_ip_addresses' USER 'admin' PASSWORD 'changeme'))
LAYOUT(IP_TRIE)
LIFETIME(10);
This will create an `IP_TRIE` dictionary, which is optimized for mapping network prefixes (IP addresses) to metadata such as ASN. The dictionary will be updated every 0 to 10 seconds.
Then you can query the dictionary with the `dict_get` function:
SELECT dict_get('my_ip_trie_dictionary', 'cca2', to_ipv4('202.79.32.10')) AS result;
-- returns 'NP'
SELECT dict_get('my_ip_trie_dictionary', 'asn', ipv6_string_to_num('2001:db8::1')) AS result;
-- returns 65536
SELECT dict_get('my_ip_trie_dictionary', ('asn', 'cca2'), ipv6_string_to_num('2001:db8::1')) AS result;
-- returns (65536, 'ZZ')
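If an address does not match any prefix, `dict_get` falls back to the attribute's declared `DEFAULT` (here, `'??'` for `cca2`). You can also supply an explicit per-query fallback, assuming a `dict_get_or_default` function is available (mirroring ClickHouse's `dictGetOrDefault`):

```sql
-- Sketch: explicit fallback for an unmatched address.
-- Assumes dict_get_or_default is available in your Timeplus version.
SELECT dict_get_or_default('my_ip_trie_dictionary', 'cca2', to_ipv4('10.0.0.1'), 'N/A') AS result;
```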
To demonstrate the update capability of the dictionary, you can insert a new record into the mutable stream:
INSERT INTO my_ip_addresses(prefix,asn,cca2) VALUES
('2001:db8::/32', 65536, 'BB');
Then run the query again:
SELECT dict_get('my_ip_trie_dictionary', 'cca2', ipv6_string_to_num('2001:db8::1')) AS result;
It will return `BB` instead of `ZZ`.
Create a dictionary from a MySQL table
For example, you can create a table in MySQL with the TPCH schema:
CREATE TABLE "region" (
"r_regionkey" int DEFAULT NULL,
"r_name" mediumtext,
"r_comment" mediumtext
);
Then you can create a dictionary, referencing the table:
CREATE DICTIONARY mysql_region (
r_regionkey uint64,
r_name string,
r_comment string
)
PRIMARY KEY r_regionkey
SOURCE(MYSQL(HOST 'mysql-timeplus.g.aivencloud.com' PORT 28851 USER 'avnadmin' PASSWORD '..' TABLE 'region' DB 'tpch'))
LIFETIME(0)
LAYOUT(FLAT());
Then you can query the dictionary with the `dict_get` function:
SELECT dict_get('mysql_region','r_name',2);
-- returns 'ASIA'
Create a dictionary from a MySQL table with a mutable stream as the cache
You can create a dictionary which looks up data from a mutable stream in Timeplus, and fetches the data from a MySQL table if some keys are not found in the stream. This is useful when there are a large number of keys in the MySQL table, and you want to cache the most frequently queried keys in the mutable stream. You can even set up a CDC pipeline to keep the mutable stream up-to-date, as a proactive way to make sure the frequently queried keys are always in the cache.
As an example, you can create a MySQL database with the TPCH schema. There are 150,000 rows in the `customer` table.
CREATE TABLE customer (
"c_custkey" bigint DEFAULT NULL,
"c_name" mediumtext,
"c_address" mediumtext,
"c_nationkey" int DEFAULT NULL,
"c_phone" mediumtext,
"c_acctbal" decimal(24,6) DEFAULT NULL,
"c_mktsegment" mediumtext,
"c_comment" mediumtext
);
You can create a mutable stream in Timeplus, and insert the first 10 records to it:
CREATE MUTABLE STREAM mysql_mutable_cache
(
c_custkey uint64,
c_name string,
c_phone string
)
PRIMARY KEY c_custkey;
INSERT INTO mysql_mutable_cache(c_custkey, c_name, c_phone) VALUES
(1, 'Customer#000000001','25-989-741-2988'),
(2, 'Customer#000000002','23-768-687-3665'),
(3, 'Customer#000000003','11-719-748-3364'),
(4, 'Customer#000000004','14-128-190-5944'),
(5, 'Customer#000000005','13-750-942-6364'),
(6, 'Customer#000000006','30-114-968-4951'),
(7, 'Customer#000000007','21-555-247-5051'),
(8, 'Customer#000000008','17-663-144-5538'),
(9, 'Customer#000000009','13-849-247-6831'),
(10, 'Customer#000000010','33-373-373-6083');
Then you can create a dictionary, referencing the mutable stream and the MySQL table:
CREATE DICTIONARY mysql_dict_mutable(
c_custkey uint64,
c_name string,
c_phone string
)
PRIMARY KEY c_custkey
SOURCE(MYSQL(DB 'tpch' TABLE 'customer' HOST 'host' PORT 3306 USER 'root' PASSWORD 'pwd' BG_RECONNECT true))
LAYOUT(MUTABLE_CACHE(DB 'default' STREAM 'mysql_mutable_cache' UPDATE_FROM_SOURCE false));
Then you can query the dictionary with the `dict_get` function:
SELECT dict_get('mysql_dict_mutable','c_phone',10);
-- returns '33-373-373-6083'
SELECT dict_get('mysql_dict_mutable','c_phone',11);
-- returns '33-464-151-3439', which is not available in the mutable stream, but in the MySQL table
You can update the mutable stream with the following SQL:
INSERT INTO mysql_mutable_cache(c_custkey, c_name, c_phone) VALUES
(10, 'Customer#000000010','12-123-123-1234');
Then run the query again:
SELECT dict_get('mysql_dict_mutable','c_phone',10);
-- returns '12-123-123-1234' from the mutable stream, without fetching from the MySQL table.
To keep the mutable stream cache up-to-date, you can set up a CDC pipeline with Redpanda Connect or Debezium. The INSERT or UPDATE operations in the MySQL table will be captured and sent to the mutable stream. You can even filter the CDC events to only capture the keys you are interested in.
You can also use the dictionary in a JOIN query with other streams:
SELECT * FROM orders JOIN mysql_dict_mutable AS customers
ON orders.customer_id = customers.c_custkey
SETTINGS join_algorithm = 'direct';
Limitations
- Creating a dictionary from PostgreSQL is not supported.
See also
- SHOW DICTIONARIES - Show dictionaries
- DROP DICTIONARY - Drop dictionaries