Python UDF

In addition to Remote UDFs and JavaScript UDFs, Timeplus Enterprise supports Python UDFs starting from v2.7, as a technical preview feature. You can develop user-defined scalar functions (UDFs) or user-defined aggregate functions (UDAFs) with the Python 3.10 runtime embedded in the Timeplus core engine. There is no need to deploy an extra server or service for the UDF.

Why Python UDF

Python is recognized as one of the most popular languages in the field of data science. Its flexibility as a scripting language, ease of use, and extensive range of statistical libraries make it an indispensable tool for data scientists and analysts.

Python excels in writing complex parsing and data transformation logic, especially in scenarios where SQL capabilities are insufficient. Python User-Defined Functions (UDFs) offer the flexibility to implement intricate data processing mechanisms. These include:

  • Custom Tokenization: Breaking down data into meaningful elements based on specific criteria.
  • Data Masking: Concealing sensitive data elements to protect privacy.
  • Data Editing: Modifying data values according to specific rules or requirements.
  • Encryption Mechanisms: Applying encryption to data for security purposes.

Data type mapping

The following table maps Timeplus data types to Python types:

Timeplus Data Type                  Python Type
bool                                bool
uint8, uint16, uint32, uint64       int
int8, int16, int32, int64           int
date, date32, datetime              int
float32, float64                    float
date, date32                        datetime.date
datetime, datetime64                datetime.datetime
string, fixed_string                str
array                               list
tuple                               tuple
map                                 dict
ipv4                                int
uint128, uint256, int128, int256    N/A
decimal                             N/A
ipv6                                N/A
nullable                            N/A
low_cardinality                     N/A

If your use cases require more data type support, please contact us at support@timeplus.com.
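As a small illustration of the mapping, an ipv4 value arrives in Python as an int, so a function body can turn it back into dotted-quad text with the standard ipaddress module. This is only a sketch: the function name ipv4_to_text is hypothetical, and it assumes that the integer is the usual numeric value of the address and that inputs arrive as a list (see the batching note in the Examples section).

```python
import ipaddress

def ipv4_to_text(values):
    # values: a batch (list) of ipv4 addresses represented as int (assumption)
    return [str(ipaddress.IPv4Address(v)) for v in values]

print(ipv4_to_text([3232235777, 167772161]))
```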

Register a Python UDF

You can create or replace a Python UDF with SQL. Support in the web UI will be added later.

Scalar UDF

A scalar UDF is a stateless function that converts the column values of one row into another value.

Syntax:

CREATE OR REPLACE FUNCTION udf_name(param1 type1,..)
RETURNS type2 LANGUAGE PYTHON AS
$$
import ...

def udf_name(col1..):
    ...

$$
SETTINGS ...

You need to make sure the SQL function name is identical to the function name in the Python code.
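The Python part of a scalar UDF is an ordinary function, so it can be exercised outside Timeplus before it is registered. The following is a minimal sketch, not an official example: the function name mask_email is hypothetical, and it assumes string inputs arrive as a list of str (see the batching note in the Examples section).

```python
def mask_email(values):
    # values: a batch (list) of email strings; returns one masked string per input
    masked = []
    for email in values:
        name, sep, domain = email.partition("@")
        if not sep:
            # Not an email address: hide every character
            masked.append("*" * len(email))
        else:
            # Keep the first character of the local part and hide the rest
            masked.append(name[:1] + "*" * max(len(name) - 1, 0) + sep + domain)
    return masked

print(mask_email(["alice@example.com", "no-at-sign"]))
```

To register it, the same def would go between the $$ markers of a CREATE OR REPLACE FUNCTION mask_email(...) statement, with the SQL function name matching the Python name.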

UDAF

A UDAF, or user-defined aggregate function, is stateful. It takes one or more columns from a set of rows and returns the aggregated result.

Syntax:

CREATE OR REPLACE AGGREGATE FUNCTION uda_name(param1 type1,...)
RETURNS type2 LANGUAGE PYTHON AS
$$
import ...
class uda_name:
    def __init__(self):
        ...

    def serialize(self):
        ...

    def deserialize(self, data):
        ...

    def merge(self, data):
        ...

    def process(self, values):
        ...

    def finalize(self):
        ...
$$
SETTINGS ...

The function list:

  • process: the core logic of the aggregation function. Required.
  • finalize: return the aggregation result. Required.
  • serialize: save the state as a string or pickle binary and put it in the checkpoint. Optional.
  • deserialize: load the state from the checkpoint into the internal state. Optional.
  • merge: for multi-shard processing, merge the states from each shard. Optional.
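The lifecycle can be tried locally with a small driver. The sketch below is an assumption about call order, not the engine's actual implementation. The class name avg_value and the driver code are hypothetical. It assumes that process receives a list of values per batch, that merge receives another instance (as in the getMax example later on this page), and that finalize returns a one-element list.

```python
class avg_value:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def process(self, values):
        # Required: fold one batch of input values into the state
        self.total += sum(values)
        self.count += len(values)

    def merge(self, other):
        # Optional: combine the state computed on another shard
        # (assumed here to be another instance of the class)
        self.total += other.total
        self.count += other.count

    def finalize(self):
        # Required: return the aggregation result as a one-element list
        return [self.total / self.count if self.count else 0.0]

# Hypothetical driver standing in for the engine
shard_a, shard_b = avg_value(), avg_value()
shard_a.process([1, 2, 3])
shard_b.process([4, 5])
shard_a.merge(shard_b)
print(shard_a.finalize())
```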

Examples

A simple UDF without dependency

Timeplus Python UDFs support the Python standard library and the built-in functions. This example takes a number as input and adds 5 to it.

CREATE OR REPLACE FUNCTION add_five(value uint16) RETURNS int LANGUAGE PYTHON AS $$
def add_five(value):
    # value is a batch (list) of inputs; update each element in place
    for i in range(len(value)):
        value[i] = value[i] + 5
    return value
$$;

Please note:

  • To improve performance, Timeplus calls the UDF with a batch of inputs. The input of the Python function add_five is therefore a list of int values, not a single int.
  • The function name add_five in the SQL statement should match the function name in the Python code block.
  • Python code block should be enclosed in $$. Alternatively, you can use ' to enclose the code block, but this may cause issues with the Python code block if it contains '.
  • Python code is indented with spaces or tabs. It's recommended to put def at the beginning of the line without indentation.
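Because the function receives the whole batch, its body can be checked with a plain list. The sketch below shows an equivalent body that builds a new list instead of updating the input in place. Whether Timeplus cares about in-place mutation is not stated here, so treat this purely as a style alternative.

```python
def add_five(value):
    # value: a batch (list) of int; return a list of the same length
    return [v + 5 for v in value]

batch = [1, 10, 65530]
print(add_five(batch))
```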

A simple UDF with numpy

NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python.

NumPy is not installed by default. Install it manually by following the guide in the Install Python Libraries section.

This example takes a number as input and adds 5 to it via NumPy.

CREATE OR REPLACE FUNCTION add_five(value uint16)
RETURNS uint16 LANGUAGE PYTHON AS $$
import numpy as np
def add_five(value):
    # Convert the input batch (a list) to a numpy array, then add 5 to every element
    np_arr = np.array(value)
    np_arr += 5
    return np_arr.tolist()
$$;

Please note, to improve the performance, Timeplus calls the UDF with a batch of inputs. The input of the Python function add_five is list(int). We use numpy.array(list) to convert it to a numpy array.

A simple UDAF with pickle

Pickle implements binary protocols for serializing and de-serializing a Python object structure.

This example computes the maximum value and uses pickle to save and load the state.

CREATE OR REPLACE AGGREGATE FUNCTION getMax(value uint16) RETURNS uint16 LANGUAGE PYTHON AS $$
import pickle
class getMax:
    def __init__(self):
        self.max = 0

    def serialize(self):
        # Save the state as pickle binary
        data = {}
        data['max'] = self.max
        return pickle.dumps(data)

    def deserialize(self, data):
        # Load the state saved by serialize
        data = pickle.loads(data)
        self.max = data['max']

    def merge(self, other):
        if other.max > self.max:
            self.max = other.max

    def process(self, values):
        # values is a batch (list) of inputs
        for item in values:
            if item > self.max:
                self.max = item

    def finalize(self):
        return [self.max]
$$;
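To see how serialize and deserialize fit together, the class can be driven by hand. The driver below is hypothetical and only mimics a checkpoint followed by a restore. How and when Timeplus actually calls these methods is not shown here.

```python
import pickle

class getMax:
    def __init__(self):
        self.max = 0

    def serialize(self):
        return pickle.dumps({'max': self.max})

    def deserialize(self, data):
        self.max = pickle.loads(data)['max']

    def process(self, values):
        for item in values:
            if item > self.max:
                self.max = item

    def finalize(self):
        return [self.max]

# Hypothetical driver: process, checkpoint, restore into a fresh instance, continue
before = getMax()
before.process([3, 9, 4])
checkpoint = before.serialize()  # pickle bytes that could be stored in a checkpoint

restored = getMax()
restored.deserialize(checkpoint)
restored.process([7, 2])
print(restored.finalize())
```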

Manage Python Libraries

By default, Timeplus Enterprise ships a clean Python 3.10 environment, plus the following essential libraries:

  • pip
  • setuptools
  • six
  • wheel

All the dependencies of those libraries are also preinstalled. Standard library modules, such as pickle, are available without extra installation.

Install Python Libraries

To install new Python libraries, you can call the REST API of timeplusd in Timeplus Enterprise v2.7. In the future, we will provide a more user-friendly way to install Python libraries.

Note: The following curl sample commands assume the timeplusd server is running on localhost:8123, with default as the user with an empty password. More commonly, you need to set the HTTP headers x-timeplus-user and x-timeplus-key with the user and password.

For example, if you want to install the numpy library, you can use the following command:

curl -X POST http://localhost:8123/timeplusd/v1/python_packages -H "Content-Type: text/plain; charset=utf-8" -d '{"packages": [{"name": "numpy"}]}'

If you need to install a specific version of a library, you can specify it in the version field. For example, to install numpy version 2.2.3, you can use the following command:

curl -X POST http://localhost:8123/timeplusd/v1/python_packages -H "Content-Type: text/plain; charset=utf-8" -d '{"packages": [{"name": "numpy", "version": "2.2.3"}]}'
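The same request can be prepared from Python with the standard library. This is a sketch that mirrors the curl commands above and only builds the request without sending it. The helper name build_install_request and its parameters are hypothetical. The URL, body shape, and the x-timeplus-user and x-timeplus-key headers are taken from this page.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8123/timeplusd/v1/python_packages"

def build_install_request(name, version=None, user=None, key=None):
    # Build (but do not send) the POST request that installs one package
    package = {"name": name}
    if version is not None:
        package["version"] = version
    body = json.dumps({"packages": [package]}).encode("utf-8")
    headers = {"Content-Type": "text/plain; charset=utf-8"}
    if user is not None:
        # Credentials are passed as HTTP headers when the default user is not used
        headers["x-timeplus-user"] = user
        headers["x-timeplus-key"] = key or ""
    return urllib.request.Request(BASE_URL, data=body, headers=headers, method="POST")

req = build_install_request("numpy", version="2.2.3")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To send it against a running server: urllib.request.urlopen(req)
```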

List Python Libraries

To list the extra Python libraries installed in Timeplus Enterprise, you can use the following command:

curl http://localhost:8123/timeplusd/v1/python_packages

Delete Python Libraries

To delete Python libraries, you can call the REST API of timeplusd in Timeplus Enterprise.

For example, if you want to delete the numpy library, you can use the following command:

curl -X DELETE http://localhost:8123/timeplusd/v1/python_packages/numpy

Update Python Libraries

Currently we don't support updating Python libraries. You can delete the library and reinstall it with the desired version.
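The delete-then-reinstall workaround can be sketched as two requests issued in order. The helper name build_reinstall_requests is hypothetical, and the code only builds the requests. It assumes the same endpoints shown in the curl commands on this page and a server on localhost:8123 with the default user.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8123/timeplusd/v1/python_packages"

def build_reinstall_requests(name, version):
    # Step 1: DELETE the installed package. Step 2: POST the desired version.
    delete_req = urllib.request.Request(f"{BASE_URL}/{name}", method="DELETE")
    body = json.dumps({"packages": [{"name": name, "version": version}]}).encode("utf-8")
    install_req = urllib.request.Request(
        BASE_URL,
        data=body,
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )
    return [delete_req, install_req]

for req in build_reinstall_requests("numpy", "2.2.3"):
    print(req.get_method(), req.full_url)
    # To send each one in order: urllib.request.urlopen(req)
```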

Limitations

Timeplus Enterprise v2.7 is the first version that supports Python UDFs. The following limitations apply:

  • Python UDFs are only available in Linux x86_64 bare metal or containerized deployments.
  • For Linux x86_64 bare metal deployments, Glibc version 2.35 or higher is required.
  • Only Python 3.10 is supported. Contact us if you need a different Python version.
  • Not all Python libraries can be installed in Timeplus Enterprise. Contact us if you need to install a specific library.
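For bare metal hosts, the platform and Glibc requirements above can be checked with a short preflight script. This is a sketch under assumptions: the helper name meets_requirements is hypothetical, it inspects the machine where the script runs (which must be the host that runs timeplusd), and it does not cover containerized deployments or the other limitations.

```python
import platform

def meets_requirements(system, machine, libc_name, libc_version, min_glibc=(2, 35)):
    # True when the host looks like Linux x86_64 with glibc >= min_glibc
    if system != "Linux" or machine != "x86_64":
        return False
    if libc_name != "glibc":
        return False
    try:
        parsed = tuple(int(part) for part in libc_version.split(".")[:2])
    except ValueError:
        return False
    return parsed >= min_glibc

libc_name, libc_version = platform.libc_ver()
print(meets_requirements(platform.system(), platform.machine(), libc_name, libc_version))
```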