The polars API is a delight, in part because of its consistency. Transformations are chained sequentially onto the DataFrame in a consistent series of steps without leaving the DataFrame. This helps developers get “in the flow”, produces highly readable and well-structured code, and can make for a very natural transition for users coming from R’s tidyverse who tend to think about data transformations as a series of “pipes”.
While the API has excellent coverage over a wide range of standard transformations that data practitioners need1, users may often need to incorporate user-defined functions (UDFs) into their logic to leverage other python libraries for domain-specific problems. This arises frequently in data science, where tools for modeling and statistical testing are, by design, not built into polars itself.
This creates multiple points of potential friction:
- polars is highly readable due to the API’s consistent use of method chaining; applying UDFs shouldn’t break the flow
- users may wish to return more complex “non-scalar” datatypes (e.g. multidimensional arrays, model objects) into the DataFrame2
This post is a quick reference to demonstrate polars’ numerous capabilities for integrating different types of external (from other packages) or custom (user modularized) logic without breaking the flow of polars transformations. Using examples from data simulation, model evaluation, and inference, we will explore methods for applying UDFs for transformation and aggregation, transforming complex objects within a polars pipe, and easy “escape hatches” to break the abstraction when necessary.
TLDR
Whenever possible, it is most efficient to express your custom user-defined function (UDF) in the native polars API. When the API affords the logic you need, you can modularize that polars code into a function that takes an expression or a DataFrame as its first argument and add it to your polars code with:
- pipe() – allows piping of expressions and DataFrames into UDFs
- map_columns() – a custom pipe function capable of handling contexts like selectors
For arbitrary python logic to transform expressions (i.e. at the column level), you can use map_{batches|elements}() within with_columns():
- map_batches() – for applying non-polars vectorized functions (preferred)
- map_elements() – for applying non-vectorized functions (less efficient)
Similarly, for arbitrary expression aggregation, map_groups() can be used inside of agg():
- map_groups() – to keep everything in the DataFrame
However, there are numerous hacks and special cases to make your code either more efficient or more readable:
- polars extensions may provide a more native Rust implementation of the logic
- creating a generator function can mimic polars’s expression expansion, allowing you to apply the same transformation to many columns at once
- the ability of map_*() to return values of type pl.Object means you can fit any number of complex objects (e.g. models) into a polars pipeline that you wish to keep wrangling
- partition_by() provides an easy off-ramp for breaking out of the DataFrame abstraction for further processing with comfortable python-native patterns like list comprehensions
Set Up
We’ll load a few packages to begin:

import polars as pl
import polars.selectors as cs
import polars_ds as pds
import numpy as np
from numpy.random import binomial
from sklearn.metrics import roc_auc_score
import statsmodels.api as sm
Applying polars UDFs
Now, imagine you simply want to be able to apply and reuse a user-defined function (UDF) written with native polars logic. This is easily done with the pipe() method, which can be chained onto either expressions (logic that computes variables in the DataFrame) or full DataFrames. Writing additional transformation logic with polars’ native API is preferable wherever possible since it allows polars to use the same data representations and optimizations.
We’ll start with a boring toy dataset.
data_dict = {
'group': ['a']*4 + ['b']*4,
'x': np.arange(1,9,1),
'y': np.arange(8,0,-1),
'p': np.arange(1,9,1)/10
}
df = pl.DataFrame(data_dict)
df.glimpse()

Rows: 8
Columns: 4
$ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
$ x <i64> 1, 2, 3, 4, 5, 6, 7, 8
$ y <i64> 8, 7, 6, 5, 4, 3, 2, 1
$ p <f64> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
Columns (pipe and map_columns())
def cap(c:pl.Expr, ceil:int = 5) -> pl.Expr:
    return pl.when( c > ceil).then(ceil).otherwise( c )
df.with_columns( pl.col('x').pipe(cap))

| group | x | y | p | literal |
|---|---|---|---|---|
| str | i64 | i64 | f64 | i64 |
| "a" | 1 | 8 | 0.1 | 1 |
| "a" | 2 | 7 | 0.2 | 2 |
| "a" | 3 | 6 | 0.3 | 3 |
| "a" | 4 | 5 | 0.4 | 4 |
| "b" | 5 | 4 | 0.5 | 5 |
| "b" | 6 | 3 | 0.6 | 5 |
| "b" | 7 | 2 | 0.7 | 5 |
| "b" | 8 | 1 | 0.8 | 5 |
This also works when applying a transformation to multiple columns with selectors. However, the pipe can result in a conflict in which all output columns have the same name (unlike native chaining). This is fixed by appending .name.keep(), which accesses and reapplies the name of the initial column being mapped.
df.with_columns( cs.numeric().pipe(cap).name.keep() )

| group | x | y | p |
|---|---|---|---|
| str | i64 | i64 | f64 |
| "a" | 1 | 5 | 0.1 |
| "a" | 2 | 5 | 0.2 |
| "a" | 3 | 5 | 0.3 |
| "a" | 4 | 5 | 0.4 |
| "b" | 5 | 4 | 0.5 |
| "b" | 5 | 3 | 0.6 |
| "b" | 5 | 2 | 0.7 |
| "b" | 5 | 1 | 0.8 |
If you have no other transformations you wish to do simultaneously, map_columns() is a slightly more concise alternative which accepts a column selector and a single transformation to be applied to all passed columns.
df.map_columns( cs.numeric(), cap)

| group | x | y | p |
|---|---|---|---|
| str | i64 | i64 | f64 |
| "a" | 1 | 5 | 0.1 |
| "a" | 2 | 5 | 0.2 |
| "a" | 3 | 5 | 0.3 |
| "a" | 4 | 5 | 0.4 |
| "b" | 5 | 4 | 0.5 |
| "b" | 5 | 3 | 0.6 |
| "b" | 5 | 2 | 0.7 |
| "b" | 5 | 1 | 0.8 |
Data Frames (pipe)
Alternatively, you may wish to encapsulate logic operating at the level of the entire DataFrame versus an individual column.
def calc_diffs(df:pl.DataFrame, threshold:int = 5) -> pl.DataFrame:
    df_out = (
        df
        .with_columns(
            abs = (pl.col('x') - pl.col('y')).abs(),
            abs_gt_t = (pl.col('x') - pl.col('y')).abs() > threshold,
        )
    )
    return df_out

This too can be chained onto a DataFrame using pipe():
df.pipe(calc_diffs)

| group | x | y | p | abs | abs_gt_t |
|---|---|---|---|---|---|
| str | i64 | i64 | f64 | i64 | bool |
| "a" | 1 | 8 | 0.1 | 7 | true |
| "a" | 2 | 7 | 0.2 | 5 | false |
| "a" | 3 | 6 | 0.3 | 3 | false |
| "a" | 4 | 5 | 0.4 | 1 | false |
| "b" | 5 | 4 | 0.5 | 1 | false |
| "b" | 6 | 3 | 0.6 | 3 | false |
| "b" | 7 | 2 | 0.7 | 5 | false |
| "b" | 8 | 1 | 0.8 | 7 | true |
Values of other arguments to your function can be passed as kwargs3:
df.pipe(calc_diffs, threshold = 3)

| group | x | y | p | abs | abs_gt_t |
|---|---|---|---|---|---|
| str | i64 | i64 | f64 | i64 | bool |
| "a" | 1 | 8 | 0.1 | 7 | true |
| "a" | 2 | 7 | 0.2 | 5 | true |
| "a" | 3 | 6 | 0.3 | 3 | false |
| "a" | 4 | 5 | 0.4 | 1 | false |
| "b" | 5 | 4 | 0.5 | 1 | false |
| "b" | 6 | 3 | 0.6 | 3 | false |
| "b" | 7 | 2 | 0.7 | 5 | true |
| "b" | 8 | 1 | 0.8 | 7 | true |
This also allows us to write DataFrame-level functions that operate on different variables by passing the variables as parameters.
def calc_diffs(df:pl.DataFrame, var1:str = 'x', var2:str = 'y', threshold:int = 5) -> pl.DataFrame:
    df_out = (
        df
        .with_columns(
            abs = (pl.col(var1) - pl.col(var2)).abs(),
            abs_gt_t = (pl.col(var1) - pl.col(var2)).abs() > threshold,
        )
    )
    return df_out

df.pipe(calc_diffs, var1 = 'y', var2 = 'x', threshold = 3)

| group | x | y | p | abs | abs_gt_t |
|---|---|---|---|---|---|
| str | i64 | i64 | f64 | i64 | bool |
| "a" | 1 | 8 | 0.1 | 7 | true |
| "a" | 2 | 7 | 0.2 | 5 | true |
| "a" | 3 | 6 | 0.3 | 3 | false |
| "a" | 4 | 5 | 0.4 | 1 | false |
| "b" | 5 | 4 | 0.5 | 1 | false |
| "b" | 6 | 3 | 0.6 | 3 | false |
| "b" | 7 | 2 | 0.7 | 5 | true |
| "b" | 8 | 1 | 0.8 | 7 | true |
Applying custom series transformations
Piping is great, but it can break down when you need to apply column transformations requiring multiple columns as inputs or requiring logic outside of the polars API. That’s where map_batches() and map_elements() become useful.
These methods chain onto expressions just like other transformations. However, they can accept as arguments any arbitrary python function, as well as specifications for the type of return (scalar or vector, data types). The two methods differ in that map_batches() expects the function to be vectorized whereas map_elements() can use any arbitrary function (but assumes it will have to iterate over inputs).
With these methods, we can input one or more expressions from a DataFrame and return either a scalar or a vector output.
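Before turning to map_batches(), here is a minimal sketch of the map_elements() counterpart (a toy example of mine, not from the post): the Python function is called once per value, so any scalar logic works even though it is not vectorized.

```python
# map_elements() applies a plain Python function element by element
# (assumes the toy `df` defined above)
df.with_columns(
    p_label = pl.col('p').map_elements(
        lambda p: 'high' if p > 0.5 else 'low',
        return_dtype = pl.String,
    )
)
```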
Map Batches
Imagine we want to simulate draws from a binomial distribution, based on the sample size x and probability p in the dataset above.
In the simplest case in which our function receives one input, we can chain map_batches() onto that expression. Here, we simply provide the function of interest and optional fields to confirm that our return value is a scalar (the result of a single coin flip) of type integer:
# one column in, one value out
df.with_columns(
coin_flip = pl.col('p').map_batches(function = lambda p: binomial(n = 1, p = p), returns_scalar = True, return_dtype = pl.UInt16)
)

| group | x | y | p | coin_flip |
|---|---|---|---|---|
| str | i64 | i64 | f64 | u16 |
| "a" | 1 | 8 | 0.1 | 1 |
| "a" | 2 | 7 | 0.2 | 0 |
| "a" | 3 | 6 | 0.3 | 0 |
| "a" | 4 | 5 | 0.4 | 0 |
| "b" | 5 | 4 | 0.5 | 1 |
| "b" | 6 | 3 | 0.6 | 0 |
| "b" | 7 | 2 | 0.7 | 1 |
| "b" | 8 | 1 | 0.8 | 1 |
However, if our function requires multiple expressions as inputs, we must either create a struct or pass the names of those expressions to exprs in the top-level pl.map_batches() (which I find cleaner). The function you are mapping must similarly assume it is receiving an input containing those expressions in the same order, and thus access them through indexing.
# two columns in, one value out - with structs
df.with_columns(
coin_flip = pl.struct('x','p').map_batches(
function = lambda z: binomial(n = z.struct['x'], p = z.struct['p']),
returns_scalar = True, return_dtype = pl.UInt16)
)
# two columns in, one value out - with exprs
df.with_columns(
coin_flip = pl.map_batches(exprs = ['x', 'p'],
function = lambda z: binomial(n = z[0], p = z[1]),
returns_scalar = True, return_dtype = pl.UInt16)
)

| group | x | y | p | coin_flip |
|---|---|---|---|---|
| str | i64 | i64 | f64 | u16 |
| "a" | 1 | 8 | 0.1 | 0 |
| "a" | 2 | 7 | 0.2 | 0 |
| "a" | 3 | 6 | 0.3 | 2 |
| "a" | 4 | 5 | 0.4 | 2 |
| "b" | 5 | 4 | 0.5 | 2 |
| "b" | 6 | 3 | 0.6 | 4 |
| "b" | 7 | 2 | 0.7 | 5 |
| "b" | 8 | 1 | 0.8 | 6 |
Multiple Outputs
Finally, you can also return multiple outputs. Suppose we want to simulate 100 draws, not just 1. Our internal function can instead return an array. Afterward, we can compare the average outcome against the expected value to see that this worked as intended.
# many columns out
df.with_columns(
coin_flip = pl.struct('x','p').map_batches(
function = lambda z: binomial(n = z.struct['x'],
p = z.struct['p'],
size = (100,z.shape[0])
).transpose(),
return_dtype = pl.Array(pl.UInt16, 100)
)
).with_columns(
avg_outcome = pl.col('coin_flip').arr.mean(),
exp_value = pl.col('x') * pl.col('p')
)

| group | x | y | p | coin_flip | avg_outcome | exp_value |
|---|---|---|---|---|---|---|
| str | i64 | i64 | f64 | array[u16, 100] | f64 | f64 |
| "a" | 1 | 8 | 0.1 | [0, 0, … 0] | 0.09 | 0.1 |
| "a" | 2 | 7 | 0.2 | [0, 0, … 0] | 0.37 | 0.4 |
| "a" | 3 | 6 | 0.3 | [0, 1, … 1] | 0.94 | 0.9 |
| "a" | 4 | 5 | 0.4 | [1, 2, … 4] | 1.48 | 1.6 |
| "b" | 5 | 4 | 0.5 | [2, 1, … 4] | 2.47 | 2.5 |
| "b" | 6 | 3 | 0.6 | [4, 1, … 4] | 3.53 | 3.6 |
| "b" | 7 | 2 | 0.7 | [5, 5, … 5] | 4.7 | 4.9 |
| "b" | 8 | 1 | 0.8 | [8, 8, … 7] | 6.54 | 6.4 |
Applying custom aggregations (Map Groups)
Similar to column transformations, polars can also handle arbitrary data aggregation logic with map_groups().
Consider a DataFrame with multiple model scores:
data_dict = {
'group': ['a']*4 + ['b']*4,
'truth': [1,1,0,0]*2,
'mod_bad': [0.25,0.25,0.75,0.75]*2,
'mod_bst': [0.99,0.75,0.25,0.01]*2,
'mod_rnd': [0.5]*8,
'mod_mix': [0.99,0.75,0.25,0.01]+[0.5]*3+[0.6]
}
df = pl.DataFrame(data_dict)
df.glimpse()

Rows: 8
Columns: 6
$ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
$ truth <i64> 1, 1, 0, 0, 1, 1, 0, 0
$ mod_bad <f64> 0.25, 0.25, 0.75, 0.75, 0.25, 0.25, 0.75, 0.75
$ mod_bst <f64> 0.99, 0.75, 0.25, 0.01, 0.99, 0.75, 0.25, 0.01
$ mod_rnd <f64> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5
$ mod_mix <f64> 0.99, 0.75, 0.25, 0.01, 0.5, 0.5, 0.5, 0.6
map_groups() allows us to conduct arbitrary aggregations, such as calculating AUROC by group with roc_auc_score from the scikit-learn package. As you can see, the approach is largely the same: specify the expressions required, the function to apply, the return type, and the return structure.
df.group_by('group').agg (
pl.map_groups(
exprs = ['truth', 'mod_mix'],
function = lambda x: roc_auc_score(x[0], x[1]),
return_dtype = pl.Float64,
returns_scalar = True
)
)

| group | truth |
|---|---|
| str | f64 |
| "b" | 0.25 |
| "a" | 1.0 |
Alternatives & extensions
Once you understand the different mapping capabilities available in polars, you can use them effectively either to expand the paradigm or to decide when you want to deviate from it. We’ll conclude by looking at some examples of each.
Extension libraries
Recall that native Rust and polars implementations will generally be faster than the techniques shown. Thus, another good option is to familiarize yourself with the burgeoning ecosystem of polars extensions to see if one suits your needs. Awesome Polars maintains a growing list of such packages.
For example, the polars-ds package can natively handle the AUROC use case above as it provides many useful evaluation functions:
df.group_by('group').agg (
auroc = pds.query_roc_auc('truth', 'mod_bst')
)

| group | auroc |
|---|---|
| str | f64 |
| "b" | 1.0 |
| "a" | 1.0 |
The Generator Trick
While pds.query_roc_auc() can calculate AUROC out of the box, it expects string column names as inputs – not expressions. That means we cannot benefit from polars’s selectors and expression expansion to calculate multiple combinations of columns with one line of code (e.g. calculating AUROC for each combination of truth and the varying model scores).
To apply an aggregation to multiple column subsets, the Polars docs recommend a pattern like this:
- write a wrapper function that handles the iteration and acts as a generator yielding the expression of interest
- obtain relevant selectors to pass into the function (here, you can use the cs.expand_selector() helper or any raw parsing of the column names)
- pass the generator into the standard df.group_by(...).agg(...) flow
def auroc_expressions(models):
    for m in models:
        yield pds.query_roc_auc( 'truth', m).alias(m)

mods = cs.expand_selector(df, cs.starts_with('mod_')) # could also do: [c for c in df.columns if c[:4] == 'mod_']

df.group_by('group').agg( auroc_expressions( mods ))

| group | mod_bad | mod_bst | mod_rnd | mod_mix |
|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 |
| "a" | -0.0 | 1.0 | 0.5 | 1.0 |
| "b" | -0.0 | 1.0 | 0.5 | 0.25 |
Complex Object Types
polars DataFrames can hold arbitrary objects (of datatype pl.Object) – not just scalars and vectors. This means, if we so choose, we can do complex multi-step tasks without leaving the DataFrame4.
Consider one final sample dataset:
data_dict = {
'group': ['a']*4 + ['b']*4,
'x': [0.99,0.75,0.25,0.01]*2,
'y': [0.99,0.75,0.25,0.01]+[0.5]*3+[0.6]
}
df = pl.DataFrame(data_dict)
df.glimpse()

Rows: 8
Columns: 3
$ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
$ x <f64> 0.99, 0.75, 0.25, 0.01, 0.99, 0.75, 0.25, 0.01
$ y <f64> 0.99, 0.75, 0.25, 0.01, 0.5, 0.5, 0.5, 0.6
If we wish, we can even use map_groups() to create a column that represents complex objects like models and then map_elements() to extract information from these models.
(
df.group_by('group').agg (
mod = pl.map_groups(
exprs = ['x', 'y'],
function = lambda x: sm.OLS( x[0].to_numpy(), sm.add_constant( x[1] )).fit() ,
return_dtype = pl.Object,
returns_scalar = True
)
)
.with_columns(
params = pl.col('mod').map_elements(lambda x: x.params, return_dtype = pl.List(pl.Float64)),
r_sq = pl.col('mod').map_elements(lambda x: x.rsquared, return_dtype = pl.Float64)
)
)

| group | mod | params | r_sq |
|---|---|---|---|
| str | object | list[f64] | f64 |
| "b" | <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000020D3D09A490> | [3.93, -6.533333] | 0.528971 |
| "a" | <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000020D5974DC90> | [-1.7347e-16, 1.0] | 1.0 |
This pattern can be very useful for tasks like bootstrap aggregation. It might not be the most computationally efficient approach (it is not parallelized), but for small problems where speed is not a game-changer, it can make for concise and readable analysis.
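To make that concrete, here is a rough sketch of my own of a small bootstrap aggregation that stays inside the DataFrame. The boot_slope_se() helper is hypothetical (it is not part of the post), and the details would change with the statistic you care about:

```python
import numpy as np
import polars as pl
import statsmodels.api as sm

rng = np.random.default_rng(0)

def boot_slope_se(x: pl.Series, y: pl.Series, n_boot: int = 500) -> float:
    # hypothetical helper: resample rows with replacement, refit the OLS of
    # x on y each time, and return the bootstrap standard error of the slope
    xv, yv = x.to_numpy(), y.to_numpy()
    idx = rng.integers(0, len(xv), size = (n_boot, len(xv)))
    slopes = [
        sm.OLS(xv[i], sm.add_constant(yv[i], has_constant = 'add')).fit().params[1]
        for i in idx
    ]
    return float(np.std(slopes))

df.group_by('group').agg(
    slope_se = pl.map_groups(
        exprs = ['x', 'y'],
        function = lambda s: boot_slope_se(s[0], s[1]),
        return_dtype = pl.Float64,
        returns_scalar = True
    )
)
```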
Partitions
However, just because you can keep everything in a DataFrame does not mean you should. The above pattern is useful if your end goal is to extract a singular quantity like a coefficient back into the DataFrame. However, if you ultimately want to go do other things with the objects you are generating, it may make for cleaner code to go ahead and break the DataFrame abstraction.
A final pattern I find particularly pleasant and effective is using the partition_by() method. This splits a DataFrame into separate frames based on grouping columns and organizes them in either a list (by default) or as a dictionary (when as_dict = True) indexed with a tuple containing the values of the grouping variable(s).
dfs = df.partition_by('group', as_dict = True, include_key = True)

for k,v in dfs.items():
    print(f"{k} : {v}")

('a',) : shape: (4, 3)
┌───────┬──────┬──────┐
│ group ┆ x ┆ y │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═══════╪══════╪══════╡
│ a ┆ 0.99 ┆ 0.99 │
│ a ┆ 0.75 ┆ 0.75 │
│ a ┆ 0.25 ┆ 0.25 │
│ a ┆ 0.01 ┆ 0.01 │
└───────┴──────┴──────┘
('b',) : shape: (4, 3)
┌───────┬──────┬─────┐
│ group ┆ x ┆ y │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═══════╪══════╪═════╡
│ b ┆ 0.99 ┆ 0.5 │
│ b ┆ 0.75 ┆ 0.5 │
│ b ┆ 0.25 ┆ 0.5 │
│ b ┆ 0.01 ┆ 0.6 │
└───────┴──────┴─────┘
This allows us to break up the data in the way we wish to process it, and then do that processing in more python-native syntax such as a list comprehension. I find this makes for highly concise and readable code and is ultimately a better strategy when further data wrangling is not needed.
dfs = df.partition_by('group', as_dict = True, include_key = True)
grps = [ k[0] for k in dfs.keys() ] # turn tuple to scalar bcs only one grouping var in key
mods = [ sm.OLS( d['x'].to_numpy(),
sm.add_constant( d['y'].to_numpy() )
).fit() for k,d in dfs.items()]
coef = [m.params[1] for m in mods]
dict(zip( grps, coef))

{'a': np.float64(1.0000000000000004), 'b': np.float64(-6.533333333333337)}
Footnotes
1. Beyond dplyr, analogous to much of what an R user might find in stringr, lubridate, tidyr, among others↩︎
2. This is a not-uncommon pattern in R via the tidyverse’s list columns↩︎
3. That is, passed as a named argument to pipe which will, in turn, pass it to the internal function being piped.↩︎
4. This mirrors patterns from tidymodels, dplyr, and purrr in R.↩︎