The polars API is a delight, in part because of its consistency. Transformations are chained sequentially onto the DataFrame in a consistent series of steps without leaving the DataFrame. This helps developers get “in the flow”, produces highly readable and well-structured code, and can make for a very natural transition for users coming from R’s tidyverse who tend to think about data transformations as a series of “pipes”.
While the API has excellent coverage over a wide range of standard transformations that data practitioners need1, users may often need to incorporate user-defined functions (UDFs) into their logic to leverage other python libraries for domain-specific problems. This arises frequently in data science, where tools for modeling and statistical testing are, by design, not built into polars itself.
This creates multiple points of potential friction:
- polars is highly readable due to the API’s consistent use of method chaining; applying UDFs shouldn’t break the flow
- users may wish to return more complex “non-scalar” datatypes (e.g. multidimensional arrays, model objects) into the DataFrame2
This post is a quick reference to demonstrate polars’ numerous capabilities for integrating different types of external (from other packages) or custom (user modularized) logic without breaking the flow of polars transformations. Using examples from data simulation, model evaluation, and inference, we will explore methods for applying UDFs for transformation and aggregation, transforming complex objects within a polars pipe, and easy “escape hatches” to break the abstraction when necessary.
TLDR
Whenever possible, it is most efficient to express your custom user-defined function (UDF) in the native polars API. When the API affords the logic you need, you can modularize that polars code into a function that takes an expression or a DataFrame as its first argument and add it to your polars code with:
- pipe() – allows piping of expressions and DataFrames into UDFs
- map_columns() – a custom pipe function capable of handling contexts like selectors
For arbitrary python logic to transform expressions (i.e. at the column level), you can use map_{batches|elements}() within with_columns():
- map_batches() – for applying non-polars vectorized functions (preferred)
- map_elements() – for applying non-vectorized functions (less efficient)
Similarly, for arbitrary expression aggregation, map_groups() can be used inside of agg():
- map_groups() – to keep everything in the DataFrame
However, there are numerous hacks and special cases to make your code either more efficient or more readable:
- polars extensions may provide a more native Rust implementation of the logic
- creating a generator function can mimic polars’s expression expansion, allowing you to apply the same transformation to many columns at once
- the ability of map_*() to return values of type pl.Object means you can fit any number of complex objects (e.g. models) into a polars pipeline that you wish to keep wrangling
- partition_by() provides an easy off-ramp for breaking out of the DataFrame abstraction for further processing with comfortable python-native patterns like list comprehensions
Set Up
We’ll load a few packages to begin:

import polars as pl
import polars.selectors as cs
import polars_ds as pds
import numpy as np
from numpy.random import binomial
from sklearn.metrics import roc_auc_score
import statsmodels.api as sm
Applying polars UDFs
Now, imagine you simply want to be able to apply and reuse a user-defined function (UDF) written with native polars logic. This is easily done with the pipe() method, which can be chained onto either expressions (logic that computes variables in the DataFrame) or full DataFrames. Writing additional transformation logic with polars’ native API is preferable wherever possible since it allows polars to use the same data representations and optimizations.
We’ll start with a boring toy dataset.
data_dict = {
'group': ['a']*4 + ['b']*4,
'x': np.arange(1,9,1),
'y': np.arange(8,0,-1),
'p': np.arange(1,9,1)/10
}
df = pl.DataFrame(data_dict)
df.glimpse()

Rows: 8
Columns: 4
$ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
$ x <i64> 1, 2, 3, 4, 5, 6, 7, 8
$ y <i64> 8, 7, 6, 5, 4, 3, 2, 1
$ p <f64> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
Columns (pipe and map_columns())
def cap(c:pl.Expr, ceil:int = 5) -> pl.Expr:
    return pl.when( c > ceil).then(ceil).otherwise( c )
df.with_columns( pl.col('x').pipe(cap))

| group | x | y | p | literal |
|---|---|---|---|---|
| str | i64 | i64 | f64 | i64 |
| "a" | 1 | 8 | 0.1 | 1 |
| "a" | 2 | 7 | 0.2 | 2 |
| "a" | 3 | 6 | 0.3 | 3 |
| "a" | 4 | 5 | 0.4 | 4 |
| "b" | 5 | 4 | 0.5 | 5 |
| "b" | 6 | 3 | 0.6 | 5 |
| "b" | 7 | 2 | 0.7 | 5 |
| "b" | 8 | 1 | 0.8 | 5 |
This also works when applying a transformation to multiple columns with selectors. However, the pipe can result in a conflict in which all output columns have the same name (unlike native chaining). This is fixed by appending .name.keep(), which accesses and reapplies the name of the initial column being mapped.
df.with_columns( cs.numeric().pipe(cap).name.keep() )

| group | x | y | p |
|---|---|---|---|
| str | i64 | i64 | f64 |
| "a" | 1 | 5 | 0.1 |
| "a" | 2 | 5 | 0.2 |
| "a" | 3 | 5 | 0.3 |
| "a" | 4 | 5 | 0.4 |
| "b" | 5 | 4 | 0.5 |
| "b" | 5 | 3 | 0.6 |
| "b" | 5 | 2 | 0.7 |
| "b" | 5 | 1 | 0.8 |
If you have no other transformations you wish to do simultaneously, map_columns() is a slightly more concise alternative which accepts a column selector and a single transformation to be applied to all passed columns.
df.map_columns( cs.numeric(), cap)

| group | x | y | p |
|---|---|---|---|
| str | i64 | i64 | f64 |
| "a" | 1 | 5 | 0.1 |
| "a" | 2 | 5 | 0.2 |
| "a" | 3 | 5 | 0.3 |
| "a" | 4 | 5 | 0.4 |
| "b" | 5 | 4 | 0.5 |
| "b" | 5 | 3 | 0.6 |
| "b" | 5 | 2 | 0.7 |
| "b" | 5 | 1 | 0.8 |
Data Frames (pipe)
Alternatively, you may wish to encapsulate logic operating at the level of the entire DataFrame versus an individual column.
def calc_diffs(df:pl.DataFrame, threshold:int = 5) -> pl.DataFrame:
    df_out = (
        df
        .with_columns(
            abs = (pl.col('x') - pl.col('y')).abs(),
            abs_gt_t = (pl.col('x') - pl.col('y')).abs() > threshold,
        )
    )
    return df_out

This too can be chained onto a DataFrame using pipe():
df.pipe(calc_diffs)

| group | x | y | p | abs | abs_gt_t |
|---|---|---|---|---|---|
| str | i64 | i64 | f64 | i64 | bool |
| "a" | 1 | 8 | 0.1 | 7 | true |
| "a" | 2 | 7 | 0.2 | 5 | false |
| "a" | 3 | 6 | 0.3 | 3 | false |
| "a" | 4 | 5 | 0.4 | 1 | false |
| "b" | 5 | 4 | 0.5 | 1 | false |
| "b" | 6 | 3 | 0.6 | 3 | false |
| "b" | 7 | 2 | 0.7 | 5 | false |
| "b" | 8 | 1 | 0.8 | 7 | true |
Values of other arguments to your function can be passed as kwargs3:
df.pipe(calc_diffs, threshold = 3)

| group | x | y | p | abs | abs_gt_t |
|---|---|---|---|---|---|
| str | i64 | i64 | f64 | i64 | bool |
| "a" | 1 | 8 | 0.1 | 7 | true |
| "a" | 2 | 7 | 0.2 | 5 | true |
| "a" | 3 | 6 | 0.3 | 3 | false |
| "a" | 4 | 5 | 0.4 | 1 | false |
| "b" | 5 | 4 | 0.5 | 1 | false |
| "b" | 6 | 3 | 0.6 | 3 | false |
| "b" | 7 | 2 | 0.7 | 5 | true |
| "b" | 8 | 1 | 0.8 | 7 | true |
This also allows us to write DataFrame-level functions that operate on different variables by passing the variables as parameters.
def calc_diffs(df:pl.DataFrame, var1:str = 'x', var2:str = 'y', threshold:int = 5) -> pl.DataFrame:
    df_out = (
        df
        .with_columns(
            abs = (pl.col(var1) - pl.col(var2)).abs(),
            abs_gt_t = (pl.col(var1) - pl.col(var2)).abs() > threshold,
        )
    )
    return df_out

df.pipe(calc_diffs, var1 = 'y', var2 = 'x', threshold = 3)

| group | x | y | p | abs | abs_gt_t |
|---|---|---|---|---|---|
| str | i64 | i64 | f64 | i64 | bool |
| "a" | 1 | 8 | 0.1 | 7 | true |
| "a" | 2 | 7 | 0.2 | 5 | true |
| "a" | 3 | 6 | 0.3 | 3 | false |
| "a" | 4 | 5 | 0.4 | 1 | false |
| "b" | 5 | 4 | 0.5 | 1 | false |
| "b" | 6 | 3 | 0.6 | 3 | false |
| "b" | 7 | 2 | 0.7 | 5 | true |
| "b" | 8 | 1 | 0.8 | 7 | true |
Applying custom series transformations
Piping is great, but it can break down when you need to apply column transformations requiring multiple columns as inputs or requiring logic outside of the polars API. That’s where map_batches() and map_elements() become useful.
These methods chain onto expressions just like other transformations. However, they can accept as arguments any arbitrary python function, as well as specifications for the type of return (scalar or vector, data types). The two methods differ in that map_batches() expects the function to be vectorized whereas map_elements() can use any arbitrary function (but assumes it will have to iterate over inputs).
With these methods, we can input one or more expressions from a DataFrame and return either a scalar or a vector output.
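Before turning to map_batches(), here is a minimal sketch of the map_elements() counterpart (a toy example of mine, not from the post): the Python function is called once per value, so any scalar logic works even though it is not vectorized.

```python
# map_elements() applies a plain Python function element by element
# (assumes the toy `df` defined above)
df.with_columns(
    p_label = pl.col('p').map_elements(
        lambda p: 'high' if p > 0.5 else 'low',
        return_dtype = pl.String,
    )
)
```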
Map Batches
Imagine we want to simulate draws from a binomial distribution, based on the sample size x and probability p in the dataset above.
In the simplest case in which our function receives one input, we can chain map_batches() onto that expression. Here, we simply provide the function of interest and optional fields to confirm that our return value is a scalar (the result of a single coin flip) of type integer:
# one column in, one value out
df.with_columns(
coin_flip = pl.col('p').map_batches(function = lambda p: binomial(n = 1, p = p), returns_scalar = True, return_dtype = pl.UInt16)
)

| group | x | y | p | coin_flip |
|---|---|---|---|---|
| str | i64 | i64 | f64 | u16 |
| "a" | 1 | 8 | 0.1 | 1 |
| "a" | 2 | 7 | 0.2 | 0 |
| "a" | 3 | 6 | 0.3 | 0 |
| "a" | 4 | 5 | 0.4 | 0 |
| "b" | 5 | 4 | 0.5 | 1 |
| "b" | 6 | 3 | 0.6 | 0 |
| "b" | 7 | 2 | 0.7 | 1 |
| "b" | 8 | 1 | 0.8 | 1 |
However, if our function requires multiple expressions as inputs, we must either create a struct or pass the names of those expressions to exprs in the top-level pl.map_batches() (which I find cleaner). The function you are mapping must similarly assume it is receiving an input containing those expressions in the same order, and thus access them through indexing.
# two columns in, one value out - with structs
df.with_columns(
coin_flip = pl.struct('x','p').map_batches(
function = lambda z: binomial(n = z.struct['x'], p = z.struct['p']),
returns_scalar = True, return_dtype = pl.UInt16)
)
# two columns in, one value out - with exprs
df.with_columns(
coin_flip = pl.map_batches(exprs = ['x', 'p'],
function = lambda z: binomial(n = z[0], p = z[1]),
returns_scalar = True, return_dtype = pl.UInt16)
)

| group | x | y | p | coin_flip |
|---|---|---|---|---|
| str | i64 | i64 | f64 | u16 |
| "a" | 1 | 8 | 0.1 | 0 |
| "a" | 2 | 7 | 0.2 | 0 |
| "a" | 3 | 6 | 0.3 | 2 |
| "a" | 4 | 5 | 0.4 | 2 |
| "b" | 5 | 4 | 0.5 | 2 |
| "b" | 6 | 3 | 0.6 | 4 |
| "b" | 7 | 2 | 0.7 | 5 |
| "b" | 8 | 1 | 0.8 | 6 |
Multiple Outputs
Finally, you can also return multiple outputs. Suppose we want to simulate 100 draws, not just 1. Our internal function can instead return an array. Afterward, we can compare the average outcome against the expected value to see that this worked as intended.
# many columns out
df.with_columns(
coin_flip = pl.struct('x','p').map_batches(
function = lambda z: binomial(n = z.struct['x'],
p = z.struct['p'],
size = (100,z.shape[0])
).transpose(),
return_dtype = pl.Array(pl.UInt16, 100)
)
).with_columns(
avg_outcome = pl.col('coin_flip').arr.mean(),
exp_value = pl.col('x') * pl.col('p')
)

| group | x | y | p | coin_flip | avg_outcome | exp_value |
|---|---|---|---|---|---|---|
| str | i64 | i64 | f64 | array[u16, 100] | f64 | f64 |
| "a" | 1 | 8 | 0.1 | [0, 0, … 0] | 0.09 | 0.1 |
| "a" | 2 | 7 | 0.2 | [0, 0, … 0] | 0.37 | 0.4 |
| "a" | 3 | 6 | 0.3 | [0, 1, … 1] | 0.94 | 0.9 |
| "a" | 4 | 5 | 0.4 | [1, 2, … 4] | 1.48 | 1.6 |
| "b" | 5 | 4 | 0.5 | [2, 1, … 4] | 2.47 | 2.5 |
| "b" | 6 | 3 | 0.6 | [4, 1, … 4] | 3.53 | 3.6 |
| "b" | 7 | 2 | 0.7 | [5, 5, … 5] | 4.7 | 4.9 |
| "b" | 8 | 1 | 0.8 | [8, 8, … 7] | 6.54 | 6.4 |
Applying custom aggregations (Map Groups)
Similar to column transformations, polars can also handle arbitrary data aggregation logic with map_groups().
Consider a DataFrame with multiple model scores:
data_dict = {
'group': ['a']*4 + ['b']*4,
'truth': [1,1,0,0]*2,
'mod_bad': [0.25,0.25,0.75,0.75]*2,
'mod_bst': [0.99,0.75,0.25,0.01]*2,
'mod_rnd': [0.5]*8,
'mod_mix': [0.99,0.75,0.25,0.01]+[0.5]*3+[0.6]
}
df = pl.DataFrame(data_dict)
df.glimpse()

Rows: 8
Columns: 6
$ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
$ truth <i64> 1, 1, 0, 0, 1, 1, 0, 0
$ mod_bad <f64> 0.25, 0.25, 0.75, 0.75, 0.25, 0.25, 0.75, 0.75
$ mod_bst <f64> 0.99, 0.75, 0.25, 0.01, 0.99, 0.75, 0.25, 0.01
$ mod_rnd <f64> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5
$ mod_mix <f64> 0.99, 0.75, 0.25, 0.01, 0.5, 0.5, 0.5, 0.6
map_groups() allows us to conduct arbitrary aggregations, such as calculating AUROC by group with roc_auc_score from the scikit-learn package. As you can see, the approach is largely the same: specify the expressions required, the function to apply, the return type, and the return structure.
df.group_by('group').agg (
pl.map_groups(
exprs = ['truth', 'mod_mix'],
function = lambda x: roc_auc_score(x[0], x[1]),
return_dtype = pl.Float64,
returns_scalar = True
)
)

| group | truth |
|---|---|
| str | f64 |
| "b" | 0.25 |
| "a" | 1.0 |
Alternatives & extensions
Once you understand the different mapping capabilities available in polars, you can use them effectively either to expand the paradigm or to decide when you want to deviate from it. We’ll conclude by looking at some examples of each.
Extension libraries
Recall that native Rust and polars implementations will generally be faster than the techniques shown. Thus, another good option is to familiarize yourself with the burgeoning ecosystem of polars extensions to see if one suits your needs. Awesome Polars maintains a growing list of such packages.
For example, the polars-ds package can natively handle the AUROC use case above as it provides many useful evaluation functions:
df.group_by('group').agg (
auroc = pds.query_roc_auc('truth', 'mod_bst')
)

| group | auroc |
|---|---|
| str | f64 |
| "b" | 1.0 |
| "a" | 1.0 |
The Generator Trick
While pds.query_roc_auc() can calculate AUROC out of the box, it expects string column names as inputs – not expressions. That means we cannot benefit from polars’s selectors and expression expansion to calculate multiple combinations of columns with one line of code (e.g. calculating AUROC for each combination of truth and the varying model scores).
To apply an aggregation to multiple column subsets, the Polars docs recommend a pattern like this:
- write a wrapper function that handles the iteration and acts as a generator yielding the expression of interest
- obtain relevant selectors to pass into the function (here, you can use the cs.expand_selector() helper or any raw parsing of the column names)
- pass the generator into the standard df.group_by(...).agg(...) flow
def auroc_expressions(models):
    for m in models:
        yield pds.query_roc_auc( 'truth', m).alias(m)

mods = cs.expand_selector(df, cs.starts_with('mod_')) # could also do: [c for c in df.columns if c[:4] == 'mod_']

df.group_by('group').agg( auroc_expressions( mods ))

| group | mod_bad | mod_bst | mod_rnd | mod_mix |
|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 |
| "a" | -0.0 | 1.0 | 0.5 | 1.0 |
| "b" | -0.0 | 1.0 | 0.5 | 0.25 |
Complex Object Types
polars DataFrames can hold arbitrary objects (of datatype pl.Object) – not just scalars and vectors. This means, if we so choose, we can do complex multi-step tasks without leaving the DataFrame4.
Consider one final sample dataset:
data_dict = {
'group': ['a']*4 + ['b']*4,
'x': [0.99,0.75,0.25,0.01]*2,
'y': [0.99,0.75,0.25,0.01]+[0.5]*3+[0.6]
}
df = pl.DataFrame(data_dict)
df.glimpse()

Rows: 8
Columns: 3
$ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
$ x <f64> 0.99, 0.75, 0.25, 0.01, 0.99, 0.75, 0.25, 0.01
$ y <f64> 0.99, 0.75, 0.25, 0.01, 0.5, 0.5, 0.5, 0.6
If we wish, we can even use map_groups() to create a column that represents complex objects like models and then map_elements() to extract information from these models.
(
df.group_by('group').agg (
mod = pl.map_groups(
exprs = ['x', 'y'],
function = lambda x: sm.OLS( x[0].to_numpy(), sm.add_constant( x[1] )).fit() ,
return_dtype = pl.Object,
returns_scalar = True
)
)
.with_columns(
params = pl.col('mod').map_elements(lambda x: x.params, return_dtype = pl.List(pl.Float64)),
r_sq = pl.col('mod').map_elements(lambda x: x.rsquared, return_dtype = pl.Float64)
)
)

| group | mod | params | r_sq |
|---|---|---|---|
| str | object | list[f64] | f64 |
| "b" | <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000020D3D09A490> | [3.93, -6.533333] | 0.528971 |
| "a" | <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000020D5974DC90> | [-1.7347e-16, 1.0] | 1.0 |
This pattern can be very useful for tasks like bootstrap aggregation. It might not be the most computationally efficient approach (it is not parallelized), but for small problems where speed is not a game-changer, it can make for concise and readable analysis.
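To make that concrete, here is a rough sketch of my own of a small bootstrap aggregation that stays inside the DataFrame. The boot_slope_se() helper is hypothetical (it is not part of the post), and the details would change with the statistic you care about:

```python
import numpy as np
import polars as pl
import statsmodels.api as sm

rng = np.random.default_rng(0)

def boot_slope_se(x: pl.Series, y: pl.Series, n_boot: int = 500) -> float:
    # hypothetical helper: resample rows with replacement, refit the OLS of
    # x on y each time, and return the bootstrap standard error of the slope
    xv, yv = x.to_numpy(), y.to_numpy()
    idx = rng.integers(0, len(xv), size = (n_boot, len(xv)))
    slopes = [
        sm.OLS(xv[i], sm.add_constant(yv[i], has_constant = 'add')).fit().params[1]
        for i in idx
    ]
    return float(np.std(slopes))

df.group_by('group').agg(
    slope_se = pl.map_groups(
        exprs = ['x', 'y'],
        function = lambda s: boot_slope_se(s[0], s[1]),
        return_dtype = pl.Float64,
        returns_scalar = True
    )
)
```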
Partitions
However, just because you can keep everything in a DataFrame does not mean you should. The above pattern is useful if your end goal is to extract a singular quantity like a coefficient back into the DataFrame. However, if you ultimately want to go do other things with the objects you are generating, it may make for cleaner code to go ahead and break the DataFrame abstraction.
A final pattern I find particularly pleasant and effective is using the partition_by() method. This splits a DataFrame into separate frames based on grouping columns and organizes them in either a list (by default) or as a dictionary (when as_dict = True) indexed with a tuple containing the values of the grouping variable(s).
dfs = df.partition_by('group', as_dict = True, include_key = True)

for k,v in dfs.items():
    print(f"{k} : {v}")

('a',) : shape: (4, 3)
┌───────┬──────┬──────┐
│ group ┆ x ┆ y │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═══════╪══════╪══════╡
│ a ┆ 0.99 ┆ 0.99 │
│ a ┆ 0.75 ┆ 0.75 │
│ a ┆ 0.25 ┆ 0.25 │
│ a ┆ 0.01 ┆ 0.01 │
└───────┴──────┴──────┘
('b',) : shape: (4, 3)
┌───────┬──────┬─────┐
│ group ┆ x ┆ y │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═══════╪══════╪═════╡
│ b ┆ 0.99 ┆ 0.5 │
│ b ┆ 0.75 ┆ 0.5 │
│ b ┆ 0.25 ┆ 0.5 │
│ b ┆ 0.01 ┆ 0.6 │
└───────┴──────┴─────┘
This allows us to break up the data in the way we wish to process it, and then do that processing in more python-native syntax such as a list comprehension. I find this makes for highly concise and readable code and is ultimately a better strategy when further data wrangling is not needed.
dfs = df.partition_by('group', as_dict = True, include_key = True)
grps = [ k[0] for k in dfs.keys() ] # turn tuple to scalar bcs only one grouping var in key
mods = [ sm.OLS( d['x'].to_numpy(),
sm.add_constant( d['y'].to_numpy() )
).fit() for k,d in dfs.items()]
coef = [m.params[1] for m in mods]
dict(zip( grps, coef))

{'a': np.float64(1.0000000000000004), 'b': np.float64(-6.533333333333337)}
Footnotes
1. Beyond dplyr, analogous to much of what an R user might find in stringr, lubridate, tidyr, among others↩︎
2. This is a not-uncommon pattern in R via the tidyverse’s list columns↩︎
3. That is, passed as a named argument to pipe which will, in turn, pass it to the internal function being piped.↩︎
4. This mirrors patterns from tidymodels, dplyr, and purrr in R.↩︎