Polars - modern data frame library

Apr 15, 2023

Intro

Almost five years passed since last time I’ve extensively used data frame library to calculate things. I sort of switched from analyzing data and building application around that to more “regular” programming and data engineering. I sometimes still use data.table in R or pandas in Python for ad-hoc analysis but nowadays I rarely need to do it. For completeness I should add that data frame library is a software library that support operations on data frames which are some kind of tables, usually kept in memory used for data analysis.

Recently I was exposed to polars. As they describe Lightning-fast DataFrame library for Rust and Python. Initially I was neutral, thinking “ok just another data frame library just written in rust”, but it turned out to be better in several aspects than other libraries of this kind. I got excited about data frame library again, I didn’t expect this to happen.

In this post I’m gonna describe why I thinks it’s better alternative to pandas, dplyr or data.table and why I’m excited.

My history with data frame libraries

At the beginning of my career, when I was basically a data analyst, I used data frame libraries daily. Mostly I used dplyr and data.table in R and when I was working in Python I’ve used pandas. After around 2-3 years using mainly dplyr I switched to data.table due to better performance, more stable API and much fewer dependencies.

Sample of syntax

I think it’s a good idea to expose sample of Polars syntax before we move on. Here is a sample of basic usage of Polars library in Python

import polars as pl


df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)

df2 = (
    df.select(
        pl.col("A"),
        pl.col("fruits"),
        pl.col("B").max().alias("B_max")
    )
    .filter(pl.col("fruits") != "banana")
    .groupby("fruits")
    .agg(
        pl.col("A").sum().alias("A_sum"),
        pl.col("B_max").max()
    )
)

We can see there a simple manipulation on columns, filtration and aggregation.

Why am I excited about Polars?

Lazy evaluation

Polars provide very solid implementation of lazy evaluation comparing to other data frame libraries. To perform transformations in lazy fashion you just need to call .lazy() method on a DataFrame which would convert it to LazyFrame which has almost identical API as regular DataFrame. The difference is that operation will be performed when .collect() method will be called. There are at least two advantages of using this approach. One is a room for optimization. Usually we perform many operations on data frames. In (default) eager mode top-level operations are executed sequentially. In lazy approach, just before .collect() we are aware of all operations and the engine can perform optimization because at the time no operation is being executed yet. I’ve not precisely measure the difference in performance but it feels to be significant. Let’s consider the following example on a ~4GB data frame.

df2 = (
    df
    .select(
        pl.when(pl.col("id") % 13 == 0).then(-1).otherwise(pl.col("id")).alias("special_id"),
        pl.col("animal"),
        pl.col("rand_num")
    )
    .filter(
        (pl.col("rand_num") >= 50.0) & (pl.col("rand_num") < 250.0))
    .groupby("animal")
    .agg(
        pl.col("rand_num").mean().alias("rand_num_avg"),
        pl.col("rand_num").max().alias("rand_num_max"),
        pl.col("rand_num").filter(pl.col("special_id") == -1).mean().alias("rand_num_special_avg")
    )
    .sort("animal")
)

It took on my MacBook Air M1 (8GB RAM) around 12 seconds to calculate. When we use lazy API which differers only in adding .lazy() and .collect():

df2 = (
    df
    .lazy()
    .select(
        ...
    .sort("animal")
    .collect()
)

it takes on average 8 seconds. This was very straightforward transformation and still we got over 30% faster using lazy API. Moreover lazy API is very convenient to split transformation into multiply functions where each takes polars.LazyFrame as input argument and return polars.LazyFrame.

The second important advantage of lazy evaluation is operating on bigger data then fits in memory but still using the same API. We can use scan_csv to lazily read data from disk. Here’s an example of data aggregation done on 27GB file on 8GB RAM MacBook:

# Analyzing 27GB data file on 8GB RAM
df_big = pl.scan_csv(source='test.csv', has_header=True, sep=";")

(
    df_big
    .groupby('animal')
    .agg(
        [
            pl.count().alias('cnt'),
            pl.sum('rand_num').alias('rand_num_sum'),
            pl.avg('rand_num').alias('rand_num_avg')
        ]
    )
    .collect()
)

It was executed on average in 38 seconds. For comparison I’ve run similar aggregation, but just for single sum to keep it even simpler on AWK using the following command

awk -F ';' '{ x[$2] += $3 } END { print x }' test.csv

It took 16 minutes(!) on the same 27GB csv file.

Expressions

One of fundamental building blocks in Polars are Polars expressions. In general Polars expression is any function that transforms Polars series into another Polars series. There are few advantageous aspects of Polars expressions. Firstly expressions are optimized. Particularly if expression need to be executed on multiple columns, then it will be parallelized. It’s one of reasons behind Polars high performance. Another aspect is the fact the Polars implements an extensive set of builtin expressions that user can compose (chain) into more complex expressions. Functional approach of chaining expressions enable great developer experience in my opinion. We have already seen some examples of expressions like

pl.col("A").sum().alias("A_sum")

What I additionally really like about expression is the fact that it’s a separate thing. Somewhat detached from DataFrame object itself. What I mean is that we can freely define custom transformation in isolation from the actual data on which we would apply those transformation. In real world use cases of data frame libraries data transformations get very complex in matter of seconds. Having a way to break down complexity in clean form and without impact on performance is in my opinion one of the most important characteristics.

Let’s consider the following example

def sum_positive(expr: pl.Expr) -> pl.Expr:
    return expr.filter(expr > 0).sum()

def merge_if(col_name_a: str, col_name_b: str, predicate: pl.Expr) -> pl.Expr:
    """Merges two columns into one by given predicate"""
    return (
        pl.when(predicate).
        then(pl.col(col_name_a)).
        otherwise(pl.col(col_name_b))
    )

df.with_columns(
    sum_positive(pl.col("B")).alias("B_sum_pos")
).with_columns(
    merge_if("B", "B_sum_pos", pl.col("fruits") == "banana").alias("B_or_pos")
)


shape: (5, 6)
┌─────┬────────┬─────┬────────┬───────────┬──────────┐
│ A   ┆ fruits ┆ B   ┆ cars   ┆ B_sum_pos ┆ B_or_pos │
│ --- ┆ ---    ┆ --- ┆ ---    ┆ ---       ┆ ---      │
│ i64 ┆ str    ┆ i64 ┆ str    ┆ i64       ┆ i64      │
╞═════╪════════╪═════╪════════╪═══════════╪══════════╡
│ 1   ┆ banana ┆ 5   ┆ beetle ┆ 15        ┆ 5        │
│ 2   ┆ banana ┆ 4   ┆ audi   ┆ 15        ┆ 4        │
│ 3   ┆ apple  ┆ 3   ┆ beetle ┆ 15        ┆ 15       │
│ 4   ┆ apple  ┆ 2   ┆ beetle ┆ 15        ┆ 15       │
│ 5   ┆ banana ┆ 1   ┆ beetle ┆ 15        ┆ 1        │
└─────┴────────┴─────┴────────┴───────────┴──────────┘

We defined functions that either takes a Polar expression or strings which are meant to be column names. Usually pl.Expr sounds more generic but in case when this expression is mostly use for pl.col(_) case, then probably choosing string-based API would be more concise. Anyway the fact that we can compose and decompose expressions is great.

Apache Arrow format

Polars uses Apache Arrow format for storing the data in data frame. It’s very cool, because Apache Arrow format is very cool. Apache aims to standardize data format for fast and efficient computation. The standardization part here is the most important. If you need to exchange data between systems and all of them know how to read/write the same (binary) format, then there is no need to serialization and deserialization which causes a lot of waste. In my personal experience it was a huge win when we use Polars together with Ray. We can store and then retrieve Polars data frames from ray without any serialization, which improved performance and made interoperability seamless.

Python-Rust interoperability

Polars, as name suggests (Pola[rs]), is written primarily in Rust but is also exposed as Python library. What I really like it is the fact that one can easily implement new custom transformation in Rust and intergate it with Polars in Python. More details in the next paragraph.

Rust

As I said I really like Polars implementation in Rust. Writing high performance transformation run in parallel or async is much easier then doing the same in C. At least for some people. So if you have crucial piece of data transformation that is run frequently and probably on big data sets and needs to be executed as fast as possible you can spend some time and write custom optimized implementation in Rust and then use it from your Python project.

I’ve tried to do so. It was my first time writing Python module in Rust. It turned out to be simpler then I thought. Let me briefly sketch how minimal setup looks like.

Python module written in Rust

To write a Python module in rust one can use great library pyo3 which defines Python bindings in Rust. Another tool which helped me with this task was maturin a Python library for building and publishing Rust code as a Python module. It’s not necessary but it’s very convenient. If you want to follow above examples you need to have Python (3.x) and Rust compiler installed.

Let’s start from creating a new project and virtual env:

mkdir rust_from_python
cd rust_from_python
python3 -m venv .venv
source .venv/bin/activate
pip install maturin

Now we can create a new Rust project within rust_from_python catalog using maturin new command.

maturin new my_rust_module

# You can choose option how it can be created (I use pyo3)
# The following catalog will be created:

.
└── my_rust_module
    ├── Cargo.toml
    ├── pyproject.toml
    └── src
        └── lib.rs

3 directories, 3 files

By default ./my_rust_module/src/lib.rs contains a single function sum_as_string as an example:

use pyo3::prelude::*;

/// Formats the sum of two numbers as string.
#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}

/// A Python module implemented in Rust.
#[pymodule]
fn my_rust_module(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    Ok(())
}

To build the Rust project as a Python module using maturin we can simply execute maturin develop from ./my_rust_module catalog. And it’s done! Easy as that. Now we can use it in our Python project. For example let’s create a main.py script in the rust_from_python top catalog:

from my_rust_module import sum_as_string

def main() -> None:
    print(sum_as_string(42, 13))

if __name__ == '__main__':
    main()

Now we can just run it and we’ll get 55 as a result.

Polars extension in Rust

Author or Polars, Ritchie Vink, wrote additionally a helper Rust crate (library) pyo3-polars to make passing DataFrame and Series between Python and Rust even simpler. This crate basically defines two new types in Rust PyDataFrame and PySeries which can be converted to and from Python Polars DataFrame and Series.

Now let’s try to add new function written in Rust that will take a polars DataFrame from Python, then will calculate a new column in Rust and return it back to Python. To do so we need to add polars and pyo3-polars dependencies in our my_rust_module project.

// Cargo.toml

[dependencies]
polars = { version = "0.27.2", features = ["fmt"] }
polars-core = { version = "0.27.2" }
pyo3-polars = "0.2.0"

Now we can add new function calc_new_column_in_rust which takes a Polars DataFrame from Python, calculates a new column in Rust and finally return Python Polars DataFrame back to Python. Our ./my_rust_module/src/lib.rs after adding new function looks like this

use pyo3::prelude::*;
use pyo3::exceptions::PyBaseException;
use pyo3_polars::PyDataFrame;
use polars::frame::DataFrame;
use polars_core::prelude::*;


/// Formats the sum of two numbers as string.
#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}

/// A Python module implemented in Rust.
#[pymodule]
fn my_rust_module(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    m.add_function(wrap_pyfunction!(calc_new_column_in_rust, m)?)?;
    Ok(())
}

#[pyfunction]
fn calc_new_column_in_rust(pydf: PyDataFrame, col_name: &str) -> PyResult<PyDataFrame> {
    let mut df: DataFrame = pydf.into();
    let rows = df.height();

    let mut xs: Vec<i32> = vec!{};
    for i in 0..rows {
        xs.push((i as i32) + 143 / 4)
    }

    let s = Series::new(col_name, &xs);
    let df2 = df.with_column(s);

    match df2 {
        Ok(frame) => Ok(PyDataFrame(frame.clone())),
        _ => Err(PyBaseException::new_err("Error from rust"))
    }
}

Now we are ready to do maturin develop once again to build new version of Rust extension module. Once it’s done we can use our new Rust function calc_new_column_in_rust in our Python code:

import polars as pl
from my_rust_module import sum_as_string, calc_new_column_in_rust

def main() -> None:
    print(sum_as_string(42, 13))
    df = pl.DataFrame({"A": [1, 42, 13, -10]})
    df2 = calc_new_column_in_rust(df, "rust_col")
    print(df2)


if __name__ == '__main__':
    main()

Summary

In summary I really enjoyed Polars after my initial encounter. I liked it performance, expressions and how easy it is to write custom extensions in Rust. For sure I’ll continue to learn more about Polars. Perhaps it would be a good idea to write few examples in Rust using Polars next time and take a closer look into Polars implementation details.