DuckDB over Pandas/Polars

November 1, 2024 2 minute read

Since my previous post on DuckDB (DuckDB as the New jq), I’ve been continuing to use and enjoy DuckDB.

Recently, I wanted to analyze and visualize some financial CSVs, including joining a few files together. I started out with Polars (which I understood to be a newer/better Pandas). However, as someone who doesn’t use it frequently, I found the syntax confusing and cumbersome.

For example, here is how I parsed a Transactions.csv and summed entries by Category for rows in 2024 (simplified example, code formatted with Black):

df = pl.read_csv("Transactions.csv")
df = (
    df.select("Date", "Category", "Amount")
    .with_columns(
        pl.col("Date").str.to_date("%m/%d/%Y"),
        pl.col("Amount")
        .map_elements(lambda amount: amount.replace("$", ""))
        .str.to_decimal(),
    )
    .filter(pl.col("Date") > datetime.date(2024, 1, 1))
    .group_by("Category")
    .agg(pl.col("Amount").sum())
)

print(df)

Things that tripped me up:

The syntax for selecting and transforming columns
Telling it how to parse the month/day/year column
Writing a lambda to strip out the $ (maybe there is a better way to do this?)
The mix of df. and pl. calls, such as calling df.group_by but passing in pl.col(...).sum(...) as the argument to the aggregation

I’m sure this is straightforward for someone who uses these tools frequently. However, that’s not me. I play around for a bit and then come back to it weeks or months later and have to relearn.

In contrast, I write SQL day in and day out, so I find it much easier. Once I switched to DuckDB, I could write much more familiar (to me) SQL, while still using python for the rest of the code:

results = duckdb.sql(
    """
    select
        Category,
        sum(replace(Amount, '$', '')::decimal) as Amount
    from read_csv('Transactions.csv')
    where Date > '2024-01-01'
    group by Category
"""
)
results.show()

Note that DuckDB automatically figured out how to parse the date column.

And I can even join multiple CSVs together with SQL and add more complex WHERE conditions:

results = duckdb.sql(
    """
    select
        c.Group,
        sum(replace(t.Amount, '$', '')::decimal) as Amount
    from read_csv('Transactions.csv') t
    join read_csv('Categories.csv') c on c.Category = t.Category
    where t.Date > '2024-01-01'
    and c.Type in ('Income', 'Expense')
    group by c.Group
"""
)
results.show()

In summary, I find DuckDB powerful, easy, and fun to use.

Update:

A Reddit comment showed me how to remove the map_elements:

pl.col("Amount").str.replace("\\$", "").str.to_decimal()

But I think the double use of .str is a good example of how this is complex to me as a casual user.

Update 2:

Another Reddit comment showed how to do “a shorter (no intermediary steps) and more efficient (scan) version”:

df = (
    pl.scan_csv("Transactions.csv")
    .filter(pl.col("Date").str.to_date("%m/%d/%Y") > datetime.date(2024, 1, 1))
    .group_by("Category")
    .agg(pl.col("Amount").str.replace("\\$", "").str.to_decimal().sum())
    .collect()
)

print(df)

Discussions:

There are some good discussions about this post, especially around the increased composability of Polars/Pandas vs SQL and better ways to write the Polars code:

Twitter Facebook LinkedIn

Paul Gross

DuckDB over Pandas/Polars

You May Also Enjoy

The Many Ways To Read Tech News

DuckDB as the New jq

Lessons Learned From Payments Startups

Storing External Requests