DuckDB over Pandas/Polars
Since my previous post on DuckDB (DuckDB as the New jq), I’ve been continuing to use and enjoy DuckDB.
Recently, I wanted to analyze and visualize some financial CSVs, including joining a few files together. I started out with Polars (which I understood to be a newer/better Pandas). However, as someone who doesn’t use it frequently, I found the syntax confusing and cumbersome.
For example, here is how I parsed a Transactions.csv and summed entries by Category for rows in 2024 (simplified example, code formatted with Black):
import datetime

import polars as pl

df = pl.read_csv("Transactions.csv")
df = (
    df.select("Date", "Category", "Amount")
    .with_columns(
        pl.col("Date").str.to_date("%m/%d/%Y"),
        pl.col("Amount")
        .map_elements(lambda amount: amount.replace("$", ""))
        .str.to_decimal(),
    )
    .filter(pl.col("Date") > datetime.date(2024, 1, 1))
    .group_by("Category")
    .agg(pl.col("Amount").sum())
)
print(df)
Things that tripped me up:

- The syntax for selecting and transforming columns
- Telling it how to parse the month/day/year column
- Writing a lambda to strip out the $ (maybe there is a better way to do this?)
- The mix of df. and pl. calls, such as calling df.group_by but passing in pl.col(...).sum(...) as the argument to the aggregation
I’m sure this is straightforward for someone who uses these tools frequently. However, that’s not me. I play around for a bit and then come back to it weeks or months later and have to relearn.
In contrast, I write SQL day in and day out, so I find it much easier. Once I switched to DuckDB, I could write much more familiar (to me) SQL, while still using Python for the rest of the code:
import duckdb

results = duckdb.sql(
    """
    select
        Category,
        sum(replace(Amount, '$', '')::decimal) as Amount
    from read_csv('Transactions.csv')
    where Date > '2024-01-01'
    group by Category
    """
)
results.show()
Note that DuckDB automatically figured out how to parse the date column.
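If you ever want to check what the sniffer inferred, or override a wrong guess, a DESCRIBE query shows the detected types, and (if I remember the option name right) read_csv accepts an explicit dateformat:

# Show the column names and types DuckDB's CSV sniffer detected
duckdb.sql("describe select * from read_csv('Transactions.csv')").show()

# Override the inferred date format explicitly (dateformat is a read_csv option)
duckdb.sql(
    "select * from read_csv('Transactions.csv', dateformat = '%m/%d/%Y')"
).show()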
And I can even join multiple CSVs together with SQL and add more complex WHERE conditions:
results = duckdb.sql(
    """
    select
        c.Group,
        sum(replace(t.Amount, '$', '')::decimal) as Amount
    from read_csv('Transactions.csv') t
    join read_csv('Categories.csv') c on c.Category = t.Category
    where t.Date > '2024-01-01'
        and c.Type in ('Income', 'Expense')
    group by c.Group
    """
)
results.show()
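And since duckdb.sql returns a relation object, pulling the results back into Python for charting takes one more call; as far as I know, .df() and .pl() are the built-in conversions (to pandas and Polars respectively, assuming those packages are installed):

df = results.df()  # materialize the relation as a pandas DataFrame
pl_df = results.pl()  # or as a Polars DataFrame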
In summary, I find DuckDB powerful, easy, and fun to use.
Update:
A Reddit comment showed me how to remove the map_elements call:
pl.col("Amount").str.replace("\\$", "").str.to_decimal()
But I think the double use of .str is a good example of how this is complex to me as a casual user.
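For what it’s worth, I believe str.replace also accepts literal=True, which avoids escaping the $ (though the double .str remains):

# literal=True treats "$" as plain text rather than a regex anchor
pl.col("Amount").str.replace("$", "", literal=True).str.to_decimal()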
Update 2:
Another Reddit comment showed how to do “a shorter (no intermediary steps) and more efficient (scan) version”:
df = (
    pl.scan_csv("Transactions.csv")
    .filter(pl.col("Date").str.to_date("%m/%d/%Y") > datetime.date(2024, 1, 1))
    .group_by("Category")
    .agg(pl.col("Amount").str.replace("\\$", "").str.to_decimal().sum())
    .collect()
)
print(df)
Discussions:
There are some good discussions about this post, especially around the increased composability of Polars/Pandas vs SQL and better ways to write the Polars code: