Language interoperability and different ways of enabling "polyglot" workflows seem to have taken center stage in the data world recently:
- Apache Arrow promises a language-independent memory format for interoperability
- RStudio announced its rebranding as Posit to cement its place as a leader in language-agnostic data tooling
- RStudio simultaneously announced Quarto as an interoperable alternative to RMarkdown which treats python, Julia, and JS as first-class citizens
- dbt has released a beta of python models to extend its previously SQL-focused paradigm
As a general matter, these are all exciting advances with great potential to aid different workflows when used judiciously. However, they also pose a question: what cognitive burdens do we alleviate and which do we add when our projects begin to leverage multiple languages?
Despite common data analysis tools like SQL, R, and python being high-level languages with declarative interfaces (in the case of R's `tidyverse` and python's `pandas`), successful usage still requires understanding the underlying assumptions and operations of each tool. There is no such thing as a truly declarative language; only languages that generally make decisions the user likes well enough to ask for the "what" and delegate the "how". These differences can emerge at many different levels: foundational issues like whether data structures are copied or modified in place, or broader design choices like default hyperparameters in machine learning libraries (e.g. python's scikit-learn notoriously uses regularized logistic regression as the default for logistic regression).
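As a concrete illustration of that last default, here is a minimal scikit-learn sketch (note that the exact argument for disabling regularization differs across scikit-learn versions):

from sklearn.linear_model import LogisticRegression

# the default constructor quietly applies L2 regularization with C = 1.0
clf = LogisticRegression()
print(clf.penalty, clf.C)  # 'l2' 1.0

# an unregularized fit must be requested explicitly
# (penalty=None in recent versions; penalty='none' in older ones)
unregularized = LogisticRegression(penalty=None)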
Somewhere along that spectrum lies the fickle issue of handling null values.
In this post, I recap a quick case study of how incautious null handling risks the validity of a data analysis. Then, taking a step back, I compare how R, python, and SQL behave differently when confronted with null values and the implications for analysts switching between languages.
TLDR
A summary of these different behaviors is provided below:
 | R | python | SQL |
---|---|---|---|
Column Aggregation | NA | np: NA, pd: Value | Value |
Row-wise Transformation | NA | NA | NA |
Joining | Match by default | Match | No match |
Filtering | No match | Match | No match |
Case Study
Before comparing different languages, let's walk through a brief case study to see all the ways that "lurking" nulls can surprise a junior analyst in any one language and to observe a few different "contours" of the problem space.
Consider two tables in a retailer's database. The `spend` table reports total sales by month and store identifier (null if online):

STORE_ID | MONTH | AMT_SPEND |
---|---|---|
1 | 1 | 100.12011 |
2 | 1 | 100.31441 |
NA | 1 | 100.40517 |
1 | 2 | 99.67098 |
2 | 2 | 98.39703 |
NA | 2 | 98.81231 |
1 | 3 | 102.27124 |
2 | 3 | 100.20843 |
NA | 3 | NA |

Similarly, the `returns` table reports returned sales at the same grain:
STORE_ID | MONTH | AMT_RETURN |
---|---|---|
1 | 1 | NA |
2 | 1 | 9.972159 |
NA | 1 | 10.071639 |
1 | 2 | 9.798444 |
2 | 2 | 10.254347 |
NA | 2 | 9.881071 |
1 | 3 | 10.108880 |
2 | 3 | 9.951398 |
NA | 3 | 9.849277 |
In both cases, nulls are used in the `AMT_*` fields to denote zeros for the respective `month` x `store_id` combinations.
To calculate something as simple as the average gross spend per store across months, an analyst might attempt to write:
select
  store_id, avg(amt_spend)
from spend
group by 1
order by 1
store_id | avg(amt_spend) |
---|---|
NA | 99.60874 |
1 | 100.68744 |
2 | 99.63996 |
However, because SQL silently drops nulls in column aggregations, the online spend is not appropriately “penalized” for its lack of March spend. The averages across all three stores look nearly equal.
Not only is this answer "wrong", it can also be thought of as fundamentally changing the computand (a word I just made up. In statistics, we talk about estimands as "the conceptual thing we are trying to estimate with an estimator". Here, we aren't estimating anything, just computing. But there's still a conceptual "thing we are trying to measure", and in this case it's our tools and not our methods that are imposing assumptions on it) to one that answers a fundamentally different question:
Instead of measuring "average monthly spend in Q1 by store", we're measuring "average monthly spend in Q1 by store, conditional on there being spend".
To obtain the correct result, one would write:
select
  store_id,
  -- wrong answers
  avg(amt_spend) as wrong1,
  sum(amt_spend) / count(amt_spend) as wrong2,
  -- right answers
  sum(amt_spend) / count(1) as right1,
  avg(coalesce(amt_spend, 0)) as right2
from spend
group by 1
order by 1
store_id | wrong1 | wrong2 | right1 | right2 |
---|---|---|---|---|
NA | 99.60874 | 99.60874 | 66.40583 | 66.40583 |
1 | 100.68744 | 100.68744 | 100.68744 | 100.68744 |
2 | 99.63996 | 99.63996 | 99.63996 | 99.63996 |
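For comparison, a rough `pandas` equivalent of the same calculation is sketched below, using a toy reconstruction of the `spend` table (values rounded; NaN stands in for SQL's null):

import numpy as np
import pandas as pd

spend = pd.DataFrame({
    'STORE_ID': [1, 2, np.nan] * 3,
    'MONTH': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'AMT_SPEND': [100.12, 100.31, 100.41, 99.67, 98.40, 98.81, 102.27, 100.21, np.nan],
})

grp = spend.groupby('STORE_ID', dropna=False)['AMT_SPEND']
wrong = grp.mean()                               # skips the NaN, like SQL's avg()
right = grp.apply(lambda s: s.fillna(0).mean())  # treats the missing spend as zero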
With a better understanding of gross sales, the analyst might next proceed to compute net sales. This first requires joining the `spend` and `returns` tables. Naively, they might attempt:
select
  spend.*,
  returns.amt_return
from spend
inner join returns on
  spend.store_id = returns.store_id and
  spend.month = returns.month
STORE_ID | MONTH | AMT_SPEND | amt_return |
---|---|---|---|
1 | 1 | 100.12011 | NA |
2 | 1 | 100.31441 | 9.972159 |
1 | 2 | 99.67098 | 9.798444 |
2 | 2 | 98.39703 | 10.254347 |
1 | 3 | 102.27124 | 10.108880 |
2 | 3 | 100.20843 | 9.951398 |
However, this once again fails. Why? Although SQL handled nulls "permissively" when aggregating a column, it took a stricter stance when making the comparison `spend.store_id = returns.store_id` in the join clause. SQL doesn't recognize two nulls as equal. To the extent that null means "I dunno" versus "this field is not relevant to this observation", it's reasonable that SQL should find it hard to decide whether two "I dunno"s are equal.
Once again, this isn’t a “random” or inconsequential error. Continuing to use this corrupted dataset changes the computand from “net sales by month” to “net sales by month at physical retail locations”.
To remedy this, we can force `store_id` to take on a value:
select
  spend.*,
  returns.amt_return
from spend
inner join returns on
  coalesce(spend.store_id, 999) = coalesce(returns.store_id, 999) and
  spend.month = returns.month
STORE_ID | MONTH | AMT_SPEND | amt_return |
---|---|---|---|
1 | 1 | 100.12011 | NA |
2 | 1 | 100.31441 | 9.972159 |
NA | 1 | 100.40517 | 10.071639 |
1 | 2 | 99.67098 | 9.798444 |
2 | 2 | 98.39703 | 10.254347 |
NA | 2 | 98.81231 | 9.881071 |
1 | 3 | 102.27124 | 10.108880 |
2 | 3 | 100.20843 | 9.951398 |
NA | 3 | NA | 9.849277 |
Next, we can proceed to compute sales by month, net of returns, across all stores:
select
  spend.month,
  sum(amt_spend - amt_return) as net_spend
from spend
inner join returns on
  coalesce(spend.store_id, 999) = coalesce(returns.store_id, 999) and
  spend.month = returns.month
group by 1
order by 1
month | net_spend |
---|---|
1 | 180.6758 |
2 | 266.9465 |
3 | 182.4194 |
However, by now, you should not be surprised that this result is also incorrect. If we inspect the sequence of computations, we realize that SQL is also stricter in its null handling in row-wise computations than in column-wise aggregations. The subtraction of `amt_spend` and `amt_return` obliterates the total when either is null. So, we fail to include the gross spend at Store 1 in January simply because there were no returns (and vice versa for online sales in March).
select
  spend.month,
  spend.store_id,
  amt_spend,
  amt_return,
  amt_spend - amt_return as net_spend
from spend
inner join returns on
  coalesce(spend.store_id, 999) = coalesce(returns.store_id, 999) and
  spend.month = returns.month
month | store_id | amt_spend | amt_return | net_spend |
---|---|---|---|---|
1 | 1 | 100.12011 | NA | NA |
1 | 2 | 100.31441 | 9.972159 | 90.34225 |
1 | NA | 100.40517 | 10.071639 | 90.33353 |
2 | 1 | 99.67098 | 9.798444 | 89.87254 |
2 | 2 | 98.39703 | 10.254347 | 88.14268 |
2 | NA | 98.81231 | 9.881071 | 88.93124 |
3 | 1 | 102.27124 | 10.108880 | 92.16236 |
3 | 2 | 100.20843 | 9.951398 | 90.25704 |
3 | NA | NA | 9.849277 | NA |
A few ways to get the correct answer are shown below:
select
  spend.month,
  sum(coalesce(amt_spend, 0) - coalesce(amt_return, 0)) as right1,
  sum(amt_spend) - sum(amt_return) as right2
from spend
inner join returns on
  coalesce(spend.store_id, 999) = coalesce(returns.store_id, 999) and
  spend.month = returns.month
group by 1
order by 1
month | right1 | right2 |
---|---|---|
1 | 280.7959 | 280.7959 |
2 | 266.9465 | 266.9465 |
3 | 172.5701 | 172.5701 |
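The same fix translates to `pandas`. A hedged, self-contained sketch (again with rounded values; as discussed in the Joining section below, `pandas` already matches NaN join keys, so only the amount columns need a coalesce-style fill):

import numpy as np
import pandas as pd

spend = pd.DataFrame({
    'STORE_ID': [1, 2, np.nan] * 3,
    'MONTH': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'AMT_SPEND': [100.12, 100.31, 100.41, 99.67, 98.40, 98.81, 102.27, 100.21, np.nan],
})
returns = pd.DataFrame({
    'STORE_ID': [1, 2, np.nan] * 3,
    'MONTH': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'AMT_RETURN': [np.nan, 9.97, 10.07, 9.80, 10.25, 9.88, 10.11, 9.95, 9.85],
})

# merge matches the NaN store, so no fill is needed on the join keys;
# the amounts are filled so the row-wise subtraction does not propagate NaN
joined = pd.merge(spend, returns, on = ['STORE_ID', 'MONTH'])
net = (
    joined
    .assign(NET_SPEND = lambda d: d.AMT_SPEND.fillna(0) - d.AMT_RETURN.fillna(0))
    .groupby('MONTH')['NET_SPEND']
    .sum()
)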
Observations
The preceding example hopefully illustrates a few points:
- Nulls can cause issues in the most basic of analyses
- Beyond causing random or marginal errors, null handling changes the questions being answered
- Even within a language, null handling may feel inconsistent (w.r.t. strictness) across different operations
So, with that, let’s compare languages!
Comparison
Below, we compare how R, SQL, and python handle column aggregation, rowwise transformation, joining, and filtering.
Aggregation
SQL, as we saw before, simply ignores nulls in aggregation functions.
select
  sum(x) as sum_x,
  sum(if(x is null, 1, 0)) as n_null_x
from tbl
sum_x | n_null_x |
---|---|
3 | 1 |
Built by and for statisticians, R is scandalized at the very idea of attempting to do math with null values. For aggregation functions, it returns `NA` as a form of protest should any entry of the provided vector be null. (This can be overridden with the `na.rm` parameter.)
x <- c(1,2,NA)
sum(x)

df <- data.frame(x = x)
dplyr::summarize(df, x = sum(x))
[1] NA
x |
---|
NA |
When it comes to python, well, it depends. Base and `numpy` operations act more like R, whereas `pandas` aggregation acts more like SQL.
import pandas as pd
import numpy as np

x = [1,2,np.nan]
y = [3,4,5]
df = pd.DataFrame({'x':x,'y':y})

sum(x)
np.sum(x)
df.agg({'x': ['sum']})
nan
nan
x
sum 3.0
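Continuing with the `x` and `df` defined above, `pandas`' SQL-like default can be overridden much like R's `na.rm` via the `skipna` argument, and `numpy` offers an explicitly NaN-ignoring variant:

df['x'].sum()              # 3.0: NaN silently skipped (the default shown above)
df['x'].sum(skipna=False)  # nan: R-like strictness, on request
np.nansum(x)               # 3.0: numpy's explicit NaN-ignoring counterpart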
Transformation
All of SQL, R, and python return `NA` when `NA`s are used in atomic or row-wise transformations.
In SQL:
select *, x-y as z
from tbl
x | y | z |
---|---|---|
1 | 3 | -2 |
2 | 4 | -2 |
NA | 5 | NA |
In R:
df <- data.frame(x = c(1,2,NA), y = 3:5)
dplyr::mutate(df, z = x-y)

df$z <- with(df, x-y)
df
x | y | z |
---|---|---|
1 | 3 | -2 |
2 | 4 | -2 |
NA | 5 | NA |
x | y | z |
---|---|---|
1 | 3 | -2 |
2 | 4 | -2 |
NA | 5 | NA |
In python:
np.array(x) - np.array(y)
df.assign(z = lambda d: d.x - d.y)
array([-2., -2., nan])
x y z
0 1.0 3 -2.0
1 2.0 4 -2.0
2 NaN 5 NaN
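If, as in the case study, missing values are really meant to be zeros, the usual remedy is to impute before transforming. A minimal `pandas` sketch using the same `df`:

# fill NaN with 0 before the row-wise subtraction so no result is lost to NaN
df.assign(z = lambda d: d.x.fillna(0) - d.y)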
Joining
The situation with joins may feel like the opposite of aggregation. Here, R and python’s most popular data wrangling packages are more permissive than SQL.
As we saw in the case study, SQL does not match on nulls.
Consider `tbl1` and `tbl2` as shown below:
select * from tbl1
A | B | X |
---|---|---|
1 | NA | TRUE |
select * from tbl2
A | B | Y |
---|---|---|
1 | NA | FALSE |
Attempts to join return no results:
select tbl1.*, tbl2.Y
from tbl1
inner join tbl2 on
  tbl1.A = tbl2.A and
  tbl1.B = tbl2.B
A | B | X | y |
---|---|---|---|
In contrast, the default behavior of base R's `merge` and `dplyr` does match on nulls. (Either behavior can be altered with the `incomparables` or `na_matches` arguments, respectively.)
df1 <- data.frame(A = 1, B = NA, X = TRUE)
df2 <- data.frame(A = 1, B = NA, Y = FALSE)

merge(df1, df2, by = c("A", "B"))
dplyr::inner_join(df1, df2, by = c("A", "B"))
A | B | X | Y |
---|---|---|---|
1 | NA | TRUE | FALSE |
A | B | X | Y |
---|---|---|---|
1 | NA | TRUE | FALSE |
Similarly, `pandas` also matches on nulls when joining.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, np.nan, True]], columns = ['A','B','X'])
df2 = pd.DataFrame([[1, np.nan, False]], columns = ['A','B','Y'])
pd.merge(df1, df2, on = ['A','B'])
A B X Y
0 1 NaN True False
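To my knowledge, `pandas`' `merge` has no direct counterpart to `dplyr`'s `na_matches` switch, so if SQL-like strictness is desired, one workaround (a sketch, not the only option) is to drop rows with null join keys before merging:

# dropping null join keys beforehand reproduces SQL's no-match behavior
pd.merge(
    df1.dropna(subset = ['A','B']),
    df2.dropna(subset = ['A','B']),
    on = ['A','B']
)  # returns an empty frame, mirroring the SQL result above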
R and python's behavior here seems the most surprising. One might expect joining to work the same way that raw logical evaluation works. However, neither language "likes" null comparison in its raw form. Instead, the default behavior is intentionally altered in these higher-level joining functions.
In R:
NA == NA
[1] NA
In python:
np.nan == np.nan
False
Filtering
Finally, both SQL and R drop null records used in filtering statements, since comparisons with these values cannot return the TRUE/FALSE value needed to subset the rows. In python, however, pandas preserves nulls in filter conditions.
Using the same `tbl1` shown above, we can also confirm that SQL proactively drops nulls in where clauses, where they cannot be readily compared to non-null values. This seems quite consistent with its behavior in the joining case.
select A, B, X
from tbl1
where B != 1
a | b | x |
---|---|---|
Both the base R and `dplyr` paradigms follow suit here.
df1 <- data.frame(A = 1, B = NA, X = TRUE)
df1[df1$B != 1,]
dplyr::filter(df1, B != 1)
 | A | B | X |
---|---|---|---|
NA | NA | NA | NA |

A | B | X |
---|---|---|
However, bucking the trend, multiple approaches to subsetting `pandas` data will not drop nulls in filtering comparisons.
df1 = pd.DataFrame([[1, np.nan, True]], columns = ['A','B','X'])
df1[df1.B != 1]
df1.query('B != 1')
A B X
0 1 NaN True
A B X
0 1 NaN True
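To avoid leaning on either default, the filter condition can state explicitly what should happen to nulls. A short `pandas` sketch:

# drop the null rows explicitly (mimicking SQL and R here)
df1[df1.B.notna() & (df1.B != 1)]

# or keep them deliberately rather than by accident
df1[df1.B.isna() | (df1.B != 1)]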
Conclusion
In data computation and analysis, the devil is often in the details. It's not breaking news that neglecting low-level reasoning about the handling of null values can jeopardize the resulting analyses. However, as analysts take on increasingly complex tasks and use a plethora of different tools, it's more important than ever for both data producers and consumers to consider the choices they are making in encoding and handling these values across the stack.