Postgres Window Functions: Diving Deeper With LAG, LEAD, And CUME_DIST
In my last article, I introduced the basics of window functions in PostgreSQL. We covered functionality and syntax, identifying the critical elements used to write a window function. But we only briefly explored examples, and there’s plenty more to cover.
This article will look at more advanced use cases of window functions, including my favorite window functions lag, lead, and cumulative distribution. Since we covered the fundamentals in my previous article, we’re jumping right into code this time.
CUME_DIST, Or Cumulative Distribution
CUME_DIST returns a percentile ranking for the row (based on a specified column). For example, if I score within the 75th percentile of a test, I score higher than or as well as 75% of test-takers. CUME_DIST returns exactly this kind of metric.
Say we had a table of data that represents how many books a person has read each year. Our table would look like this:
To rank people according to how many books they read in a given year and return their percentile rank, we would use the following CUME_DIST query:
SELECT *,
Cume_dist()
OVER(
ORDER BY total_books ASC) AS percintile
FROM books_read
WHERE year = 2020;
Our output might look something like this:
LAG
What if we wanted to tabulate the variance of total books read between consecutive rows? In plain English, the variance would be how many more books someone read than the person directly behind them in the ranking.
The LAG function returns a value from a previous row, and we can use it to create the variance data point. We will name the column “variance”.
Using our sample data, here’s what the query would look like:
SELECT *,
total_books - Lag(total_books)
OVER(
ORDER BY total_books) AS variance
FROM books_read
WHERE year = 2020;
And here’s what the output might look like:
LEAD
While LAG pulls data from rows behind the current row, LEAD pulls data from rows in front of the current row.
If we were to pretend our data was for a book reading competition, we could use the LEAD window function to calculate a goal — the number of additional books a participant would need to read to advance in the rankings.
For example, if I read ten books last year, and the person ahead of me read 13 books, I would need to read at least four books to leap forward in the rankings. Here’s how to calculate that number using LEAD:
SELECT full_name,
total_books,
( Lead(total_books) OVER(ORDER BY total_books)
+ 1 ) - total_books AS goal
FROM books_read
WHERE year = 2020
And our output might look like this:
Both LAG and LEAD allow you to specify an offset that determines how far backward or forwards to pull data. These two functions are quite helpful and versatile.
WINDOW FRAMES
We’ve covered many window functions in this article and my last article, and we’re ready to wrap up our studies with an exploration of window frames, which work behind the scenes for window functions.
What Are Window Frames?
Window functions use window frames in conjunction with the current row to calculate column values. Consider this, in the cumulative distribution example, we calculated a row’s percentile rank in relation to the entire table of rows. In this context, the window frame used by the window function comprises all rows of data within the table.
The picture below illustrates a current row and the window frame used to calculate its percentile rank in the instance of a cumulative distribution.
A window frame of this scope makes sense when calculating percentiles.
In other instances, though, you may need to adjust the scope of your window frame, defining where its rows start and end. To give a concise demonstration of this, we can use our book data and calculate a new column called “neighborhood total,” which will be the sum of books read for a row and its two neighbors.
Here’s what the query would look like:
SELECT full_name,
total_books,
SUM(total_books)
over(
ORDER BY total_books ROWS BETWEEN 1 preceding AND 1 following) AS
"neighborhood total"
FROM books_read
WHERE year = 2020
And here’s our data:
Pretty cool, right? We now have an idea of the general reading productivity of a specific location in the rankings. As a data analyst, this functionality is fascinating.
Here’s another example of specifying a window frame’s size. Here, we want to calculate the sum of books read for every row after the current row. We will call this column “Reverse Running Total”.
SELECT full_name,
total_books,
SUM(total_books)
over(
ORDER BY total_books RANGE BETWEEN CURRENT ROW AND unbounded
following) AS
"Reverse Running Total"
FROM books_read
WHERE year = 2020
And of course, our output data:
What do you think? Are you as excited about window functions as I am? Though their syntax can be hard to remember, they can be a data analyst’s best friend. Practice them regularly, and they will become second nature.
For additional resources, try reading these articles: Matrics Maven: Window Frames in PostgreSQL and PostgreSL Window Functions: The Ultimate Guide