Postgres Window Functions: Diving Deeper With LAG, LEAD, And CUME_DIST

5 min readApr 22, 2022

In my last article, I introduced the basics of window functions in PostgreSQL. We covered functionality and syntax, identifying the critical elements used to write a window function. But we only briefly explored examples, and there’s plenty more to cover.

This article will look at more advanced use cases of window functions, including my favorite window functions lag, lead, and cumulative distribution. Since we covered the fundamentals in my previous article, we’re jumping right into code this time.

CUME_DIST, Or Cumulative Distribution

CUME_DIST returns a percentile ranking for the row (based on a specified column). For example, if I score within the 75th percentile of a test, I score higher than or as well as 75% of test-takers. CUME_DIST returns exactly this kind of metric.

Say we had a table of data that represents how many books a person has read each year. Our table would look like this:

Here’s our data, unsorted for years 2020 and 2021.

To rank people according to how many books they read in a given year and return their percentile rank, we would use the following CUME_DIST query:

SELECT *,
       Cume_dist()
         OVER(
           ORDER BY total_books ASC) AS percintile
FROM   books_read
WHERE  year = 2020;

Our output might look something like this:

Robert Stoggins is in the 71st percentile of book readers.

LAG

What if we wanted to tabulate the variance of total books read between consecutive rows? In plain English, the variance would be how many more books someone read than the person directly behind them in the ranking.

The LAG function returns a value from a previous row, and we can use it to create the variance data point. We will name the column “variance”.

Using our sample data, here’s what the query would look like:

SELECT *,
       total_books - Lag(total_books)
                       OVER(
                         ORDER BY total_books) AS variance
FROM   books_read
WHERE  year = 2020;

And here’s what the output might look like:

Ashley Bigsby read 6 more books than Hugh Mayer, and Kelley Napples read 4 more books than Ashley.

LEAD

While LAG pulls data from rows behind the current row, LEAD pulls data from rows in front of the current row.

If we were to pretend our data was for a book reading competition, we could use the LEAD window function to calculate a goal — the number of additional books a participant would need to read to advance in the rankings.

For example, if I read ten books last year, and the person ahead of me read 13 books, I would need to read at least four books to leap forward in the rankings. Here’s how to calculate that number using LEAD:

SELECT full_name,
       total_books,
       ( Lead(total_books) OVER(ORDER BY total_books)
         + 1 ) - total_books AS goal
FROM   books_read
WHERE  year = 2020

And our output might look like this:

Jeremy Timbat needs to read 3 additional books to overtake Trisha Mullins for the top spot.

Both LAG and LEAD allow you to specify an offset that determines how far backward or forwards to pull data. These two functions are quite helpful and versatile.

WINDOW FRAMES

We’ve covered many window functions in this article and my last article, and we’re ready to wrap up our studies with an exploration of window frames, which work behind the scenes for window functions.

What Are Window Frames?

Window functions use window frames in conjunction with the current row to calculate column values. Consider this, in the cumulative distribution example, we calculated a row’s percentile rank in relation to the entire table of rows. In this context, the window frame used by the window function comprises all rows of data within the table.

The picture below illustrates a current row and the window frame used to calculate its percentile rank in the instance of a cumulative distribution.

A window frame of this scope makes sense when calculating percentiles.

In other instances, though, you may need to adjust the scope of your window frame, defining where its rows start and end. To give a concise demonstration of this, we can use our book data and calculate a new column called “neighborhood total,” which will be the sum of books read for a row and its two neighbors.

Here’s what the query would look like:

SELECT full_name,
       total_books,
       SUM(total_books)
         over(
           ORDER BY total_books ROWS BETWEEN 1 preceding AND 1 following) AS
       "neighborhood total"
FROM   books_read
WHERE  year = 2020

And here’s our data:

Pretty cool, right? We now have an idea of the general reading productivity of a specific location in the rankings. As a data analyst, this functionality is fascinating.

Here’s another example of specifying a window frame’s size. Here, we want to calculate the sum of books read for every row after the current row. We will call this column “Reverse Running Total”.

SELECT full_name,
       total_books,
       SUM(total_books)
         over(
           ORDER BY total_books RANGE BETWEEN CURRENT ROW AND unbounded
         following) AS
       "Reverse Running Total"
FROM   books_read
WHERE  year = 2020

And of course, our output data:

What do you think? Are you as excited about window functions as I am? Though their syntax can be hard to remember, they can be a data analyst’s best friend. Practice them regularly, and they will become second nature.

For additional resources, try reading these articles: Matrics Maven: Window Frames in PostgreSQL and PostgreSL Window Functions: The Ultimate Guide

Postgres Window Functions: Diving Deeper With LAG, LEAD, And CUME_DIST

CUME_DIST, Or Cumulative Distribution

LAG

LEAD

WINDOW FRAMES

Written by Jon McEwen

No responses yet