Understanding B-Tree Indexing In Five Minutes

A No-Jargon Primer On B-Tree Database Indexing

Jon McEwen
4 min readJun 3, 2022
Photo by Shubham Dhage on Unsplash

B-Trees are a way of organizing data in a database for faster retrieval. We organize physical objects for faster retrieval all of the time. For example, imagine you had a lot of shoes. And for ease of access, you store them in bins according to their color.

When you want to wear your black high-top Vans, you open the bin labeled for black shoes. Storing shoes by color prevents you from having to sift through every pair.

B-Trees accomplish something similar.

B-trees store data in nodes, not bins, and categorize data by index value, not color. Each Node contains a specific range of index values.

For example, if my application is looking to retrieve a record whose index value is 130, and I have a node whose range is between 100 and150, I know the node holds the record. Therefore, I know to search that node and pluck out the record I need.

Nodes have index ranges. Records have index values. If we know both, we know where to look. That’s the first key to the puzzle.

The second key is understanding the rules by which nodes in B-Trees reference other nodes, and how these rules create the overall structure of a B-Tree.

For example, a node can point to a sub-node that must contain index values that are either:

  1. Less than the first value in the parent node.
  2. Between two values in the parent node.
  3. Or greater than the last value in a parent node.

Here’s a visual.

B-Tree Example

In a B-Tree, the green dots above represent references to sub-nodes. After looking at the B-Tree, you’ll see that the parent node’s values are either the starting point, ending point, or both, of the sub-nodes below.

  • Dot #1 points to a node whose values are all less than 5 (the end point).
  • Dot #2 points to a node whose values are between 5 and 7 (the start and end points).
  • And dot #3 points to a node whose values are greater than 7 (the start point).

It’s this logical structure that allows B-Trees to improve retrieval time. The structure accomplishes this by reducing the number of index values that need to be scanned before finding the target. Consider the steps to find index value 16 in our B-Tree:

  1. Check the root node for 16 (the root node is always the entry point)
  2. Determine the greatest value in the node, which is 7, is less than 16 (total scans 2)
  3. Follow the reference (green dot #3) to its sub-node knowing all the values in this sub-node are greater than 7
  4. Find index value 16 after scanning index values (total scans 3).

We found 16 with 5 scans total.

If the B-Tree structure didn’t break up index values into nodes and include references to sub-nodes, we would need to scan every index value sequentially.

Here’s what that would look like:

Simple List of Index Values

That gives us a total of 9 scans to find 16.

In order for a B-Tree to assist a search in this manner, its nodes must be evenly balanced. A B-Tree that isn’t balanced will require unnecessary scans. Here’s an unbalanced B-Tree.

Unbalanced B-Tree

In total, 8 scans are necessary to find 16. To prevent a B-Tree from becoming unbalanced, each time a value is inserted, the B-Tree structure is reshuffled so that it is balanced.

You can visualize how the reshuffling occurs using this website.

A B-Tree is just a digital landscape with a single entry point with signposts to specific index ranges along the path. Follow the signs, and you’ll find the index. Once you find the index — you’ve found your black high-top Vans.

Photo by Dan Edge on Unsplash

— — —

If you enjoyed this article, you can follow me on Medium for similar content.

--

--