The basics to database indexing

A database index is a data structure used to improve the speed of data retrieval operations in a database. It works by organizing the data in the database into a separate structure that can be accessed more efficiently than searching through the entire database.

Indexes are typically created on one or more columns in a table and store a copy of the values of those columns in a separate structure. This allows the database to quickly locate the rows that match a particular query condition, without having to scan the entire table.

When a query is executed that includes the indexed column, the database can use the index to quickly locate the matching rows and return the results. This can significantly improve the performance of queries, especially when working with large datasets.

There are several types of indexes, in this post we cover the btree.

The B-TREE structure

B-trees are a type of self-balancing tree data structure used for organizing data in a database. They are designed to optimize data access and retrieval by minimizing the number of disk I/O operations required to access data.

A B-tree consists of a root node, internal nodes, and leaf nodes. The root node is the topmost node of the tree, while the internal nodes and leaf nodes are below it. Each node in the tree has a fixed number of keys and pointers to child nodes.

The keys in the nodes are arranged in ascending order and are used to partition the data into smaller subsets. This makes it possible to quickly locate the desired subset of data when searching for a particular key.

B-trees are called "balanced" because they maintain a roughly equal number of keys in each node, which ensures that the height of the tree remains logarithmic with respect to the number of keys in the tree. This means that the number of disk I/O operations required to access any data item in the tree is proportional to the logarithm of the total number of data items.

B-trees are widely used in databases and file systems to efficiently store and access large amounts of data. They are particularly well-suited for applications that require high-speed data access and retrieval, such as search engines and large-scale data processing systems.

We will work on a countries table

CREATE TABLE countries (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  currency_code VARCHAR(10),
  capital_name VARCHAR(255),
  population INT,
  created_at DATE NOT NULL
);

The first index

CREATE INDEX (name) ON countries;

SELECT * FROM countries WHERE name = "Scotland";

This will query the results based on the name index.

Beware of the wildcard search

SELECT * FROM countries WHERE name LIKE "%cotlan%";

This query will not be using the index. For wildcard searches you should create a FULLTEXT index, that is not covered in this post.

Composite indexes

Indexes can also be created on multiple columns, known as composite indexes. These indexes can further improve query performance by allowing the database to quickly locate rows that match complex query conditions involving multiple columns.

Composite indexes works from left to right, no skipping and to the first range.

CREATE INDEX (population, name, created_at) ON countries;

SELECT * FROM countries WHERE population = 2000000 AND name = 'Sweden' AND created_at = '1523-06-06';

This will scan all the rows from the index.

SELECT * FROM countries WHERE population > 2000000 AND name = 'Sweden' AND created_at = '

This will only fetch the population from the index and then scan the table for the rest of the data, since range (><) breaks the index.

SELECT * FROM countries WHERE name = 'Sweden' AND created_at = '1523-06-06';

This will NOT fetch the query from the index since we skipped the left column (name).

Conclusion

Indexes do come with some trade-offs. They take up additional storage space and can slow down write operations, as the database must update both the table and the index whenever a row is inserted, updated or deleted. It's important to carefully consider the balance between query performance and write performance when deciding which columns to index in a database.