Designing secondary indexes in YSQL
The primary goal of an index is to enhance the performance of data retrieval operations on the data in the tables. Indexes are designed to quickly locate data without having to search every row in a table, and to provide fast access for patterns other than those of the primary key. In YugabyteDB, indexes are internally designed just like tables and operate as such. The main difference between a table and an index is that the primary key of a table must be unique, whereas the key of an index need not be.
To illustrate secondary indexes, first create a sample census schema.
Set up a local cluster
If a local universe is currently running, first destroy it.
Start a local three-node universe with a replication factor (RF) of 3 by first creating a single node, as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.1 \
--base_dir=${HOME}/var/node1 \
--cloud_location=aws.us-east-2.us-east-2a
On macOS, the additional nodes need loopback addresses configured, as follows:
sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3
Next, join more nodes with the previous node as needed. yugabyted automatically applies a replication factor of 3 when a third node is added.
Start the second node as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.2 \
--base_dir=${HOME}/var/node2 \
--cloud_location=aws.us-east-2.us-east-2b \
--join=127.0.0.1
Start the third node as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.3 \
--base_dir=${HOME}/var/node3 \
--cloud_location=aws.us-east-2.us-east-2c \
--join=127.0.0.1
After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:
./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone
This command can be executed on any node where you already started YugabyteDB.
To check the status of a running multi-node universe, run the following command:
./bin/yugabyted status --base_dir=${HOME}/var/node1
Setup
To set up a universe, refer to Set up a YugabyteDB Anywhere universe.
Create a census table as follows:
CREATE TABLE census(
id int,
name varchar(255),
age int,
zipcode int,
employed boolean,
PRIMARY KEY(id ASC)
);
Add some data to the table as follows:
INSERT INTO public.census ( id,name,age,zipcode,employed ) VALUES
(1,'Zachary',55,94085,True), (2,'James',56,94085,False), (3,'Kimberly',50,94084,False),
(4,'Edward',56,94085,True), (5,'Barry',56,94084,False), (6,'Tyler',45,94084,False),
(7,'Nancy',47,94085,False), (8,'Sarah',52,94084,True), (9,'Nancy',59,94084,False),
(10,'Diane',51,94083,False), (11,'Ashley',42,94083,False), (12,'Jacqueline',58,94085,False),
(13,'Benjamin',49,94084,False), (14,'James',48,94083,False), (15,'Ann',43,94083,False),
(16,'Aimee',47,94085,True), (17,'Michael',49,94085,False), (18,'Rebecca',40,94085,False),
(19,'Kevin',45,94085,True), (20,'James',45,94084,False), (21,'Sandra',60,94085,False),
(22,'Kathleen',40,94085,True), (23,'William',42,94084,False), (24,'James',42,94083,False),
(25,'Tyler',50,94085,False), (26,'James',49,94085,True), (27,'Kathleen',55,94083,True),
(28,'Zachary',55,94083,True), (29,'Rebecca',41,94085,True), (30,'Jacqueline',49,94085,False),
(31,'Diane',48,94083,False), (32,'Sarah',53,94085,True), (33,'Rebecca',55,94083,True),
(34,'William',47,94085,False), (35,'William',60,94085,True), (36,'Sarah',53,94085,False),
(37,'Ashley',47,94084,True), (38,'Ashley',54,94084,False), (39,'Benjamin',42,94083,False),
(40,'Tyler',47,94085,True), (41,'Michael',42,94084,False), (42,'Diane',50,94084,False),
(43,'Nancy',51,94085,False), (44,'Rebecca',56,94085,False), (45,'Tyler',41,94085,True);
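To verify that the data is loaded, you can count the rows; with the data above, this should return 45:
SELECT COUNT(*) FROM census;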
Basic index
Suppose you need to look up the data based on the zip codes of the people in the census. You can fetch details with a query similar to the following:
select id from census where zipcode=94085;
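To see how the rows are fetched, you can run the query with EXPLAIN. With no index on zipcode, the resulting plan contains a Seq Scan node, showing that every row in the table is read and filtered:
explain (analyze, dist, costs off) select id from census where zipcode=94085;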
The sequential scan is needed because the primary key of the table is id, and there is no structure for locating rows by zip code directly. To avoid the full scan, create an index on zipcode so that the executor can quickly fetch the matching rows by looking at the index.
create index idx_zip on census(zipcode ASC);
Now, run a query to get all the people in zip code 94085 as follows:
explain (analyze, dist, costs off) select id from census where zipcode=94085;
You will see output like the following:
                                     QUERY PLAN
 -------------------------------------------------------------------------------------
 Index Scan using idx_zip on public.census (actual time=3.273..3.295 rows=23 loops=1)
Output: id
Index Cond: (census.zipcode = 94085)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.401 ms
Storage Table Rows Scanned: 23
Storage Index Read Requests: 1
Storage Index Read Execution Time: 1.529 ms
Storage Index Rows Scanned: 23
...
The same 23 rows were fetched from the table, but much faster. This is because the planner uses the index to execute the query.
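Because idx_zip stores its entries in ascending order of zipcode, it can serve range predicates as well as equality conditions. For example, a query such as the following can also be satisfied by scanning a contiguous portion of the index:
select id from census where zipcode >= 94084 and zipcode <= 94085;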
Covering index
In the previous example, the executor first looked up the matching rows in the index, and then fetched the remaining columns for those rows from the table. This additional round trip to the table is needed because the queried columns are not present in the index. To avoid it, you can store the needed column in the index as follows:
create index idx_zip2 on census(zipcode ASC) include(id);
Now, run the same query to get all the people in zip code 94085:
explain (analyze, dist, costs off) select id from census where zipcode=94085;
You will see output like the following:
QUERY PLAN
-------------------------------------------------------------------------------------
Index Only Scan using idx_zip2 on census (actual time=1.930..1.942 rows=23 loops=1)
Index Cond: (zipcode = 94085)
Storage Index Read Requests: 1
Storage Index Read Execution Time: 1.042 ms
Storage Index Rows Scanned: 23
...
This is an index-only scan, which means that all the data required by the query has been fetched from the index. This is also why there was no entry for Table Read Requests.
When an index contains all the columns of the table, it is referred to as a duplicate index. Duplicate indexes can be used in multi-region deployments to reduce read latencies.
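For example, the following sketch creates a duplicate index of the census table (idx_census_dup is a hypothetical name used for illustration; the index key matches the table's primary key, and all other columns are included):
create index idx_census_dup on census(id ASC) include(name, age, zipcode, employed);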
Listing indexes
You can list the indexes associated with a table using the following methods.
\d+ meta command
The \d+ <table> meta command lists the indexes associated with a table along with the schema details.
\d+ census
The indexes are listed at the end of the output as follows:
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
----------+------------------------+-----------+----------+---------+----------+--------------+-------------
id | integer | | not null | | plain | |
name | character varying(255) | | | | extended | |
age | integer | | | | plain | |
zipcode | integer | | | | plain | |
employed | boolean | | | | plain | |
Indexes:
"census_pkey" PRIMARY KEY, lsm (id ASC)
"idx_zip" lsm (zipcode ASC)
pg_indexes view
You can also fetch more information about indexes using the pg_indexes view.
SELECT * FROM pg_indexes WHERE tablename = 'census';
This returns output similar to the following:
schemaname | tablename | indexname | tablespace | indexdef
------------+-----------+-------------+------------+---------------------------------------------------------------------
public | census | census_pkey | null | CREATE UNIQUE INDEX census_pkey ON public.census USING lsm (id ASC)
public | census | idx_zip | null | CREATE INDEX idx_zip ON public.census USING lsm (zipcode ASC)
Index usage
It's a good idea to keep track of how your applications use indexes so that you can evaluate and improve them, and drop indexes that are not used. To get the usage statistics for the indexes of a table, execute the following command:
SELECT * FROM pg_stat_user_indexes WHERE relname = 'census';
This should return output similar to the following:
relid | indexrelid | schemaname | relname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
-------+------------+------------+---------+--------------+----------+--------------+---------------
17227 | 17230 | public | census | census_pkey | 2 | 12 | 0
17227 | 17237 | public | census | idx_zip | 2 | 24 | 0
These statistics show how many times each index was scanned (idx_scan) and how many tuples were read from it (idx_tup_read).
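For example, if idx_zip shows no new scans after the covering index idx_zip2 is added, you can drop it to avoid the storage and write overhead of maintaining both indexes:
DROP INDEX idx_zip;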
Conclusion
While primary keys are essential to ensure data uniqueness and facilitate efficient data distribution, secondary indexes provide the flexibility needed to optimize queries based on non-primary key columns. Using secondary indexes, applications can boost performance and provide a robust and scalable solution for managing large-scale, distributed datasets.