Designing optimal primary keys YSQL
The Primary key is a column or a set of columns that uniquely identifies a row, such as a user ID or order number. You should choose the primary key based on the most common access pattern. Columns of data type string, number, serial, or UUID make good choices for primary keys.
Automatically generating the primary key
The best way to uniquely identify a record is to allow the database to assign a unique identifier to the row. YugabyteDB supports multiple schemes for generating identifiers that you can choose based on the needs of your application.
UUID
A UUID is a 128-bit number represented as a string of 36 characters, including hyphens. For example, 4b6aa2ff-53e6-44f5-8bd0-ef9de90a8095
. YugabyteDB natively supports UUID generation as per RFC 4122 via the uuid-ossp extension. UUIDs have several advantages:
- The likelihood of generating duplicate UUIDs is extremely low.
- UUIDs can be independently generated on different nodes in the cluster without any coordination with other systems.
- The randomness of UUIDs makes it hard to predict the next ID, providing an additional layer of security.
You can add a UUID to your schema as follows:
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
name TEXT
);
The DEFAULT clause ensures that for every row inserted, a UUID is automatically generated and inserted along with the row.
Serial
Serial is a special data type in YugabyteDB that creates an auto-incrementing integer column starting with 1
. It is essentially a shorthand for creating a sequence and using it as a default value for a column. You can choose between three types of serial data types depending on the needs of your application:
- SMALLSERIAL - An integer column in the range of 1 to 32,767.
- SERIAL - An integer column in the range of 1 to 2,147,483,647.
- BIGSERIAL - An integer column in the range of 1 to 9,223,372,036,854,775,807.
Serial can be used directly in table definitions to simplify the creation of auto-incrementing columns.
DROP TABLE IF EXISTS users;
CREATE TABLE users (
id serial,
name TEXT,
PRIMARY KEY(id)
);
For each row inserted into the table, an auto-incremented id
value is automatically inserted along with the row.
Sequence
A sequence is a database object that generates a sequence of unique numbers. Sequences are independent objects that can be associated with one or more tables or columns. Sequences offer more flexibility and control over auto-incrementing behavior. They can be created, managed, and used separately from table definitions. Sequences can be customized with different increment values, start values, minimum and maximum values, and cycle behavior.
CREATE SEQUENCE user_id_seq START 100 INCREMENT BY 100 CACHE 10000;
DROP TABLE IF EXISTS users;
CREATE TABLE users (
id INTEGER DEFAULT nextval('user_id_seq'),
name TEXT,
PRIMARY KEY(id)
);
For every row inserted, user IDs are automatically generated as 100, 200,300, and so on.
Existing columns as primary keys
To illustrate how to choose existing columns as primary keys, first create a sample census schema.
Set up a local cluster
If a local universe is currently running, first destroy it.
Start a local three-node universe with an RF of 3
by first creating a single node, as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.1 \
--base_dir=${HOME}/var/node1 \
--cloud_location=aws.us-east-2.us-east-2a
On macOS, the additional nodes need loopback addresses configured, as follows:
sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3
Next, join more nodes with the previous node as needed. yugabyted
automatically applies a replication factor of 3
when a third node is added.
Start the second node as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.2 \
--base_dir=${HOME}/var/node2 \
--cloud_location=aws.us-east-2.us-east-2b \
--join=127.0.0.1
Start the third node as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.3 \
--base_dir=${HOME}/var/node3 \
--cloud_location=aws.us-east-2.us-east-2c \
--join=127.0.0.1
After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:
./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone
This command can be executed on any node where you already started YugabyteDB.
To check the status of a running multi-node universe, run the following command:
./bin/yugabyted status --base_dir=${HOME}/var/node1
Setup
To set up a universe, refer to Set up a YugabyteDB Anywhere universe.Create a census table as follows:
CREATE TABLE census(
id int,
name varchar(255),
age int,
zipcode int,
employed boolean,
PRIMARY KEY(id ASC)
)
Add some data to the table as follows.
INSERT INTO public.census ( id,name,age,zipcode,employed ) VALUES
(1,'Zachary',55,94085,True), (2,'James',56,94085,False), (3,'Kimberly',50,94084,False),
(4,'Edward',56,94085,True), (5,'Barry',56,94084,False), (6,'Tyler',45,94084,False),
(7,'Nancy',47,94085,False), (8,'Sarah',52,94084,True), (9,'Nancy',59,94084,False),
(10,'Diane',51,94083,False), (11,'Ashley',42,94083,False), (12,'Jacqueline',58,94085,False),
(13,'Benjamin',49,94084,False), (14,'James',48,94083,False), (15,'Ann',43,94083,False),
(16,'Aimee',47,94085,True), (17,'Michael',49,94085,False), (18,'Rebecca',40,94085,False),
(19,'Kevin',45,94085,True), (20,'James',45,94084,False), (21,'Sandra',60,94085,False),
(22,'Kathleen',40,94085,True), (23,'William',42,94084,False), (24,'James',42,94083,False),
(25,'Tyler',50,94085,False), (26,'James',49,94085,True), (27,'Kathleen',55,94083,True),
(28,'Zachary',55,94083,True), (29,'Rebecca',41,94085,True), (30,'Jacqueline',49,94085,False),
(31,'Diane',48,94083,False), (32,'Sarah',53,94085,True), (33,'Rebecca',55,94083,True),
(34,'William',47,94085,False), (35,'William',60,94085,True), (36,'Sarah',53,94085,False),
(37,'Ashley',47,94084,True), (38,'Ashley',54,94084,False), (39,'Benjamin',42,94083,False),
(40,'Tyler',47,94085,True), (41,'Michael',42,94084,False), (42,'Diane',50,94084,False),
(43,'Nancy',51,94085,False), (44,'Rebecca',56,94085,False), (45,'Tyler',41,94085,True);
ID as the primary key
In the census
table, the most likely way to look up a person is by their id
, so the primary key has been set to id ASC
. This means that the data is stored in ascending order of ID, ensuring contiguous IDs are mostly located in the same tablet. This works well for point lookups on ID and range scans on IDs. For example, to look up ID 9, you can do the following:
select * from census where id=9;
You will see output similar to the following:
id | name | age | zipcode | employed
----+-------+-----+---------+----------
9 | Nancy | 59 | 94084 | f
One row matching ID 9 was quickly fetched with just one request. You can also do a quick range scan.
select * from census where id>=5 and id<=15;
You will see an output similar to the following:
id | name | age | zipcode | employed
----+------------+-----+---------+----------
5 | Barry | 56 | 94084 | f
6 | Tyler | 45 | 94084 | f
7 | Nancy | 47 | 94085 | f
8 | Sarah | 52 | 94084 | t
9 | Nancy | 59 | 94084 | f
10 | Diane | 51 | 94083 | f
11 | Ashley | 42 | 94083 | f
12 | Jacqueline | 58 | 94085 | f
13 | Benjamin | 49 | 94084 | f
14 | James | 48 | 94083 | f
15 | Ann | 43 | 94083 | f
(11 rows)
11 rows were quickly retrieved as the data is stored sorted on the id
column. So range scans are also fast.
Name as the primary key
Suppose your most common lookup is based on the name. In this case you would make the name
column part of the primary key. Because the name alone may not be unique enough to be the primary key (the primary key has to be unique), you can choose a primary key with both name and ID as follows:
CREATE TABLE census2(
id int,
name varchar(255),
age int,
zipcode int,
employed boolean,
PRIMARY KEY(name ASC, id ASC)
);
-- copy the same data into census2
INSERT INTO census2 SELECT * FROM census;
When specifying the primary key, the name
column is specified first, and id
second. This ensures that the data is stored sorted based on name
first, and for all matching names, the id
is stored sorted in ascending order, ensuring all people with the same name will be stored in the same tablet. This allows you to do a fast lookup on name
even though (name, id)
is the primary key.
Retrieve all the people with the name James as follows:
select * from census2 where name = 'James';
You will see output similar to the following:
id | name | age | zipcode | employed
----+-------+-----+---------+----------
2 | James | 56 | 94085 | f
14 | James | 48 | 94083 | f
20 | James | 45 | 94084 | f
24 | James | 42 | 94083 | f
26 | James | 49 | 94085 | t
(5 rows)
There are 5 people named James, and all of them can be quickly looked up as the data has been sorted on name.
Ordering
The primary key was specified withASC
order. However, if the queries are going to retrieve data in descending order with ORDER BY name DESC
, then it is better to match the same ordering in the primary key definition.