Testing high-scale TPC-C workloads
TPC-C workloads are defined by the number of warehouses the benchmark run simulates. This section explores how YugabyteDB performs as the number of warehouses is increased.
Get TPC-C binaries
First, download the TPC-C benchmark binaries by running the following commands:
$ wget https://github.com/yugabyte/tpcc/releases/latest/download/tpcc.tar.gz
$ tar -zxvf tpcc.tar.gz
$ cd tpcc
Client machine
The client machine is the machine from which the benchmark is run. An 8-vCPU machine with at least 16 GB of memory is recommended. The following cloud provider instance types are suitable for the client machine.
vCPU | AWS | AZURE | GCP |
---|---|---|---|
8 | c5.2xlarge | Standard_F8s_v2 | n2-highcpu-8 |
Cluster setup
The following cloud provider instance types are recommended for this test.
vCPU | AWS | AZURE | GCP |
---|---|---|---|
2 | m6i.large | Standard_D2s_v3 | n2-standard-2 |
8 | m6i.2xlarge | Standard_D8s_v3 | n2-standard-8 |
16 | m6i.4xlarge | Standard_D16s_v3 | n2-standard-16 |
Set up a local cluster
If a local universe is currently running, destroy it first. Start a local three-node universe with a replication factor (RF) of 3 by creating a single node, as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.1 \
--base_dir=${HOME}/var/node1 \
--cloud_location=aws.us-east-2.us-east-2a
On macOS, the additional nodes need loopback addresses configured, as follows:
sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3
Next, join more nodes with the previous node as needed. yugabyted automatically applies a replication factor of 3 when a third node is added.
Start the second node as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.2 \
--base_dir=${HOME}/var/node2 \
--cloud_location=aws.us-east-2.us-east-2b \
--join=127.0.0.1
Start the third node as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.3 \
--base_dir=${HOME}/var/node3 \
--cloud_location=aws.us-east-2.us-east-2c \
--join=127.0.0.1
After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:
./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone
This command can be executed on any node where you already started YugabyteDB.
To check the status of a running multi-node universe, run the following command:
./bin/yugabyted status --base_dir=${HOME}/var/node1
Store the IP addresses of the nodes in a shell variable for use in further commands.
IPS=127.0.0.1,127.0.0.2,127.0.0.3
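Optionally, before loading any data, you can verify that the universe accepts SQL connections. The following is a minimal check using the ysqlsh shell that ships with YugabyteDB, assuming the default yugabyte database and the default YSQL port 5433:
./bin/ysqlsh -h 127.0.0.1 -p 5433 -c '\conninfo'
If the universe is healthy, this prints the connection details for the yugabyte database on node 127.0.0.1.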
Set up a YugabyteDB Aeon cluster
To set up a cluster, refer to Set up a YugabyteDB Aeon cluster. Store the IP addresses or public address of the cluster in a shell variable for use in further commands.
IPS=<cluster-name/IP>
Initialize the workloads
Before starting the workload, you need to load the data. The following commands create the database and load 10 warehouses. Replace the IP addresses with those of the nodes in the cluster.
$ ./tpccbenchmark --create=true --nodes=${IPS}
$ ./tpccbenchmark --load=true --nodes=${IPS}
Cluster | Loader threads | Loading time | Data set size |
---|---|---|---|
3 nodes, type 2vCPU | 10 | ~13 minutes | ~20 GB |
The loading time for ten warehouses on a cluster with 3 nodes of type 16vCPU is approximately 3 minutes.
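As an optional sanity check after loading, you can count the rows in one of the TPC-C tables. This is a sketch that assumes the standard TPC-C table name warehouse and the local universe described above; adjust the host for your own cluster:
./bin/ysqlsh -h 127.0.0.1 -c 'SELECT COUNT(*) FROM warehouse;'
# The count should match the number of warehouses loaded (10 in this case).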
To load data for 100 warehouses, run the following commands. Replace the IP addresses with those of the nodes in the cluster.
$ ./tpccbenchmark --create=true --nodes=${IPS}
$ ./tpccbenchmark --load=true --nodes=${IPS} --warehouses=100 --loaderthreads 48
Cluster | Loader threads | Loading time | Data set size |
---|---|---|---|
3 nodes, type 16vCPU | 48 | ~20 minutes | ~80 GB |
Tune the --loaderthreads parameter for higher parallelism during the load, based on the number and type of nodes in the cluster. The specified value of 48 threads is optimal for a 3-node cluster of type 16vCPU. For larger clusters or machines with more vCPUs, increase this value accordingly. For clusters with a replication factor of 3, a good approximation is to use the total number of cores across all the nodes in the cluster.
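For example, for the 3-node cluster of 16-vCPU machines used here, the approximation gives 3 × 16 = 48 loader threads. The following is a small shell sketch of the same calculation; the variable names are illustrative only:
NODES=3
VCPUS_PER_NODE=16
LOADER_THREADS=$((NODES * VCPUS_PER_NODE))   # 48 for this cluster
./tpccbenchmark --load=true --nodes=${IPS} --warehouses=100 --loaderthreads ${LOADER_THREADS}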
To load data for 1,000 warehouses, run the following commands. Replace the IP addresses with those of the nodes in the cluster.
$ ./tpccbenchmark --create=true --nodes=${IPS}
$ ./tpccbenchmark --load=true --nodes=${IPS} --warehouses=1000 --loaderthreads 48
Cluster | Loader threads | Loading time | Data set size |
---|---|---|---|
3 nodes, type 16vCPU | 48 | ~3.5 hours | ~420 GB |
Tune the --loaderthreads parameter for higher parallelism during the load, based on the number and type of nodes in the cluster. The specified value of 48 threads is optimal for a 3-node cluster of type c5d.4xlarge (16 vCPUs). For larger clusters or machines with more vCPUs, increase this value accordingly. For clusters with a replication factor of 3, a good approximation is to use the total number of cores across all the nodes in the cluster.
Before starting the workload, you need to load the data. In addition, ensure that you have exported a list of the IP addresses of all the nodes involved.
For 10,000 warehouses, you need multiple client machines of type c5.4xlarge to drive the benchmark. In the following example, the load is split across four clients, each handling 2,500 warehouses. With multiple clients, perform the following steps.
First, create the database and the corresponding tables. Execute the following command from one of the clients:
./tpccbenchmark --nodes=$IPS --create=true --vv
After the database and tables are created, load the data from all the clients:
Client | Command |
---|---|
1 | ./tpccbenchmark -c config/workload_all.xml --load=true --nodes=$IPS --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=1 --initial-delay-secs=0 --vv |
2 | ./tpccbenchmark -c config/workload_all.xml --load=true --nodes=$IPS --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=2501 --initial-delay-secs=30 --vv |
3 | ./tpccbenchmark -c config/workload_all.xml --load=true --nodes=$IPS --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=5001 --initial-delay-secs=60 --vv |
4 | ./tpccbenchmark -c config/workload_all.xml --load=true --nodes=$IPS --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=7501 --initial-delay-secs=90 --vv |
Tune the --loaderthreads parameter for higher parallelism during the load, based on the number and type of nodes in the cluster and the number of clients driving the load; the commands above use 16 threads per client. For larger clusters or machines with more vCPUs, increase this value accordingly. For clusters with a replication factor of 3, a good approximation for the combined thread count across all clients is the total number of cores across all the nodes in the cluster.
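Rather than typing each client's command by hand, you can derive the per-client parameters from a client index. The following sketch assumes the same four-client split shown in the preceding table (2,500 warehouses per client, clients staggered by 30 seconds); CLIENT_ID is an illustrative shell variable that you set to 1, 2, 3, or 4 on each client machine, not a tpccbenchmark flag:
CLIENT_ID=1                                    # set to 1, 2, 3, or 4 on each client
WAREHOUSES_PER_CLIENT=2500
START_WAREHOUSE_ID=$(( (CLIENT_ID - 1) * WAREHOUSES_PER_CLIENT + 1 ))
INITIAL_DELAY_SECS=$(( (CLIENT_ID - 1) * 30 )) # stagger the clients by 30 seconds
./tpccbenchmark -c config/workload_all.xml --load=true --nodes=$IPS \
  --warehouses=${WAREHOUSES_PER_CLIENT} --total-warehouses=10000 \
  --loaderthreads=16 --start-warehouse-id=${START_WAREHOUSE_ID} \
  --initial-delay-secs=${INITIAL_DELAY_SECS} --vv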
When the loading is completed, execute the following command to re-enable the foreign key constraints that were disabled to speed up loading:
./tpccbenchmark --nodes=$IPS --enable-foreign-keys=true
Cluster | Loader threads | Loading time | Data set size |
---|---|---|---|
30 nodes, type 16vCPU | 480 | ~5.5 hours | ~4 TB |
Run the benchmark
For 10 warehouses, run the workload against the database as follows:
$ ./tpccbenchmark --execute=true --warmup-time-secs=60 --nodes=${IPS}
For 100 warehouses, run the workload against the database as follows:
$ ./tpccbenchmark --execute=true --warmup-time-secs=60 --nodes=${IPS} --warehouses=100
For 1,000 warehouses, run the workload against the database as follows:
$ ./tpccbenchmark --execute=true --warmup-time-secs=60 --nodes=${IPS} --warehouses=1000
For 10,000 warehouses, run the workload against the database from each client, as follows:
Client | Command |
---|---|
1 | ./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=$IPS --warehouses=2500 --start-warehouse-id=1 --total-warehouses=10000 --initial-delay-secs=0 --warmup-time-secs=900 --num-connections=90 |
2 | ./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=$IPS --warehouses=2500 --start-warehouse-id=2501 --total-warehouses=10000 --initial-delay-secs=30 --warmup-time-secs=870 --num-connections=90 |
3 | ./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=$IPS --warehouses=2500 --start-warehouse-id=5001 --total-warehouses=10000 --initial-delay-secs=60 --warmup-time-secs=840 --num-connections=90 |
4 | ./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=$IPS --warehouses=2500 --start-warehouse-id=7501 --total-warehouses=10000 --initial-delay-secs=90 --warmup-time-secs=810 --num-connections=90 |
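The execute commands follow the same pattern: each client is delayed by an additional 30 seconds, and its warmup time is reduced by the same amount so that all clients finish warming up, and start measuring, at the same time. The following is a sketch under the same assumptions as the loading script above (CLIENT_ID set per client, four clients):
CLIENT_ID=1                                        # set to 1, 2, 3, or 4 on each client
INITIAL_DELAY_SECS=$(( (CLIENT_ID - 1) * 30 ))
WARMUP_TIME_SECS=$(( 900 - INITIAL_DELAY_SECS ))   # delay + warmup = 900 seconds for every client
./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=$IPS \
  --warehouses=2500 --start-warehouse-id=$(( (CLIENT_ID - 1) * 2500 + 1 )) \
  --total-warehouses=10000 --initial-delay-secs=${INITIAL_DELAY_SECS} \
  --warmup-time-secs=${WARMUP_TIME_SECS} --num-connections=90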
Benchmark results
10 warehouses:
Metric | Value |
---|---|
Efficiency | 97.85 |
TPMC | 125.83 |
Average NewOrder Latency (ms) | 19.2 |
Connection Acquisition NewOrder Latency (ms) | 0.72 |
Total YSQL Ops/sec | 39.51 |
100 warehouses:
Metric | Value |
---|---|
Efficiency | 99.31 |
TPMC | 1277.13 |
Average NewOrder Latency (ms) | 10.36 |
Connection Acquisition NewOrder Latency (ms) | 0.27 |
Total YSQL Ops/sec | 379.78 |
1,000 warehouses:
Metric | Value |
---|---|
Efficiency | 100.01 |
TPMC | 12861.13 |
Average NewOrder Latency (ms) | 16.09 |
Connection Acquisition NewOrder Latency (ms) | 0.38 |
Total YSQL Ops/sec | 3769.86 |
When the execution is completed, copy the csv result files from each of the client machines to a single machine and run merge-results to display the merged results.
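How you copy the files depends on your environment. The following is one possible sketch using scp, where results-dir is the destination directory and client2.example.com through client4.example.com are hypothetical hostnames for the other client machines; the location of the csv files on each client is also an assumption, so adjust the paths to wherever your tpccbenchmark runs wrote their output:
mkdir -p results-dir
cp *.csv results-dir/                                      # this client's results
for host in client2.example.com client3.example.com client4.example.com; do
  scp "${host}:~/tpcc/*.csv" results-dir/                  # other clients' results
done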
After copying the csv files to a directory such as results-dir, merge the results as follows:
./tpccbenchmark --merge-results=true --dir=results-dir --warehouses=10000
10,000 warehouses:
Metric | Value |
---|---|
Efficiency | 99.92 |
TPMC | 128499.53 |
Average NewOrder Latency (ms) | 26.4 |
Connection Acquisition NewOrder Latency (ms) | 0.44 |
Total YSQL Ops/sec | 37528.64 |
Largest benchmark
In our testing, YugabyteDB was able to process 1M tpmC with 150,000 warehouses at an efficiency of 99.8% on an RF 3 cluster of 75 machines (48 vCPUs and 96 GB of memory each), with a total data size of 50 TB.
To accomplish this feat, numerous new features had to be implemented and existing ones optimized, including the following:
- The number of RPCs made was reduced to lower CPU usage.
- Fine-grained locking was implemented.
- Index updates were optimized by skipping index updates when columns in the index were not updated.
- An encoded table's primary key was used as the pointer in the index.
- Compactions were run continuously in the background.
- Hot data was stored in the block cache.
- Transaction retries were implemented.
Warehouses | TPMC | Efficiency(%) | Nodes | Connections | New Order Latency | Machine Type (vCPUs) |
---|---|---|---|---|---|---|
150,000 | 1M | 99.30 | 75 | 9000 | 123.33 ms | c5d.12xlarge (48) |