Risk of data loss when upgrading to or from version 2.20.3.0 or 2.21.0
| Product | Affected Versions | Related Issues | Fixed In |
| --- | --- | --- | --- |
| YSQL | v2.20.3.0, v2.21.0 | #22057 | Planned for v2.20.3.1, v2.20.4, v2.21.1 |
Description
During a rolling upgrade of a YugabyteDB cluster to or from the v2.20.3.0 or v2.21.0 release, there is a risk of data loss if the cluster has an active YSQL write workload, due to an issue in the row-locking feature.
Mitigation
If you have upgraded to or from v2.20.3.0 or v2.21.0, contact Yugabyte Support for steps to identify which tablets have been affected and how to repair them.
If you have created a new universe on v2.20.3.0 or v2.21.0, follow these steps to ensure the issue does not occur when you upgrade to a different version:

- Manually override the YB-TServer flag `ysql_skip_row_lock_for_update` to false using the JSON flags override page, as follows (see the verification sketch after these steps):

  `{"ysql_skip_row_lock_for_update":"false"}`

- Upgrade the universe to a version with the fix.

- After the upgrade is successful, the YB-TServer flag override for `ysql_skip_row_lock_for_update` can be safely removed.
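After applying the override, you can confirm that every YB-TServer is running with the expected value. The following minimal Python sketch is not part of the official procedure; it assumes the default YB-TServer web UI port 9000, that the `/varz` page lists the current flag values, and a hypothetical list of node addresses. Adjust for your environment.

```python
# Minimal verification sketch. Assumptions (not from this advisory): default web UI
# port 9000, a /varz page listing current flag values, hypothetical node addresses.
from urllib.request import urlopen

TSERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical YB-TServer addresses
FLAG = "ysql_skip_row_lock_for_update"

for host in TSERVERS:
    page = urlopen(f"http://{host}:9000/varz", timeout=5).read().decode("utf-8", errors="replace")
    # Print whatever the flags page reports for this flag so it can be inspected;
    # the expected value during the upgrade is false.
    matches = [line.strip() for line in page.splitlines() if FLAG in line]
    print(host, matches or "flag not found -- verify manually")
```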
Details
v2.20.3 introduced the row-locking feature to address issues arising from concurrent updates. With this enhancement, UPDATE operations now acquire a row-level lock, similar to PostgreSQL, instead of per-column locks. As part of this feature, a subtle change was made to the raft-replicate message handling for the related DocDB write operations.
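To illustrate the change, the following minimal sketch (not taken from this advisory) shows the kind of workload the feature affects: two concurrent UPDATE statements that modify different columns of the same row. The table, columns, connection string, and use of psycopg2 are hypothetical; under per-column locking such statements did not conflict with each other, whereas with the row-level lock they conflict as they would in PostgreSQL.

```python
# Hypothetical illustration: two concurrent single-column UPDATEs on the same row.
# Under the pre-v2.20.3 per-column locking they do not block each other; with the
# row-level lock introduced in v2.20.3, the second UPDATE waits for the first
# transaction to finish, as it would in PostgreSQL.
import threading
import psycopg2

DSN = "host=127.0.0.1 port=5433 dbname=yugabyte user=yugabyte"  # hypothetical YSQL endpoint

def update_column(column, value):
    # Each thread runs one UPDATE in its own transaction and commits on exit.
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(f"UPDATE accounts SET {column} = %s WHERE id = 1", (value,))

t1 = threading.Thread(target=update_column, args=("balance", 100))
t2 = threading.Thread(target=update_column, args=("status", "active"))
t1.start(); t2.start()
t1.join(); t2.join()
```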
During a rolling upgrade, some nodes run the previous version while others run the new version. This can lead to a scenario where the leader node runs the new version (using row locking) while at least one follower node remains on the older version (using per-column locking). In this case, the older follower can receive a raft-replicate message produced by the updated leader's new logic. Nodes on the old version may mishandle writes generated by the new version, potentially resulting in corrupted table data or data loss. If the affected follower later becomes the leader, you may observe missing data for the updated rows.
Additionally, there will be data inconsistency between tablet replicas (new version versus old version) during this transition period. This inconsistency will persist until the affected rows are fully overwritten. Consequently, you may observe either the presence or absence of rows depending on which replica is serving as the leader at any given moment.