UNDONE: add back tags/series before publication

Introduction

In the world of concurrency control there are two approaches: pessimistic and optimistic concurrency control. Optimistic concurrency control assumes there are no conflicts and performs validation at the end of the transaction. While pessimistic concurrency control assumes there will be conflicts and detects those conflicts at the time of occurrence.

A benefit of detecting conflicts at the time of occurrence is transactions have flexibility in choosing what to do. The two well known options are waiting and aborting. There is also a third less discussed option, continuing execution. The transaction can speculate that the conflicting transaction will succeed and proceed as if it was committed.

The general approach to implementing pessimistic concurrency control is through locks. If one is unable to acquire the requested lock then there is a conflict. If the lock is granted, there are no conflicts and there will be no future conflicts. In the world of pessimistic concurrency control and speculative execution, there is the established approach of early lock release which I will discuss. And the more recent idea of controlled lock violations. I propose how one might extend traditional multi-version pessimistic snapshot isolation to support controlled lock violations.

Multi-Version Two-Phase Locking Snapshot Isolation

This section is a recap of well established ideas for implementing snapshot isolation with with pessimistic multi-version concurrency control. For the purposes of this post, multi-versioning is implemented as a version list of values that constitute a row. Additionally, it is assumed there is some sort of deadlock prevention/detection component (e.g. locks implemented as wait-die/wound-wait).

Combining multi-versioning with pessimistic concurrency control is very effective for snapshot isolation. Read transactions do not block read-write transactions. Reads work simply by doing a read of the latest value with CommitTs <= StartTs. No concurrency control is needed. Only read-write transactions need concurrency control.¹

For snapshot isolation, we need to detect and prevent write-write conflicts. Write-write conflicts are when another transaction, Txn B, writes to the same row as Txn A, such that StartTs_A < CommitTs_B < CommitTs_A. In plain english, Txn B inserted a row that was not visible to Txn A, so when Txn A tries to write to the row it will have a write-write conflict. In order to implement this with pessimistic concurrency control, we only need exclusive locks. Once Txn A acquires an exclusive lock of the row, it is guaranteed that no other transaction will commit a value with a commit timestamp less than Txn A.

A read-write transaction will perform their reads as of the transaction's start timestamp. As with read transactions, visibility is based on CommitTs <= StartTs. Writes are performed as of the transaction's commit timestamp. There is a detection phase and then a prevention phase for write-write conflicts. First, the transaction determines if any of its writes conflict with already committed writes. It asks, is there a committed value with CommitTs > StartTs?

In Figure 1, the Txn B encountered a value with CommitTs > StartTs so it fails with a write-write conflict.

Figure 1. Write-Write Conflict.

Now, lets look at the happy case where Txn A sees no write-write conflicts. It acquires the row lock and inserts its uncommitted value into the version list. Because it does not have a CommitTs, no reads or read-write transactions have visibility of the uncommitted data (except for Txn A itself).

Figure 2. Row lock acquisition.

Then Txn B comes and wants to perform a write. It can't because Txn A owns the lock, so it waits. It is not a write-write conflict because the value is still uncommitted.

Figure 3. Row lock waiter.

When Txn A commits, all of the waiters now have a write-write conflict so they abort. CommitTs > StartTs for all waiters.

Figure 4. Transaction commit.

If Txn A aborted instead, the first waiter, Txn B, will now own the row lock and can now insert its value into the version list and continue making progress.

Figure 5. Transaction abort.

Using PCC with multi-versioning eliminates the need for transaction private buffers for writes. Uncommitted values are able to be written in-place on the versioned list.

Arvola Chan, Stephen Fox, Wen-Te K. Lin, Anil Nori, and Daniel R. Ries. 1982. The implementation of an integrated concurrency control and recovery scheme. In Proceedings of the 1982 ACM SIGMOD international conference on Management of data (SIGMOD '82). Association for Computing Machinery, New York, NY, USA, 184–191. https://doi.org/10.1145/582353.582386

Strict Two-Phase Locking

To prevent dirty reads and cascading aborts with a simple implementation, strict two-phase locking can be used. Locks are acquired throughout the transaction, the growing phase. Locks are only released after the transaction is durable. The next section goes into the implications of releasing locks before the transaction is durable.

Early Lock Release

As one can surmise, it is beneficial to hold these locks for the shortest time period possible. The longer a lock is held, the longer conflicting transactions are blocked reducing concurrency. At the cost of more complexity, we can optimize how we hold locks after the growing phase to enable higher concurrency without dirty reads. There can be cascading aborts, albeit unlikely.

Because there is still a growing phase, the earliest we can release and still stay consistent with 2PL is after the commit request, no more locks will be acquired. This means locks can be released during the hardening stage. The hardening stage can be the dominant part of a transaction in OLTP workloads so enabling transaction progress during it can greatly increase performance during conflicts.

In figure 6 below, the blue line is the duration of a single node transaction. In strict 2PL the locks are held for the entire duration. With early lock release we only hold it for the green line as described above.

Figure 6. Single Node Lock Release.⁵

Releasing locks early can result in cascading aborts. However, in single node transactions we are doing the early lock release during the hardening stage. Generally, the only way a transaction aborts in this stage is if the database crashes. Therefore, cascading aborts do not matter. Waiting is strictly worse than speculatively proceeding, because if the database crashes the waiters would also abort.

A traditional early lock release implementation releases locks after determining its commit LSN. Assuming LSNs are strictly increasing, this creates a barrier. All transactions, acquiring the lock after the release will have a later commit LSN. Thus it satisfies the following:

B can only commit after A commits.

However, this does not hold for read-only read-write transactions that do not commit anything and thus do not acquire a commit LSN. Imagine Txn A releasing its locks, Txn B returns those rows, and then the database crashes before Txn A is durable. That is a dirty read. This is what MySQL InnoDB implemented for early lock release. They released locks before the fsync. Therefore they cautioned about dirty reads: B can return some of A's data, and then there is a crash before A commits.² However, InnoDB reverted early lock release because even with that known limitation, it did not work with their crash recovery protocol.³

Therefore we need a new rule:

B can operate on A's data inside the DBMS, but it must not return any of A's data to the client until A commits.

This is the rule that prevents dirty reads. This check can be implemented as a tag in the lock that stores the commit LSN of the latest release. In a read-write transaction that only performed reads, it will wait for the commit LSN to become durable.³

However, the high watermark is not efficient if the log supports out of order commits. Imagine the following scenario. Txn B violated one of Txn A's locks. Txn A is committed, but earlier in the log is a large Txn Z that is still committed. Txn B can't commit because it needs to wait for the high water mark to reach Txn A.

Txn Z (slow, committing) | Txn A (committed) | Txn B (waiting for HWM)

With a high watermark implementation, it may be faster for Txn A to not allow the lock violation so Txn B could immediately acquire the lock and commit. For correctness, this also assumes the out of order log will crash if one of the commits fails. Otherwise high watermark does not mean all transactions below are committed.

Additionally, this simple implementation does not support efficient two-phase commits. In two-phase commit there is the prepare phase and then the commit phase. Using high watermark only allows violations during the hardening of the commit phase. We are holding the much longer than necessary (the green line instead of red line).

Figure 7. 2PC Lock Release.⁵

Therefore, we want a third rule to handle 2PC.

B must abort if A aborts.

We are assuming here that 2PC aborts are rare because they can cause cascading aborts on failure. Unlike in the single node transaction, where waiters would also have been aborted (because of crash), waiters on the 2PC abort would still make progress. In the happy case though of 2PC succeeding, a 2PC transaction holds locks for the same amount of time as a 1PC transaction. There is still the overhead of a longer time to commit, but conflicting transactions are able to make progress in parallel.

If 2PC aborts have periods of high occurrence, the DBMS could dynamically switch from early lock release during the prepare phase to early lock release during the commit hardening phase, the green line. This would reduce concurrency but eliminate cascading aborts due to 2PC aborts.

MySQL Early Lock Release.

InnoDB Early Lock Release Bug.

⁴

Kimura, H., Graefe, G., & Kuno, H.A. (2012). Efficient Locking Techniques for Databases on Modern Hardware. ADMS@VLDB.

Controlled Lock Violation

Controlled lock violation is a slight reframing of early lock release. Instead of releasing the lock, the lock is still held and instead allows violations. One of the key benefits of controlled lock violation there is full visibility into the types of locks being violated. The paper focuses on the benefits this information has when using different lock types, hierarchical locking, and intention locks. There is a heavy focus on B-trees, and the paper's implementation is a boolean allow violation flag on a non-versioned B-tree.⁵

For our purposes, we are exploring how one can use the controlled lock violation ideas in the narrow context of multi-versioning and snapshot isolation. The benefits of having lock type information are unused because we only have a single lock type, exclusive locks.

An interesting observation is multi-versioned snapshot isolation as I talk about in this post does not really have the ability to do early lock release in the first place. The locks are embedded in the version list (through uncommitted values) and there is only a single type of lock, so it is impossible to "release" if a write is performed. The version list keeps the lock history.

⁵

Goetz Graefe, Mark Lillibridge, Harumi Kuno, Joseph Tucek, and Alistair Veitch. 2013. Controlled lock violation. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). Association for Computing Machinery, New York, NY, USA, 85–96. https://doi.org/10.1145/2463676.2465325

Multi-Version Lock Violations

To support controlled lock violations, our version list needs to support the following:

Multiple uncommitted values that are dependent on each other.
Timestamps to determine uncommitted value visibility.

To do this, our versioned values include a new field, ViolationTs. This is the timestamp at which the locked uncommitted value allows violations. Violations are allowed after the execution phase when the commit request is sent, which keeps 2PL intact.

Read-only transactions are unchanged from the original design. They still determine visibility using CommitTs: the latest value with CommitTs <= StartTs. This is what ensures read-only transactions do not read uncommitted data and incur commit dependencies.

Read-write transactions use ViolationTs for visibility: the latest value with ViolationTs <= StartTs. Write-write conflicts are now defined using StartTs_A < ViolationTs_B < ViolationTs_A. This is the reduced time period of potential write-write conflicts that we want. It works because when ViolationTs_B <= StartTs_A, A has visibility of B. Using ViolationTs instead of CommitTs means B is possibly uncommitted, which is what the commit dependencies track. Locks guarantee there is only a single possible future that is being speculated and thus is visible.

Write Locking

Lets run through some scenarios to see how writes work in practice. The initial state, seen in figure 8, has two committed values, and an uncommitted chain with three values. As stated above, each versioned value has a ViolationTs and a CommitTs (invalid values if not committed/allowing violations yet). Txn A currently owns the exclusive lock on V2 (it was the transaction that inserted V3). Txn H StartTs < V3 ViolationTs and Txn J StartTs < V3 ViolationTs so they can only read V2. Therefore, they wait on V2 to be released instead. V5 which owns Txn C has not been released yet.

Figure 8. Initial state with controlled lock violation.

Figure 9 shows the outcome of Txn A committing. Similar to strict 2PL snapshot isolation, all the waiters abort with a write-write conflict. But, because we speculatively made progress on the future based on Txn A, we are able to continue with the uncommitted list. That is the key component of controlled lock violations. Additionally, V5 started allowing violations.

Figure 9. Transaction commit.

Next up GC was performed removing some of the lower committed versions to make the graphic easier for the reader. You can see that Txn B committed V4. Txn D (StartTs = 7) acquired the lock on V5 to insert V6. Notice that the read timestamp is less than V4 CommitTs=9. This is why using ViolationTs instead of CommitTs for visibility choices is important. The commit timestamp is greater than the timestamps in the uncommitted chain. But our read-write transactions need to have visibility into the uncommitted chain.

Figure 10. Transaction commit.

Finally a demonstration of a cascading abort. Txn C aborts which also causes Txn D to abort. Because Txn K was waiting on V4 which Txn C originally owned, it was able to acquire the exclusive lock and insert its own V5.

Figure 11. Cascading abort.

As you may have noticed, we kept the property that the uncommitted workspace is still the final version list. Writes do not need to operate on a private transaction buffer as there is only one potential future at a time. The single potential future is a key aspect of our speculative execution. There is only one potential future so transactions are able to easily perform snapshot reads of that potential future.

Waiter Behavior

The premise of lock violations is that it is extremely rare for transactions to abort once violations are allowed. Thus, waiters might as well produce the write-write conflict at violation time instead of commit time. By producing the write-write conflict as early as possible, it will force the transaction/user to retry faster. The retry will obtain a later read timestamp and will have access to the uncommitted version to make progress. Note that this approach is not illustrated in the scenarios above (write-write conflicts were detected at commit time).

Violation Atomicity

For correctness, allowing violations does not need to be performed for all writes at the same time. This is because a transaction will end up waiting for the outcome if the lock does not allow violations. However in practice allowing violations should be performed atomically. Take the scenario in figure 12 that uses multiple violation timestamps.

Figure 12. Multiple violation timestamps.

If Txn A commits, Txn B will abort because it is a waiter on Txn A in Version List A. If Txn A aborts, Txn B will abort because it is dependent on Txn A in Version List B. It is a guaranteed write-write conflict. If the violation timestamp was set atomically for all writers this would be avoided.

Visibility of violations does need to be atomic though, just as commits. Otherwise Txn B would be a waiter on Txn A's V2 even though the ViolationTs <= StartTs.

Violation Timestamps

You may have noticed that violation timestamps do not interact with commit timestamps. Transactions that read using violation timestamps do not care about commit timestamps, and vis versa. Therefore, to reduce contention on the timestamp generator, the clocks can be separated. One clock for violations and another for commits. Read-write transactions get their read timestamp from the violation clock. Read transactions get their read timestamp from the commit clock. With this separation, commits and violations don't contend when incrementing.

Commit Dependencies

A handy benefit of the version list is commit dependencies are embedded in it. Transactions do not need to keep track of write dependents as during lock release they will visible in the version list.

However, the write set is not necessarily the same as the read set of a transaction. We also need to track commit dependencies for read rows. This is because we may be reading uncommitted data. So the transaction must wait for all commit dependencies to resolve before making its own commit decision.

We can implement this using the approach from High-Performance Concurrency Control Mechanisms for Main-Memory Databases.⁶

Each transaction, T, will have three new fields:

ReadCommitDepSet: A set of all the read transaction ids that depend on T.
CommitDepCounter: A counter of all (read + write) transactions that T depends on.
AbortNow: A boolean flag that other transactions can set to tell T to abort.

When T acquires an exclusive lock through controlled lock violation, it increments its own CommitDepCounter. When T performs a controlled lock violation for a read on another transaction, it increments its own CommitDepCounter, and adds itself to the other transaction's ReadCommitDepSet. It only does this for the value that is actually read (it may have iterated through X uncommitted values before it).

During a commit/abort, T will iterate through its ReadCommitDepSet and decrement the transaction's CommitDepCounter. During lock release, it will also decrement the next value in the version's list CommitDepCounter. If it was an abort, it additionally will set the AbortNow flag on all the transactions.

In the following scenario, Txn A has not performed any lock violations so CommitDepCounter = 0. Txn B violated the lock and read the V2 from Txn B, so it added itself to the ReadCommitDepSet and incremented its own CommitDepCounter.

Figure 13. Read generating a commit dependency.

If the database does not support two-phase commit and does not use an out of order log, it can be simpler to implement the read dependencies using high watermark. There would still need to be the CommitDepCounter and AbortNow for write dependencies unless those also participated in the use of high watermarks.

⁶

Per-Åke Larson, Spyros Blanas, Cristian Diaconu, Craig Freedman, Jignesh M. Patel, and Mike Zwilling. 2011. High-performance concurrency control mechanisms for main-memory databases. Proc. VLDB Endow. 5, 4 (December 2011), 298–309. https://doi.org/10.14778/2095686.2095689

Write Row Return

When returning rows from a single-statement write, e.g. using PostgreSQL's Returning⁷ or Microsoft SQL Server's Output⁸ clauses, the behavior depends on the guarantees you want to provide.

If the results are of committed writes only, the inserted rows should return after the commit is durable as expected. If the behavior is similar to MSSQL Server,

An UPDATE, INSERT, or DELETE statement that has an OUTPUT clause will return rows to the client even if the statement encounters errors and is rolled back. The result shouldn't be used if any error occurs when you run the statement.⁸

then it is possible to stream the results back immediately for single-statement transactions. This is because the rows will be guaranteed to be committed if the statement is successful, because implicit commit will resolve the commit dependencies.

⁷

PostgreSQL Returning Clause

⁸

Microsoft SQL Server Output Clause

For Update

For update queries need to acquire the exclusive lock without installing a new value, an exclusive read lock. This means it needs to be a read-write query. My understanding of the semantics of a for update is exclusive access for the duration of the execution phase of the transaction. As long as the semantics are not "until the transaction is durable", the transaction can release the lock early. Unlike write queries, no new value is inserted so when the transaction determines its violation timestamp, the transaction will perform a true early lock release of its for update locks. No ViolationTs needed and no causing of write-write conflicts.

Unlike write row return above, returning rows from a for update can incur commit dependencies that we need to resolve. Because there are no special semantics about returning rows to the client in a for update query, we need to prevent dirty reads. See the section below on preventing dirty reads for handling commit dependencies when returning rows to the client.

I have not put much thought into shared for update queries. They require a shared lock which changes concurrency control significantly.

Two Phase Commit

It is important that when resolving read-only sites in two-phase commit, that commit dependencies are resolved. The transaction may have read some uncommitted data from the read only site, and made decisions on it or inserted it into a different site. They are dirty reads. Therefore, the prepare on the read-only site must fail if any of the commit dependencies fail.

Multi-Statement

With multiple statements, even if the first query is a read, subsequent queries may write. Thus multi-statement transactions have to use read-write semantics for all queries, even read-only queries. This is because each transaction type has its own visibility logic. Switching between the two for read/write statements would mean the transaction is seeing two different snapshots.

Preventing Dirty Reads

This means read statements in a multi-statement may have commit dependencies from reading dirty rows. The DBMS can operate internally on these rows without waiting for known outcomes. However, commit dependencies must be resolved for rows returned to the client to prevent dirty reads. The transaction needs to wait for the uncommitted row to become a committed row.

The wait for resolution of commit dependencies should be lazily performed. Query execution is likely to read more rows than actually returned to the user. Some form of row tainting may be a solution here. The transaction could taint the row with a txn id if it is uncommitted so when at the final stage of returning the row, a check/wait can be performed. If the query engine passes through the versioned value, then it already has access to the transaction to check/wait on.

An illustration of this is in figure 14. The query engine is processing a query associated with Txn B. It wants to return V2 to the client. But, V2 is associated with Txn A which is not yet committed so the query must wait. The query engine will block on Txn A until it is committed and its rows can be forwarded to the client.

Figure 14. Returning a row that has a commit dependency.

Of course if Txn A instead aborted, then Txn B would also abort.

A nice optimization is when returning rows that the transaction wrote, one does not need to wait for commit dependencies. This is because the client is expecting to receive uncommitted data. This also applies to the write row return statement, because the write statement is not expected to commit. The rows returned will always be uncommitted. So regardless of the guarantees single-statement write row return provides, in the multi-statement no waiting is necessary.

New Multi-Statement Types

There are two new types of multi statement transactions that can be useful with controlled lock violation.

Read only multi-statement transactions

Because read queries have higher overhead in read-write transactions due to commit dependencies, having read-only multi-statement transactions would allow for multiple read queries as of the same timestamp with zero overhead. They would use the CommitTs visibility rule for all the queries which prevents commit dependencies. As the name implies, attempting to do a write query in a read-only multi statement would produce an error.

Dirty read multi-statement transactions

Sometimes a multi-statement transaction that does reads and writes does not care about dirty reads. Imagine a transaction that reads some rows, does some processing outside of the database on them (e.g. building a full-text index or an embedding), then inserts the processed data back into the database. The client does not care that the data may be uncommitted because the data is only being used in the context of the transaction. The transaction commit will fail if one of the commit dependencies aborted. It can be thought of as extending the internal DBMS' use of dirty reads to the user transaction. These types of transactions should be able to opt out of the wait for commit overhead.

Multi-Version Lock Lists

While the focus of this post was on a multi-version list which embeds the locks, what if the locks were in a lock manager and not embedded? If you generalized the preceding proposal, it can be seen as a multi-version lock list with visibility determined by ViolationTs.

In figure 15 there is a list of locks with Txn A owning the first one in the history. A transaction is able to violate if StartTs >= ViolationTs. Txn D has not allowed violations yet, so Txn K must wait.

Figure 15. A multi-version lock list.

Similar to how we multi-version data, I am proposing we extend that to our locks. For this to work, lock visibility and data visibility are tied together. This generalized idea of multi-versioning locking is something I propose as food for thought. It is a very undeveloped idea and has a bunch of unanswered questions, but it seems like a possible approach to implementing controlled lock violations for other lock types, e.g. hierarchical locks.

Wrapping Up

In the preceding sections I gave a crash course on multi-version two-phase locking snapshot isolation, went into the background of early lock release and what is required for proper speculative execution, gave an overview of controlled lock violations, and then introduced how one might implement it for multi-version snapshot isolation.

I approached the topic from a very narrow point of view, by investigating exclusive locks. A real database might will have more lock types, secondary indexes, etc. Also, I only covered snapshot isolation. Serializable isolation is desirable which requires read locks and uses different visibility logic. How these would work with controlled lock violations and multi-versioning is a topic for another time.

And with that this post is finished. I hope you learned something or at least got some good pondering out of it. My DMs and social accounts are always open if you have your own thoughts on the topic!