May 29th 2014 Modeling States and Transitions in Relational Databases
One common problem in software development is how to model states and the transitions between them in a database—essentially, to persist a finite state machine. The quintessential examples of this problem are version control systems, which save the state of files over time, allowing users to examine the exact state of those files at any given point in their history. We can think of the contents of files at various points in time as states, and the edits that were made to those files as the state transitions.
There are two basic ways to persist this history to disk: saving the transitions, or saving the states. Subversion takes the former approach, saving the transitions as the diffs of files, and using those diffs to calculate the actual contents of the files at given points in time. Git takes the opposite approach, saving snapshots of the file contents at discrete points in time, and using those snapshots to calculate the file diff that led to another snapshot.
But which system is best?
For a while, I worked for a startup that specialized in allowing users to purchase and trade digital cards. (Think baseball card collecting, but online.) Originally, the application modeled trades as a series of trade events. A trade could be in one of four states, which corresponded to the actions users could take on those trades:
- Proposed, for trades that had been offered;
- Accepted, for trades that had been accepted by the recipient;
- Rejected, for trades that had been declined by the recipient;
- Withdrawn, for trades that were canceled by the original user before the recipient could take action on them.
Trades could also be countered by the recipient, although this was not modeled as a fifth state; rather, it was modeled as a trade with the offeror and recipient flipped.
The entire flow can be illustrated with a relatively simple state diagram:
Offering a trade created a trade event (a
Trade object) with a status type of “Offered” in the database; when the recipient of the trade offer did something with the trade (accepted or rejected it, for example) another trade event was written to the database. In other words, each possible state was represented as a separate
Trade object. Here’s a (simplified) diagram of the object model used in this system:1
This system was nice because it created a log of the trade’s events, but due to the way it was modeled, as well as the requirements of the application, querying the database for trade information was very inefficient. Specifically, it required a lot of joins simply to get the final state of a trade, and even getting a specific user’s pending trades—an essential feature of the application—was slow. Querying for a user’s incoming trades (that is, trades that they had been offered but not yet acted on—state P in the diagram above) required a query like the following:2
Getting the items being offered in a trade was also not easy—even for a single trade:
Worst of all, that query had to be done for every incoming trade.
Rewriting the Trading Module
Obviously this system, despite how mathematically beautiful it may have been, was inefficient, and not scalable for the number of users on the site.3 I was tasked with improving this module’s performance. And that led to one of the most embarrassing bits of code I’ve ever written.
I rewrote this code to be mutable, modeling the flow like this:
Many queries became much more efficient (and less complex to write, which is a secondary concern, but still important—if your programmers don’t even understand how to read and modify your queries, you’re in trouble). Getting a particular user’s incoming trades was fairly straightforward:
Unfortunately, as it turns out, getting the items offered in a particular trade was not that much easier:
And that was just the start of the problem. The new way of modeling trades was very inflexible, and the code for managing the state of trades in the database was much more complex and even harder to reason about.
Events would actually mutate fields in the trade model. This made it easy to get the final state of the trade, as well as to retrieve a given user’s incoming and outgoing trades, with relatively simple SQL queries, but it raised a lot of new problems:
The code was horribly complex, because it had to deal with mutating an existing object, which is not as straightforward as simply creating a new one.
It did not scale well and was not future-proof. For example, when new state transitions (types of events) were added, new code had to be written. Eventually, we also decided that we did want to show the history of the trade, so a pseudo-log had to be hacked onto the existing models (that’s the
TradeActionmodel in the diagram above). A similar problem occurred when we decided we wanted to allow users to attach messages to trades.
There was no natural log of actions taken on a trade that led to its final outcome, which made it harder to audit trades and track down bugs, both of which are of utmost importance in a system with paying customers. (A log was hacked onto the models, but at greatly increased code complexity, since the
TradeActionmodels had to be kept in sync.)
It offended my functional programming sensibilities, which lead me to find immutable data structures to be beautiful, and mutable ones ugly.
The new module was not an improvement; in fact, it was a nightmare that caused me no end of headaches.
Rather than trying to capture all the steps that led from one state to another, we can take Git’s approach and snapshot the trade at each step along the way; knowing the final state at each step, we can infer the action that led to that state fairly easily.
Like the original system, each trade object will have a link to another trade, allowing us to walk through the history of the trade (which gives us an audit trail and allows us to show the history to the user, both requirements of the system). Which way should the links go, though? To the next state, or back to the previous one? Just like Git, we’re primarily interested in the final state of the trade, and in the rare case that we need the history, it’s acceptable—even desirable—to walk backwards through it. Also like Git, making links point back in time, to the previous state, means we don’t have to modify old records—we just create new ones.4
With that design decision out the way, we can plan out our database schema:
Although similar to the original implementation, we’ve taken the opportunity to do a bit of database denormalization, which should avoid the performance issues of the original implementation without sacrificing its flexibility. Getting a user’s incoming trades is even easier (and faster):
Retrieving the cards offered to a user still requires a number of joins, but it seems from our exhaustive analysis that this is unavoidable:
A Tangent on Optimization
There’s one interesting model in our database schema: the
TradeOffer, which consists of nothing more than a database ID. It looks like the through table for a many-to-many relation, but you’d expect to see a structure like this:
So why the odd structure? Recall that our system includes the notion of a countered trade, which occurs when a recipient modifies the items offered in a trade and re-proposes it to the original offeror. Instead of having a separate status for a countered trade, we treat a countered trade as a rejected trade, following by a proposed trade. (We can identify countered trades because they are the only trades that have a status of
proposed and a link to a previous trade.) If we had a more traditional many-to-many through table, we’d not only have to modify the
Trade when creating a event, but we’d also have to create an entirely new
TradeOffer object. Using the “weird” structure, we simply have to update the
recipient_offer_id links when creating a new
It’s worth noting that this does not necessarily improve the performance of our database queries, but it does serve to make our theoretical application code a lot cleaner. Optimizations are not always performed for the benefit of database performance, or even code performance; writing cleaner, potentially less buggy code is a benefit of some optimizations as well.
Modeling events and the transitions between them is a frequent problem for software developers. They’re concrete instances of those finite state machine models that were drummed into you during your computational theory classes. And it’s easy to get the persistence layer of those models wrong. But with a bit of knowledge and a lot of requirements gathering (one of any developer’s most important jobs), it’s easy to the model right.
Maybe you’ll even get it right on the first try.
- I’ve simplified the diagram to remove related models that aren’t relevant to this conversation, but the actual implementation was even more complex. ↩
- The system was actually implemented in Django, and so used the Django ORM; I’m also doing these queries from memory, without any way to test them, so forgive me if I make a few mistakes. They should show the general difficulties faced in querying this system, however. ↩
- Not that the site ever had a lot of users, but querying was inefficient even for small n. ↩
- Git commit hashes are calculated from the contents of the commit object, which means that commit objects cannot be modified, or the commit hash would change; that’s one reason that Git commit objects do not store links to children. The comments to this Stack Overflow answer explain more. ↩
- You can think of this as being along the same lines as database denormalization, although I don’t know if it technically qualifies as such. ↩