In designing applications, there are now more choices in where the structure of the data should reside.
No matter what kind of application you write, at some level you must deal with data. It’s what programs do. But when it comes to organizing and persisting your data, you have some new decisions to make.
For decades, relational databases and filesystems were the go-to options for storing pretty much everything. What we’re seeing now with the rise of NoSQL is the filling-in of the space between. What was once a binary choice is now a continuum.
This article talks about these new options and what they mean for your projects.
Defining the Terms
We can group data storage systems by how much structure we expect from them. We’ll explore these data-organization options by grouping them in this way and looking at them from least structured to most structured. Armed with this knowledge, you should be better able to make informed decisions about database genre and schema design.
Before we dive into the first structure grouping, let’s define some terms. For the sake of this discussion, we’ll assume that the whole system can be divided into two major components: the application and the datastore. At one extreme, both roles may be served by the same software. At the other extreme, your application may live on a smartphone with its data stored in the cloud. Either way, what matters is the functional distinction between what the final system does (the application) and what it stores (the datastore).
Ready? Let’s start with the zero case.
Option 0: Completely Unstructured
The first and most basic data storage option is to put all the logic about your data’s structure into your application. All of it. Under this model, the only thing your application needs for storage is a place to shove a big blob of bytes and get them back out again later.
This is what a filesystem does. In this picture the filesystem plays the role of the application, presenting folders, files, and permissions through its API. Underneath the filesystem is the datastore: the physical disk.
Practically speaking, it’d be overkill most of the time to implement your own filesystem or comparable application functionality, but it bears mentioning as one extreme along the continuum of choices.
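To make the zero case concrete, here is a minimal Python sketch in which the application alone knows how records are laid out and the datastore only ever sees an opaque run of bytes. The record layout (a 4-byte id followed by a length-prefixed name) and the function names are assumptions made up for this illustration.

```python
import struct

# Hypothetical layout, invented for this sketch: each record is a 4-byte
# unsigned id, a 2-byte name length, then the UTF-8 bytes of the name.
HEADER = struct.Struct("<IH")

def encode_records(records):
    """Pack (id, name) pairs into one opaque blob of bytes."""
    blob = bytearray()
    for rec_id, name in records:
        name_bytes = name.encode("utf-8")
        blob += HEADER.pack(rec_id, len(name_bytes))
        blob += name_bytes
    return bytes(blob)

def decode_records(blob):
    """Recover the (id, name) pairs; only the application knows how."""
    records = []
    offset = 0
    while offset < len(blob):
        rec_id, name_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        records.append((rec_id, name))
    return records

# The datastore would receive `blob` and nothing else.
blob = encode_records([(1, "ada"), (2, "grace")])
assert decode_records(blob) == [(1, "ada"), (2, "grace")]
```

Every bit of structure lives in these two functions. If the layout changes, the stored bytes are meaningless to anything that does not carry the matching code.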
Option 1: Minimally Structured
The first baby step up from one big bit bucket is a set of labeled buckets. In other words, a key/value (KV) store.
Some key/value stores offer additional options, but at heart KV stores are basically just persistent maps. You provide a key and a value (both blobs) and the KV store puts them on the shelf for you to use later.
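As a rough sketch of the persistent-map idea, the following uses Python’s standard dbm module, which stores bytes keys and bytes values in a file on disk. The helper names put and get, the key format, and the JSON payload are assumptions for this example; a networked KV store would look different in detail.

```python
import dbm
import os
import tempfile

def put(path, key, value):
    """Store a blob under a key; "c" creates the file if it is missing."""
    with dbm.open(path, "c") as db:
        db[key] = value

def get(path, key):
    """Fetch the blob for a key, or None if the key was never stored."""
    with dbm.open(path, "c") as db:
        return db.get(key)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store")
    # The store neither knows nor cares that the value happens to be JSON.
    put(path, b"user:42", b'{"name": "ada"}')
    stored = get(path, b"user:42")
    missing = get(path, b"user:99")

assert stored == b'{"name": "ada"}'
assert missing is None
```

Note that the store is reopened for each call, so the value really does come back from disk. Interpreting those bytes is still entirely the application’s job.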
Option 2: Advanced Datastructure Support
Your application will usually benefit from offloading some or all of its data-organization responsibility to the datastore. This means you’ll write, test, and maintain less code, giving you more time to focus on the core value your application delivers. Just as you wouldn’t write your own sorting algorithm when a perfectly good library was at hand, you don’t need to resign yourself to working at low levels with your data.
Stepping up the structure ladder from KV stores we find many design archetypes you might choose. These include document datastores, column-oriented databases, and graph databases. Each of these genres offers different viewpoints and tradeoffs with respect to the amount and kind of structure enforced by the system, as well as additional tooling for finding and operating on stored data.
We won’t delve into the details of these options in this brief article, but you should know that this rung of the structure ladder is a broad one, with many choices.
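To give a feel for what this extra structure buys you, here is a toy in-memory sketch in the spirit of a document datastore. It is not the API of any real product; the class TinyDocumentStore and its insert and find methods are hypothetical. The point is only that once the datastore understands fields, it can find records by their contents rather than by key alone.

```python
class TinyDocumentStore:
    """A toy, in-memory stand-in for a document datastore (illustrative only)."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, doc):
        """Store a copy of the document and return its generated id."""
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = dict(doc)
        return doc_id

    def find(self, **criteria):
        """Return every document whose fields match all the given values."""
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]

store = TinyDocumentStore()
# Documents need not share the same fields; no schema is declared up front.
store.insert({"name": "ada", "language": "python"})
store.insert({"name": "grace", "language": "cobol", "rank": "rear admiral"})

pythonistas = store.find(language="python")
assert pythonistas == [{"name": "ada", "language": "python"}]
```

With a plain KV store, the same lookup would mean fetching every value and decoding it in the application.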
Option 3: Rigorous Schema Enforcement
There is one more option still, and that is to put a considerable amount of structure into the database. We’re now basically at the top of the ladder. Under this model, you expect the database to enforce specific rules that you establish in advance.
By forcing this up-front structure on your data, you free up the database to reason fairly intelligently about your data at the system level. This reduces the amount of intelligence you have to implement in client applications.
Relational Database Management Systems (RDBMSs) are the canonical example of this kind of structure responsibility. Here you’d also find certain inverted index implementations that require documents to be typed, to have certain required fields, and so on.
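A small sketch using SQLite through Python’s standard sqlite3 module shows what enforcing rules established in advance can look like. The users table and its constraints are assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The rules are declared up front: a name is required and age cannot be negative.
conn.execute(
    """
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age  INTEGER CHECK (age >= 0)
    )
    """
)
conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("ada", 36))

# Rows that break the rules are refused by the database itself,
# with no validation code in the application.
rejected = []
for row in [(None, 20), ("grace", -1)]:
    try:
        conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
assert count == 1
assert rejected == [(None, 20), ("grace", -1)]
```

Because the database knows the shape of every row, any client that connects gets the same guarantees without reimplementing the checks.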
Implications for Your Project
So what does all this mean for your application? Unfortunately, it means more work. You have more decisions to make. In a nutshell, it’s your duty as a creator of software to understand and decide which of these tiers makes sense for your use case and then find something that fits.
The once simple line between structure imposed by the database and structure maintained by the application is shifting and expanding. On the one hand, it’s now more complex to decide what your stack is going to look like. On the other hand, this plethora of options puts more power in your hands: there are probably better choices for your applications than you might have considered before.
And that, ultimately, means better apps.
Jim R. Wilson started hacking at the age of 13 and never looked back. He began tinkering with NoSQL databases in 2007 and has contributed code to large-scale open source projects such as MediaWiki and HBase. You can learn much more about the various genres of databases and what they offer and how to use them from his book Seven Databases in Seven Weeks.