I invite you to upgrade to a paid subscription. Paid subscribers have told me they have appreciated my thoughts & ideas in the past & would like to see more of them in the future. In addition, paid subscribers form their own community of folks investing in improving software design: theirs, their colleagues', & their profession's.

I’ve been augmented coding several database projects, since they are 1) highly complex, 2) highly leveraged if successful, & 3) sitting around in my brain. So far I have a key-value store cooking, an object database backend suitable for Meta’s TAO, and a persistent Smalltalk.

These projects got me thinking: what do we mean, in the abstract, when we say “database”? I always visualized rows and columns, perhaps SQL statements or NoSQL document collections. But what are the abstract properties that make something a database? What's the essence of "database-ness"?

The systems I’ve been working on don’t need all the machinery of a traditional DBMS. Breaking things down to first principles often leads to insights about where and how to innovate. Here’s what I have so far.

Stateful Put & Get API

The most fundamental property of a database is its stateful nature, expressed through basic put and get operations. After a put operation, what comes back from a subsequent get can be different from what came back before. This stateful interface distinguishes databases from pure functions or stateless services.

The beauty of this property is its simplicity. No matter how complex the implementation becomes, this core behavior of "I can put something in and later get it back" remains the defining characteristic of any database-like system.

Constant(-ish) Time Put & Get Operations

Put & get operations scale with the size of the data being changed, not with the total size of the database. Calling it O(1) over-simplifies a bit but makes the point that performance isn’t connected to the total amount of data stored.
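As a rough illustration of the first two properties, here is a minimal sketch that assumes an in-memory Python dict stands in for the storage engine. The class name `TinyStore` is made up for this example and does not come from any of the projects mentioned above.

```python
class TinyStore:
    """Hypothetical in-memory store illustrating stateful put & get."""

    def __init__(self):
        # A hash table gives average-case O(1) lookup and insert.
        self._data = {}

    def put(self, key, value):
        # The cost depends on this key and value, not on how many
        # other entries are already stored.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


store = TinyStore()
before = store.get("answer")  # nothing has been put yet
store.put("answer", 42)
after = store.get("answer")   # same call, different result: that is state
```

The two `get` calls take the same argument and return different results, which is exactly what a pure function cannot do. Each call touches one entry, so its cost tracks the data being changed rather than the total amount stored.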
This scaling property is what allows databases to handle massive amounts of data while maintaining performance. Without it, databases would quickly become unusable as they grow.

O(log n) I/O Operations

In production, I/O is often the bottleneck. The number of I/O operations required for a get or put is O(log n), where n is the number of elements in the database. Reading or writing to a database with a billion records might require only 20-30 disk accesses (the base-2 logarithm of a billion is about 30), not a billion. This logarithmic I/O behavior comes from clever data structures like B-trees and log-structured merge trees that economize on disk access in the face of non-random key activity.

Durability

Once a put has completed, the next get will retrieve that value & not some earlier value, even if the database crashes in between. (This ignores fun facts about concurrency, but Plato wrote single-threaded code so we won’t worry about that right now.)

The first implementation of durability is packing data on disk pages & writing it to persistent storage. However, the complexity of the I/O optimizations mentioned above implies that sometimes the data on persistent pages lags the data that has been put.

Transaction logs improve durability. Rather than guarantee that eventually data that has been put will subsequently be gotten (“getted”?) (& heaven help you if the system crashes before “eventually” comes), data is written to the transaction log & flushed immediately & synchronously. This immediate memorization of data lets the database preserve the fiction of O(1) put performance. It also shortens the window of vulnerability to crashes. Once the call to flush the transaction log returns, the system is free to crash without damaging durability. The next time the system comes up, it will replay the changes from the transaction log before serving data.

Constant(-ish) Startup Time

Again, constant time over-simplifies. The key is that the database begins serving put & get requests quickly regardless of data size.
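The transaction-log idea from the Durability section can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular database does it: the `LoggedStore` class, the one-JSON-record-per-line log format, and the file name are all hypothetical.

```python
import json
import os
import tempfile


class LoggedStore:
    """Hypothetical key-value store made durable by an append-only log."""

    def __init__(self, log_path):
        self._data = {}
        self._log_path = log_path
        self._replay()  # rebuild state from the log before serving requests
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self._log_path):
            return
        with open(self._log_path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final write from a crash; that put never completed
                self._data[record["key"]] = record["value"]

    def put(self, key, value):
        # Append to the log and force it to stable storage *before*
        # acknowledging the put, so a crash after this point loses nothing.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def close(self):
        self._log.close()


log_path = os.path.join(tempfile.mkdtemp(), "store.log")
first = LoggedStore(log_path)
first.put("color", "blue")
first.put("color", "green")

# Simulate a restart: a fresh instance sees only what reached the log.
second = LoggedStore(log_path)
recovered = second.get("color")

first.close()
second.close()
```

This naive version replays the entire log every time it starts, so its startup time grows with the log. It demonstrates durability only, and it deliberately lacks the startup property described in this section.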
This property distinguishes true databases from simple file-based persistence schemes. If your system has to read and process all data before it can handle the first query, startup slows as data grows.

What enables this property are clever data structures and algorithms: transaction logs, indexes that don't need to be fully loaded, and mechanisms that defer work until needed. Without constant-time startup, databases would be impractical for large datasets in production environments.

WWPT: What Would Plato Think?

When I'm trying to determine if something is truly a "database," I apply a simple test: Could I replace it with a traditional database without changing the semantics of the system? If yes, it's probably a database, even if it doesn't call itself one.

By this definition, many things have database-like qualities: file systems, key-value stores, even some message queues. The boundaries blur, especially as systems become more distributed.

What This Means for Design

Understanding these abstract properties helps me think about where I need a "real" database versus where I can use something lighter. It helps me recognize when I'm inadvertently building a database inside my application (usually a sign I should step back and reconsider).

Most importantly, it helps me see that "database" isn't a binary concept but a set of properties that exist to varying degrees in different systems. Each property solves specific problems, and I only need to pay for the ones my particular situation requires.

These core properties are what databases are truly selling. Everything else is implementation detail.

The Properties: A Summary

To recap, here are the fundamental properties that make something a true database:

1) Stateful put & get API
2) Constant(-ish) time put & get operations
3) O(log n) I/O operations
4) Durability
5) Constant(-ish) startup time
When you see these properties in a system, you're looking at a database, even if it doesn't call itself one.