Designing Data-Intensive Applications
Martin Kleppmann
O’Reilly, 2017
With few exceptions, any interesting software system (i.e., more than a toy or demonstration) will operate on stored data. After all, if the data is not stored, then it cannot survive the system restarting. And sooner or later, every system is restarted.
Storing data comes with some notable challenges. One is that writing data to a storage system, and reading it back from that same system, is slow. Not slow in an absolute sense, but slow relative to processing speeds. For many systems, storage access speeds are a major contributor (positively or negatively) to overall system performance.
Another challenge is that stored data is persistent, whether it’s right or wrong. If a program performs a calculation incorrectly, the bug can be fixed, the calculation rerun, and the next output will be correct. But fixing the bug won’t change the fact that the first, incorrect output was stored somewhere. A reliable system, then, must in some sense be prepared to correct for all the bad data persisted by every buggy version that came before.
Storing data is hard for many other reasons, but these two challenges alone should make the point. For these and all the other reasons that working with data is hard, I recommend that every architect read Designing Data-Intensive Applications.
Don’t let the phrase "data-intensive" fool you into thinking you won’t find this book useful. Even if your system only stores data locally, you will benefit from the discussion of storage technologies in Part I. Where the book really shines, though, is for those building multi-user, multi-tenant, or multi-region cloud-based systems. In Part II, you’ll find the most extensive, most practical advice I’ve encountered in one place on managing data storage in cloud systems. The point is not that any system will encompass all the knowledge in this book. Rather, use it as a guide to evaluate which design will work best for the system you’re building.
Finally, in Part III, Kleppmann discusses derived data. The data storage systems we design and use for the applications that produce the data we store are often not the best systems for analytical processing of that data. Thus, most cloud systems store their data at least twice: once in the "transactional" data store that applications read and write, and a second time in a "derived" form, in another data store designed for that purpose. Here, again, there are different storage technologies at play and different processing approaches to consider.
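As a rough illustration of the "store it twice" idea, and not an example taken from the book, here is a minimal Python sketch. The `orders` records, their field names, and the `derive_totals` function are all hypothetical, and plain in-memory structures stand in for what would be two separate data stores in a real system.

```python
from collections import defaultdict

# Hypothetical "transactional" store: one record per order, in the shape
# the application reads and writes. A real system would use a database.
orders = [
    {"order_id": 1, "customer": "alice", "amount_cents": 1200},
    {"order_id": 2, "customer": "bob", "amount_cents": 550},
    {"order_id": 3, "customer": "alice", "amount_cents": 800},
]


def derive_totals(rows):
    """Build a derived view: total spend per customer, computed from the source rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["customer"]] += row["amount_cents"]
    return dict(totals)


# Hypothetical "derived" store: shaped for an analytical question
# ("how much has each customer spent?") rather than for recording orders.
totals_by_customer = derive_totals(orders)
```

In this sketch the derived form holds nothing that is not already in the transactional records, so it can be thrown away and recomputed whenever the source data or the derivation logic changes.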
This book is an essential read for any architect. Too much of the time, software architecture is equated with the design of a codebase. As a result, the world has given us a surfeit of books on structuring code. (Indeed, most of the books listed on this site are exactly that: a discourse on how to structure code.) And yet, to practice software architecture, one must attend to code and data. There are few books that tackle designing for data at all; we’re all fortunate that this one is such a good specimen.
© 2025 by Oliver Goldman