Wednesday, December 4, 2013

The Warehouse and the Shop Floor: Separation of Concerns Based on Data Flow

Today, a cornucopia of NoSQL and Big Data technologies is available to us, each exposing a particular data model and implementing a unique set of features. These different offerings are capable of modeling a diversity of domains and addressing wide-ranging concerns, from scalability to evolvability of the data model. However, when creating a new system or extending an existing one, choosing the right tools for the job can be surprisingly hard. A number of problems arise:

First, technology evangelism encourages a tendency to choose and promote a single technology as a magic bullet, notwithstanding that NoSQL is the antithesis of the “one size fits all” approach traditionally embodied by relational databases. Even when an appropriate tool is initially chosen, it is often subsequently stretched to handle things it cannot do well as the demands of the system change.
Second, traditional software development practices are not well adapted to the requirements of modern data-centric systems. The default approach to domain design is often to build a single monolithic object model that tries to capture in full the richness of the business domain, as well as every concern the system needs to address. In the same context, separation of concerns is interpreted as a requirement to hide anything related to the persistence model behind layers of abstraction.
The common factor in these problems is a too-narrow focus on technology and implementation that disregards the variety and complexity of the concerns data-centric systems must deal with. The typical outcome is applications that turn out to be unfit for purpose: they fail to meet performance expectations and struggle to accommodate new requirements.

Separation of Concerns Based on Data Flow

A better approach to designing data-centric systems is to apply separation of concerns based on data flow. Data consumed and produced by such systems commonly originates from a variety of different sources and varies significantly in terms of structure, size and lifecycle. Equally important, the same dataset can serve a number of different purposes within the same system. In the real world, complex systems are often composed of a number of specialised functional areas that collaborate with each other. Generally, each functional area is focused on one task and does not need to deal with the full range of concerns that the system as a whole needs to address.
To illustrate the point, let’s consider a supermarket. Supermarkets are fairly complex systems that continuously handle streams of input and output to deliver value to end users. In a supermarket, goods usually arrive by truck to be unloaded through the loading dock and stored in the warehouse area. Items are then pulled from the warehouse, put on display in the shop window and arranged in the shopping area in a way that makes it easy for customers to browse and find what they are looking for.
The loading dock and warehouse areas are optimised for a very specific job: accepting and storing inbound goods. It would be completely impractical (and potentially dangerous!) to ask customers to do their shopping directly from the warehouse, just because this happens to be the area where the yet-to-be-unpacked wares are stored. Vice versa, the shop floor is clearly inadequate for logistical activities, but it is there that the supermarket serves its customers and delivers its end value. A supermarket as a whole can be seen as an assembly of specialised models, all of them important to the complete functionality, but each addressing different concerns at different points in space and time.
Practically speaking, separation of concerns based on data flow can be approached as follows:
  1. Identify the set of concerns that the system needs to address: what do I care about? Do I need to store business events in a scalable way? Do I need a connected model to execute flexible queries?
  2. Identify the locality of these concerns: where are the natural boundaries in the system? Where do I need to deal with scalability? Where do I need to address connectedness in the data model?
  3. Build focused and specialised models within the identified natural boundaries to capture the multiple facets of the system. No one model can address all present and future requirements and concerns; taking a purely top-down approach to design can only create complexity.
  4. Compose the models into a complete system (a minimal sketch follows this list).
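As a rough sketch of steps 3 and 4, consider two focused models composed behind a thin facade. All names here are hypothetical, and real stores are replaced by in-memory stand-ins; the point is only the shape of the separation:

    import java.util.*;

    // Concern: keep a durable, append-friendly record of what happened.
    class AuditTrail {
        private final List<String> entries = new ArrayList<>();
        void append(String entry) { entries.add(entry); }
        int size() { return entries.size(); }
    }

    // Concern: answer fast lookups by customer.
    class CustomerIndex {
        private final Map<String, List<String>> byCustomer = new HashMap<>();
        void index(String customer, String fact) {
            byCustomer.computeIfAbsent(customer, k -> new ArrayList<>()).add(fact);
        }
        List<String> factsFor(String customer) {
            return byCustomer.getOrDefault(customer, Collections.emptyList());
        }
    }

    // Step 4: composition routes each incoming fact to every interested model.
    public class Composition {
        private final AuditTrail trail = new AuditTrail();
        private final CustomerIndex index = new CustomerIndex();

        void accept(String customer, String fact) {
            trail.append(customer + ": " + fact);  // write-optimised concern
            index.index(customer, fact);           // read-optimised concern
        }

        public static void main(String[] args) {
            Composition c = new Composition();
            c.accept("alice", "opened account");
            c.accept("alice", "placed order #42");
            System.out.println(c.index.factsFor("alice"));
            System.out.println(c.trail.size());
        }
    }

Neither model knows the other exists; each can evolve, or move to a dedicated store, without disturbing its neighbour.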
This approach may sound counterintuitive at first, but it is simply a form of separation of concerns that focuses primarily on how data flows across the whole system, rather than on the traditional architectural layers of presentation, business logic and persistence. Indeed, if we commonly accept that computation (algorithms) should be broken down into small, focused units, it is not unreasonable to apply similar thinking to data.
In reality, the same concept is present in other architectural paradigms. In CQRS, for example, there are separate models for reads and for updates. Polyglot persistence advocates exploiting different data persistence technologies for different kinds of data. Both are very similar to separation of concerns based on data flow, as long as we don't end up with one monolithic model connected to multiple backends, in which case we are not really addressing the crux of the matter.
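To make the read/update split concrete, here is a toy CQRS sketch (hypothetical names; a real system would typically update the read model asynchronously, via events):

    import java.util.*;

    // Read side: a query-optimised shape, kept up to date by a projection.
    class OrderReadModel {
        private final Set<String> openOrders = new HashSet<>();
        void project(String orderId) { openOrders.add(orderId); }
        boolean isOpen(String orderId) { return openOrders.contains(orderId); }
    }

    // Write side: accepts commands and records events; it never serves queries.
    public class OrderCommands {
        private final List<String> events = new ArrayList<>();
        private final OrderReadModel readModel;

        OrderCommands(OrderReadModel readModel) { this.readModel = readModel; }

        void placeOrder(String orderId) {
            events.add("OrderPlaced:" + orderId); // record the update
            readModel.project(orderId);           // projection refreshes the read side
        }

        public static void main(String[] args) {
            OrderReadModel reads = new OrderReadModel();
            OrderCommands writes = new OrderCommands(reads);
            writes.placeOrder("42");
            System.out.println(reads.isOpen("42")); // true
        }
    }

The two models can be shaped, stored and scaled independently, precisely because neither is asked to serve both concerns.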
The main benefit accrued from this form of separation of concerns is true decoupling. Natural boundaries in the system become apparent, which helps when breaking it down into reusable components. Each component operates on a model tailored to its concern, resulting in simplified code, improved maintainability and potentially increased performance.

Implementation Strategies

A number of implementation strategies are available to help achieve this. Microservices work well for general-purpose data platforms, where each service can be implemented in a way that is closer to its natural underlying model, achieving a higher degree of simplification. For example, a system collecting events related to a telco network can expose two services: one that stores all incoming events in a column-family store, and one that maintains a connected model of the network in a graph database. We can easily imagine other ways those services could be exploited, individually or in combination, to meet new requirements.
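The seam between those two hypothetical services might look something like the sketch below; both stores are stubbed in memory here, but each interface could equally be backed by a column-family store and a graph database respectively:

    import java.util.*;

    // Service 1: scalable capture of raw network events. A real implementation
    // might sit in front of a column-family store; here it is only a contract.
    interface EventIngestionService {
        void record(String elementId, String event);
    }

    // Service 2: a connected model of the network, suited to graph queries.
    interface NetworkTopologyService {
        void connect(String elementA, String elementB);
        Set<String> neighboursOf(String elementId);
    }

    // In-memory stand-in for the topology service, just to show the shape.
    class InMemoryTopology implements NetworkTopologyService {
        private final Map<String, Set<String>> adjacency = new HashMap<>();
        public void connect(String a, String b) {
            adjacency.computeIfAbsent(a, k -> new HashSet<>()).add(b);
            adjacency.computeIfAbsent(b, k -> new HashSet<>()).add(a);
        }
        public Set<String> neighboursOf(String id) {
            return adjacency.getOrDefault(id, Collections.emptySet());
        }
    }

Because each service owns its model outright, either one can later be scaled out or reimplemented on a different store without disturbing the other.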
For simple systems, data models can be co-located within the same application, but they still need to be conceptually separate.
If the data is highly volatile, as with trades in a trading system, we cannot afford a lengthy store-then-process cycle. In this case, we can use an in-memory data grid so that the data is kept in its simplest form (here, its in-memory object representation) and processed by a number of collaborating services: while one service ingests incoming data on one node, other services can run real-time queries concurrently.
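As a rough single-process stand-in for that idea (names are hypothetical; a real data grid would distribute the map across nodes), the sketch below keeps trades as plain objects in a concurrent map and runs queries against the live data while ingestion is still in progress:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class TradeGridSketch {
        record Trade(String id, String symbol, double price) {}

        static final Map<String, Trade> trades = new ConcurrentHashMap<>();

        public static void main(String[] args) throws InterruptedException {
            Thread ingester = new Thread(() -> {      // "ingestion service"
                for (int i = 0; i < 100_000; i++)
                    trades.put("T" + i, new Trade("T" + i, "ACME", 100 + i % 7));
            });
            Thread querier = new Thread(() -> {       // "real-time query service"
                for (int q = 0; q < 5; q++) {         // repeated live snapshots
                    long hits = trades.values().stream()
                            .filter(t -> t.price() > 103).count();
                    System.out.println("snapshot " + q + ": " + hits + " trades over 103");
                }
            });
            ingester.start();
            querier.start();                          // queries run against live data
            ingester.join();
            querier.join();
        }
    }

Each snapshot sees whatever has been ingested so far; the data never leaves its in-memory object form on its way from ingestion to query.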
Other applications of the same principle are possible; which one fits always depends on the nature of the data.
P.S. I gave a presentation based on the same ideas at London's NoSQL Search Roadshow. The slides can be found here.