1. Introduction
In this article, I'm going to introduce you to distributed systems. These systems are ubiquitous today, even though we may not realize we are dealing with one when we use them. And that's the whole point of distributed systems: in a well-designed distributed application, all the complexity should be abstracted away from the end user.
I also describe the different kinds of systems deployed today and their applications, and I discuss the goals of a distributed system and how we can design a system that achieves those goals.
2. The view from 30,000 feet
Recent years have seen a proliferation of distributed systems in operation. These are not just obscure systems operated by large enterprises to manage their businesses; they are systems that you and I use almost daily. Google's search engine, YouTube, LinkedIn, Facebook, Twitter, Instagram, WhatsApp, etc. are all examples of distributed systems that fill very different needs for their users.
These systems are unlike the High Performance Computing (HPC) clusters that underwent their own boom in the '90s. HPC clusters are useful for compute-intensive tasks like simulating physical processes, weather forecasting, orbital mechanics for space missions, and many more. These tasks require a very large amount of CPU processing capacity that HPC clusters are well suited to provide.
This article, however, is not about the compute-intensive tasks that HPC clusters handle. It's about a newer class of problems that has come up in recent years.
In the early 2000's, a few things were going on that created the perfect conditions to give birth to this class of distributed systems: the cost of disk storage fell off a cliff, commodity CPUs became multi-core, and ultra-personal computing arrived in the form of smartphones. It was a marriage made in heaven. Smartphones could generate huge amounts of data, multi-core CPUs let us process that data more efficiently, and cheap storage meant we could actually save the data before we decided what to do with it.
Once we were able to store the data, we had to figure out how to use it.
Enter BigData. Even though BigData and distributed systems in general are not the same thing, they are closely related: BigData tools are a sub-component of today's distributed systems. We wouldn't need distributed systems if we didn't have huge amounts of data that cannot be stored on a single machine.
The distributed databases that form the backbone of BigData tools also form the backbone of today's distributed systems.
3. So what is a Distributed System anyway?
Formally, a distributed system can be defined as "a collection of autonomous computing elements that appears to its users as a single coherent system." [1]
The fact that the application connects to different servers based on where the user is connecting from should be transparent to the user and work seamlessly.
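To make that transparency concrete, here is a minimal sketch in Python (the server names and the fetch_profile function are hypothetical, not from any real system) of a client library that hides server selection behind a single call; the caller never learns which replica answered:

```python
import random

# Hypothetical replica list; in a real system this would come from DNS,
# a service registry, or a geo-aware load balancer.
REPLICAS = ["eu-west.example.com", "us-east.example.com", "ap-south.example.com"]

def pick_server():
    """Choose a replica for this request; the policy (nearest region,
    lowest latency, round-robin) is an internal detail."""
    return random.choice(REPLICAS)

def fetch_profile(user_id):
    server = pick_server()
    # The caller never learns which server handled the request;
    # that routing decision stays behind this function boundary.
    return {"user": user_id, "served_by": server}

print(fetch_profile(42))
```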
Informally, we can consider a system distributed if it allows multiple users to utilize and share its resources and functionality at a given time. In addition, such a system can serve a growing number of clients by adding more hardware resources; in most modern systems, that means extra hardware nodes that share the load of the increasing client base.
What makes distributed systems so fascinating, in my opinion, is not that they provide functionality to a large number of users, but that they do so by building on multiple underlying components, most of which work on the assumption that the underlying hardware and software are prone to error and failure.
A distributed system is thus expected to continue working in the face of node failures, disk drive failures, network outages, human error and unexpected user interaction.
Software engineers and system architects are expected to know various design patterns utilized in distributed systems and to have a technical understanding of the various underlying components that make up distributed systems, along with their tradeoffs.
But even more important than the technical details of the technologies are the principles of distributed systems, which don't change very much; the tools have to adhere to these principles. If you understand these principles, you will know how to create well-designed systems regardless of which tools are currently in vogue.
As a system designer, your goal is to decide which technology is appropriate for which purpose and to know how different tools can be combined to form the foundation of well-architected applications. Once you are able to do this, you will have a good intuition for what your system is doing under the hood. That's how you will be able to make good design decisions and create robust distributed systems.
4. Goals of a Distributed System
There are three major goals of a distributed system that a designer should always keep in mind while architecting such a system: Reliability, Scalability, and Maintainability. [2]
a. Reliability
Reliability of a system is defined by how it behaves when things go wrong. This could mean individual hardware components failing, entire server nodes failing, or software faults that cause cascading failures across multiple nodes or processes.
A distributed system is expected to gracefully recover from such failures and provide a consistent level of service to its users.
Designers use different tools to make a system as resilient as possible to each of these challenges.
To ensure the system is tolerant of hardware failures, designers use a combination of underlying tools that have hardware fault tolerance built in, along with distributing the application across multiple machines. If any one node fails, client requests are forwarded to nodes that are still online, so end users continue to receive service in the face of hardware failures.
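As a rough sketch of that failover behavior (the node names and the send_request helper below are hypothetical stand-ins), a client can simply try each replica in turn and skip nodes that are down:

```python
class NodeDown(Exception):
    """Raised when a node cannot be reached (a stand-in for a real network error)."""

def send_request(node, payload):
    # Hypothetical transport: pretend node-1 is offline.
    if node == "node-1":
        raise NodeDown(node)
    return f"{node} handled {payload}"

def request_with_failover(nodes, payload):
    """Try each replica in order and return the first successful response."""
    for node in nodes:
        try:
            return send_request(node, payload)
        except NodeDown:
            continue  # this node failed; forward the request to the next one
    raise RuntimeError("all replicas are down")

print(request_with_failover(["node-1", "node-2", "node-3"], "GET /users/42"))
```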
For software errors, it is recommended that designers thoroughly test the various components of the application with unit tests, system integration tests, and manual tests to ensure the application is tolerant of software bugs.
In addition to hardware and software issues, we should not overlook the human element. Humans make mistakes, so a system should either tolerate human errors or make them quick to reverse. Detailed and clear monitoring helps operations teams spot early warning signs of system issues and take steps to resolve them.
The system should not attempt to prevent every issue, nor can it be designed to. A good slogan for a system designer is the philosophy behind Erlang: "Let it fail". I'll have more to say about Erlang and Elixir later in this series.
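To give a flavor of "let it fail" outside Erlang, here is a minimal, hypothetical sketch of a supervisor loop in Python: the worker is allowed to crash, and the supervisor simply restarts it, up to a bounded number of attempts:

```python
import random

def flaky_worker():
    """A worker that sometimes crashes; we make no attempt to prevent it."""
    if random.random() < 0.5:
        raise RuntimeError("worker crashed")
    return "work done"

def supervise(task, max_restarts=5):
    """Restart the task when it fails instead of trying to prevent the failure."""
    for attempt in range(1, max_restarts + 1):
        try:
            return task()
        except RuntimeError as err:
            print(f"attempt {attempt} failed ({err}); restarting")
    raise RuntimeError("giving up after repeated failures")

print(supervise(flaky_worker))
```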
If a system is designed to run this way, it can transparently distribute load across multiple hardware and software resources, thus fulfilling the first goal of reliability.
b. Scalability
Scalability is the ability of a system to handle increased load on its resources. Most often, the way to handle increased load is to scale the system, which can be done in two ways: scaling up or scaling out.
Scaling up has been the traditional response to increased load. It involves purchasing a more powerful machine with a faster processor, perhaps one with multiple CPUs and cores, along with more memory and maybe a high-speed SSD.
Scaling up, however, can become very expensive very quickly, and it does not guarantee scalability if the load increases even more.
Most modern distributed systems are designed to scale out instead. This gives designers the ability to add more machines to the system to handle the higher load, as the sketch below illustrates.
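Here is a minimal, hypothetical sketch of the idea: requests are spread across a pool of nodes (round-robin, in this case), so adding a node to the pool immediately adds capacity without the clients changing at all:

```python
class NodePool:
    """Spreads requests across a pool of nodes; growing the pool adds capacity."""
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0

    def add_node(self, node):
        # Scaling out: handle more load by registering another machine.
        self.nodes.append(node)

    def route(self, request):
        # Simple round-robin; real systems use smarter balancing policies.
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        return f"{node} <- {request}"

pool = NodePool(["node-1", "node-2"])
print(pool.route("req-1"))  # node-1
print(pool.route("req-2"))  # node-2
pool.add_node("node-3")     # scale out under higher load
print(pool.route("req-3"))  # node-3
```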
In order to figure out the architecture of a system that can scale out, we must quantify what increased load actually means, and load means different things in different contexts. For a web server it could be requests per second; for a database, the ratio of reads to writes; elsewhere, the number of simultaneously active users, or the hit rate on a cache.
Some of the questions we can ask to reason about load and scalability are [2]:
1) When you increase a load parameter and keep system resources unchanged, how is the performance of the system affected?
2) When you increase a load parameter, how much do you need to increase resources if you want to keep your performance unchanged?
The only way to know whether scaling the system actually leads to an improvement is to constantly monitor the relevant load parameters. Which parameters are relevant is specific to your system, but you have to track them over time to see whether scaling made any difference.
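As a small, hypothetical illustration of tracking a load parameter, here is a sliding-window requests-per-second counter; in practice you would export such a metric to a monitoring system rather than compute it by hand:

```python
import time
from collections import deque

class RequestRateMonitor:
    """Tracks requests per second over a sliding window (one possible load parameter)."""
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.timestamps = deque()

    def record_request(self, now=None):
        self.timestamps.append(now if now is not None else time.time())

    def requests_per_second(self, now=None):
        now = now if now is not None else time.time()
        # Drop events that have fallen out of the sliding window.
        while self.timestamps and self.timestamps[0] < now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window

monitor = RequestRateMonitor(window_seconds=10)
for t in range(20):                      # simulate 20 requests, two per second
    monitor.record_request(now=t * 0.5)
print(monitor.requests_per_second(now=10.0))  # prints 2.0
```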
The various techniques available to us as architects will depend on the specific application we are trying to build; however, scalable architectures are built from general-purpose building blocks arranged in familiar patterns [2].
c. Maintainability
Now that our system is both reliable and scalable, it won't do us any good if we cannot maintain it properly.
In this context, maintenance involves three different characteristics of the system: operability, simplicity and evolvability.
i) Operability
Operability essentially means making the system easy to operate. The operations team should be able to monitor system resources and scalability parameters in real time and react to any systemic issues that arise from unforeseen spikes in load.
ii) Simplicity
The system should also be easy for new engineers to understand. This is done by removing as much complexity from the system as possible. In addition to reducing complexity, decoupling the various system components makes the overall system easier to reason about, since complexity is abstracted away within the individual sub-components.
iii) Evolvability
The system should be easy to update in the future as requirements change. It should be adaptable both to changing requirements and, importantly, to growing levels of usage.
Simple and easy to understand systems are usually easier to modify than complex ones.
5. Conclusion: Why should you care about Distributed Systems?
Whether you like it or not, whether you know it or not, you are building distributed scalable systems.
Are you a Business Intelligence analyst or developer? Guess what? When you create BI solutions for teams located in multiple time zones, with application servers in different locations, you are creating a distributed system. You might not use the same open source tools that power the largest systems, but your system still has some of the same requirements, even though your client base is not in the millions.
And if you are a BI analyst, most of the complexity is abstracted away into the vendor products that are ubiquitous in enterprises today.
However, the largest distributed systems that power the most popular apps today are not turnkey vendor products. They might use sub-components from vendors like MongoDB, Redis Labs, etc. to build the system, but no single vendor can support an ever-changing set of end-user requirements. To do that, you, the system architect, need to design the system in the ways discussed above, using mostly the open source tools that form the backbone of these systems.
Once you truly understand the design principles behind the various tools and technologies and some of the design patterns that power these systems, you may be well on your way to designing the next billion dollar idea.
6. References
- [1] van Steen, Maarten, and Tanenbaum, Andrew S. (2016). A Brief Introduction to Distributed Systems.
- [2] Kleppmann, Martin (2017). Designing Data-Intensive Applications, First Edition.