Introduction
2 words "distributed" and "system" meaning a system which is built up of multiple components that work together as-a-whole. For instance, consider an online platform like Amazon. It is made up of various services to handle inventory management, authentication, order processing, and payments and many more. These services must communicate effectively to ensure that users experience a seamless shopping experience. For example, when a customer places an order, the system needs to update the inventory, process the payment, and prepare the shipment. These components may be separated physically across different networks or located on a same network.
System separated Physically vs. Local
Firstly, two aspects to understand: components separated physically and components on the same network.
Physically separated | On a same network (Local) |
Distributed geographically. Meaning some Service A might be deployed in Europe and some Service B might be deployed in Asia. Despite being physically separated, they form a cohesive unit. | Distributed locally. Meaning some Service A and Service B might be deployed within the same network, or a data center to form as a cohesive unit. |
Communication between services occurs over a WAN (wide area network) or over the internet. | Communication between services occurs over a LAN (local area network), an internal network (intranet), or the internet. |
High latency in geographically distributed systems is due to longer distances data must travel. | Lower latency in the same network are due to shorter data travel distances. |
Data consistency is challenging. Keeping all distributed services synchronized is hard. | Data consistency is less challenging. |
Easily scalable. Adding more resources (horizontal scaling) across different locations can handle increased demand, but requires managing network latency and data consistency. | Is scalable, but adding more resources can cause throughput and bandwidth issues when network becomes congested. |
Building a bad distributed system is straightforward, just call an RPC on another service and ignore resiliency, data consistency, fault tolerance, security, and other crucial aspects. However, building a robust, efficient, and reliable distributed system is challenging and complex.
Now its clear about "being distributed", questions like how to establish communication between these components, have data consistency all over the component, handle failures, and many more arises.
Communication
There are several ways to establish communication between services. Effective communication is fundamental to ensuring that the system operates as a cohesive unit. In general, communication can be categorized into two types: synchronous and asynchronous.
Synchronous communication
In synchronous communication, a service makes a request and waits for a response before proceeding. Also know as the request-response pattern. Basically some service A
request to service B
and waits for service B
to process the request. Example using synchronous communication:
private List<string> _database = new List<string>();
public async Task<bool> CreateNewCustomer(string customer)
{
_database.Add(customer);
// also save to service B
using HttpClient client = new HttpClient();
HttpResponseMessage response = await client.PostAsync("https://api.serviceb.com/v1/welcome/sendEmail", new StringContent(customer));
response.EnsureSuccessStatusCode();
return response.IsSuccessStatusCode;
}
In this example, Service A first adds the customer to the local _database
with _database.Add(customer);
and then makes a POST request to Service B to send a welcome email. This is a common practice where, upon a new user's registration, a welcome email is sent to enhance user engagement.
Although this function is async
, the interaction between Service A and Service B is synchronous in practice. When await
is used to POST request service B, it pauses the execution of the CreateNewCustomer()
(but doesn't block thread) until the awaited task completes (more on async await). During this pause, control returns to the calling method, and the thread is free to perform other operations. This means Service A waits for Service B's response before moving forward. The function does not complete until the response from Service B is received, effectively making the interaction synchronous.
What ifs ...
Service B is slow or down? Service B might be down for maintenance.
Service B throws an error or Service B throttles due to high number of users registration. Explanation: If a large number of users register, Service A processes the registrations, but when it tries to send welcome emails, Service B might throttle the requests due to high volume. A common approach to manage email sending is through queuing. However, queues have limits and can only handle a certain number of requests at a time. If the queue is full, additional requests may be delayed or rejected, leading to throttling issues.
After sending welcome email i would also like to include them in some sort of marketing camp. Adding another call to Service C?
- What if Service A (creating customer), and Service B(sending welcome email) succeed but service C (marketing camp) failed?
Asynchronous communication
In asynchronous communication, a service makes a request but doesn't wait for the response. It just moves forward. Basically some Service A requests Service B without waiting for Service B response.
Some ways to achieve asynchronous communication are:
Messaging
Basically sending data in a non-blocking way using some sort of queuing system. A message queue can be defined as a process where messages (data) are stored in a queue and then some other services processes it. Working mechanism: Some Service A sends data (a producer) to a message queue, then some service B processes (a consumer) the queue accordingly. The queue acts as a buffer which holds messages until they are processed by a consumer (or delete after some time). Messaging solves a crucial issue we had in synchronous communication, i.e. what if some n number of services are added to increase user enhancement:
- After sending welcome email i would also like to include them in some sort of marketing camp. Adding another call to Service C?
Using a message queue, any number of services can consume the messages from the queue. Queues can further be categorized into topics. Topic helps to categorize messages(data). Messaging can be visualized like so:
In summary, messaging with queuing technology provides a robust and resilience way to achieve asynchronous communication.
Pub-Sub pattern
The publish-subscribe (pub-sub) pattern is another asynchronous way to commute between services. Simply understanding pub-sub, a service (publisher) publishes a message and other services (subscribers) interested in that publisher receives that message. This is very similar to YouTube channel where user can subscribe to their favorite creator. User will be notified when their fav creator publish a video.
Publisher: Service which sends a message to a message broker. Publisher sends message without knowing who will receive them.
Message broker: Sorta middleman whose job is to distribute message from Service A to Service B or from Service A to multiple services. It ensures messages are delivered to all interested subscribers.
Subscriber: Service which receives message through a broker.
Pub-sub pattern can be visualized as below:
Pub-sub using Web sockets
One way to implement pub/sub pattern between services is to use web sockets. Web socket provides a real time, point to point communication channel in a long lived connection between services. Simply understanding, using web sockets allows a service to send messages to other services instantly and continuously without reconnecting every single time. There are 2 key components in web sockets: publisher and subscriber (same as above). The publisher broadcasts(send) messages to all subscribers connected to the Web Socket, allowing for immediate and continuous communication.
Pub-sub pattern using web socket can be easily implemented in .NET applications using SignalR. (I'll go more into this later). SignalR simplifies working with web socket and provides high level abstraction to jump right into logic without worrying about low-level details.
Pub-sub using web socket can be visualized as below:
So basically, Pub-sub using web socket (SignalR) enables direct and close to real-time communication between services (publisher and subscribers).
Event-driven approach
Event-driven approach (or Event-driven Architecture) (EDA) is an pattern where systems operate based on some events. EDA can be explained using a simple example of vending machine. Lets say you want to get a drink (Coke) from a vending machine:
You select your choice of drink (i.e. Coke) (
ItemSelected
)Then you make a payment using some sort of online payment vendors or cash
PaymentCompleted
The vending machine verifies the payment (after
PaymentCompleted
event).If payment was successful, the vending machine will dispense the drink. (
DrinkDispensed
)
Each action generates an event. Generated events trigger more events. Events drive the flow of the system.
A key pattern to observe in above bullet points, all events are past tensed. This is because an event represent an action or changes that has already occurred in the system. Event cannot be ItemSelected
or CompletePayment
, or DispenseADrink
, etc.
In EDA, a system is designed to respond to actions that have already occurred and perform subsequent operations based on these actions. The ordering of events is critical, as the sequence of actions determines the flow of the system. Additionally, ensuring transactional integrity is essential to maintain consistency and reliability within the system. System can be "transactional" by using Saga pattern, and event sourcing pattern (More on this later).
If i had to design a software system for vending machine:
Here,
Event(Service) Bus : Routes events between different services. It ensures events are properly delivered to handle them. This can be a Service Z.
Service A : Can be a frontend where user will select their choice of drink. This will generate
ItemSelected
event.Service B : User makes a payment for their choice of drink. This will generate
PaymentCompleted
event.Service C : User will receive their choice of drink. This will generate
DrinkDispensed
event.
That's how a typical EDA can be explained but this is not perfect for real world scenario.
What ifs..
In Service A, there is no coke (selected drink)?
Payment failed?
DrinkDispensed
occured beforePaymentCompleted
or inappropriate ordering of events? (more about Event Sourcing in later articles)Some services are unresponsive.
Duplicate events.
DrinkDispensed
occured twice (well, it can be. What if the user paid for 2 drinks? But what about ifPaymentCompleted
occured twice?).Loss of a event. User paid for a drink but no response. Why? Services being unresponsive, loss of a event in-between?
In Event-Driven Approach (EDA), it is crucial to conduct thorough requirements gathering to identify all possible actions and events. Implementing robust error handling, ensuring transactional integrity, and incorporating fault tolerance strategies are essential for maintaining system reliability and efficiency in real-world scenarios. These measures ensure that the system can handle unexpected situations, such as errors, service unresponsiveness, and event inconsistencies, effectively.
Summary
In this article, I covered the basics of distributed systems, including the differences between geographically and locally distributed systems. Effective communication between services can be synchronous or asynchronous, with methods like messaging and pub-sub patterns enhancing decoupling and resilience. The Event-Driven Approach (EDA) was explored as a method for handling operations based on events, emphasizing the need for thorough event handling, error management, and maintaining transactional integrity. Proper implementation of these concepts is essential for building reliable and efficient distributed systems. Stay tuned for more detailed discussions on each topic.