Improve resilience and scalability with event-driven architecture

What is event-driven architecture?

Event-driven architecture is an alternative to request-response architecture and allows services to be more loosely coupled.

In a request-response architecture, a consumer service sends a request in a specific format to a provider service and waits for it to respond in a specific way. This couples the consumer service to the provider service: each has to send well-formed data in the shape the other expects.

In an event-driven architecture, one service will publish an event and many other services can then consume that event, but those consumer services won’t be able to respond directly to the service that published the event.
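As a rough illustration of the difference, here is a minimal in-memory sketch in Python. The `get_price` provider, the `publish` function, and the `order.placed` event name are hypothetical names chosen for this example; a real system would use network calls and a message broker rather than plain functions and a list of callbacks.

```python
from typing import Callable, List

# Request-response: the consumer calls the provider in an agreed format
# and waits for a reply in an agreed format.
def get_price(request: dict) -> dict:
    return {"sku": request["sku"], "price_pence": 1999}


reply = get_price({"sku": "widget"})

# Event-driven: the publisher hands the event to every subscriber
# and gets nothing back from them.
subscribers: List[Callable[[dict], None]] = []
seen: List[str] = []


def publish(event: dict) -> None:
    for handler in subscribers:
        handler(event)


subscribers.append(lambda event: seen.append("email"))
subscribers.append(lambda event: seen.append("analytics"))

outcome = publish({"type": "order.placed", "order_id": 42})
```

Here `reply` carries data back to the caller, while `outcome` is `None`: the publisher learns nothing about what the two subscribers did with the event.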

How can an event-driven system improve resilience?

Let’s look at an example scenario where an event-driven system will improve resilience.

Here’s a request-response example of an order being placed on a marketplace application.

Request response architecture flow chart

In this example, you can see that the API has many jobs to do. It:

validates the request,
saves the order,
reduces stock levels,
calls another service to send an order confirmation notification,
calls another service to notify the seller of the order,
calls another service to update the customer's segmentation data, and
then responds to the client
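The steps above can be sketched as a single handler that does everything before it responds. This is only an illustration: `place_order`, the in-memory `orders` and `stock` dictionaries, and the three downstream functions are hypothetical stand-ins for a real database and real service calls.

```python
from typing import Dict, List

orders: Dict[int, dict] = {}
stock: Dict[str, int] = {"widget": 5}
calls: List[str] = []


def send_confirmation(order: dict) -> None:
    calls.append("confirmation")


def notify_seller(order: dict) -> None:
    calls.append("seller")


def update_segmentation(order: dict) -> None:
    calls.append("segmentation")


def place_order(order_id: int, sku: str, quantity: int) -> dict:
    # Validate the request.
    if quantity <= 0 or stock.get(sku, 0) < quantity:
        return {"status": 400}
    # Save the order and reduce stock levels.
    order = {"id": order_id, "sku": sku, "quantity": quantity}
    orders[order_id] = order
    stock[sku] -= quantity
    # Call each downstream service in turn; the client waits for all of them.
    send_confirmation(order)
    notify_seller(order)
    update_segmentation(order)
    # Only now respond to the client.
    return {"status": 201, "order_id": order_id}


response = place_order(1, "widget", 2)
```

Every downstream call sits on the request path, so the response time is the sum of all of them.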

Having multiple jobs to do means API requests take longer, which can frustrate customers and reduce their confidence in your application. There’s also more chance that something could go wrong. If something does go wrong during a request, what state will the order be left in? And what will we tell the user?

Let’s imagine that there’s a problem with the service we use to notify sellers. Here are a few ways we could handle it (not exhaustive):

We don’t catch or handle the error and just return a 500 error to the user, letting them know the order failed. But we’ve already updated the database and sent a notification to the buyer.
We could wrap the database queries in a transaction and only commit them when everything else succeeds. But this means we could be rejecting orders because of an issue with a third party, and we’ve already sent a confirmation email to the customer.
We could catch these errors and ignore them. The order will still succeed, but we might have failed to send some comms to the seller.
We could code some retry logic to keep trying until the notification succeeds, but this will slow down the API further and still isn’t guaranteed to succeed.
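To make the first option concrete, here is a small sketch of the inconsistent state it leaves behind. All names are hypothetical, and `handle_request` stands in for whatever a web framework does when a handler raises.

```python
from typing import Dict, List

orders: Dict[int, dict] = {}
sent: List[str] = []


class SellerServiceDown(Exception):
    pass


def send_confirmation(order: dict) -> None:
    sent.append("confirmation")


def notify_seller(order: dict) -> None:
    # Simulate the outage described in the scenario.
    raise SellerServiceDown("seller notification service unavailable")


def place_order(order_id: int) -> dict:
    order = {"id": order_id}
    orders[order_id] = order   # the order is already persisted
    send_confirmation(order)   # the buyer has already been told it succeeded
    notify_seller(order)       # the request fails here
    return {"status": 201}


def handle_request(order_id: int) -> dict:
    try:
        return place_order(order_id)
    except Exception:
        # An unhandled downstream error surfaces to the client as a 500.
        return {"status": 500}


response = handle_request(7)
```

The client sees a 500, yet order 7 is in the database and the confirmation has gone out, which is exactly the mismatch the first option describes.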

Now let’s take a look at the same scenario with an event-based system.

Event driven architecture flow chart

Here you can see that the API:

performs the validation,
saves the order,
reduces stock levels, and
publishes an event to an event bus
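A minimal sketch of this slimmer handler might look like the following. The `queue.Queue` is only a stand-in for a real event bus client, and `place_order`, `orders`, `stock`, and the `order.placed` event type are hypothetical names used for illustration.

```python
import queue
from typing import Dict

orders: Dict[int, dict] = {}
stock: Dict[str, int] = {"widget": 5}
event_bus = queue.Queue()  # hypothetical stand-in for a real event bus client


def place_order(order_id: int, sku: str, quantity: int) -> dict:
    # Perform the validation.
    if quantity <= 0 or stock.get(sku, 0) < quantity:
        return {"status": 400}
    # Save the order and reduce stock levels.
    order = {"id": order_id, "sku": sku, "quantity": quantity}
    orders[order_id] = order
    stock[sku] -= quantity
    # Publish an event; no downstream service is called on the request path.
    event_bus.put({"type": "order.placed", "order": order})
    return {"status": 201, "order_id": order_id}


response = place_order(1, "widget", 2)
```

The handler returns as soon as the event is on the bus, so a slow or failing notification service no longer affects the response.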

This is the bare minimum that this API needs to do to be able to tell the client that the request was successful. The other actions can be processed in the background in their own time.

Once an event is published to the event bus, many subscribers can consume that event and perform additional actions. In our case:

Send order confirmation
Notify the seller
Update customer segmentation

In this event-based system, additional subscribers can be added at any time without needing to alter the API.
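The sketch below illustrates that property with a toy in-memory bus. The `EventBus` class, the `order.placed` event type, and the "sales dashboard" subscriber are assumptions made up for this example; the toy bus also calls handlers synchronously, whereas a real event bus would typically deliver events asynchronously.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBus:
    """Toy in-memory bus; a real deployment would use a managed broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, event: dict) -> None:
        for handler in list(self._handlers[event_type]):
            handler(event)


bus = EventBus()
log: List[str] = []


def place_order(order_id: int) -> dict:
    # The API only knows about the bus, not about who is listening.
    bus.publish("order.placed", {"order_id": order_id})
    return {"status": 201, "order_id": order_id}


bus.subscribe("order.placed", lambda e: log.append("send order confirmation"))
bus.subscribe("order.placed", lambda e: log.append("notify the seller"))
bus.subscribe("order.placed", lambda e: log.append("update customer segmentation"))
place_order(1)

# Later, a new subscriber is added; place_order is not modified.
bus.subscribe("order.placed", lambda e: log.append("update sales dashboard"))
place_order(2)
```

The second order triggers four actions instead of three, and the only code that changed was the subscription list.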

As in our request-response example, there could be a problem with the service we use to notify sellers. With our event-based architecture, we don’t have to reject the order: we’ve already stored the order details and responded to the client. Depending on the event bus technology you’re using, you can configure retry logic with various back-off settings so that delivery of the event is retried until the service succeeds. You can also configure failed messages to land in a dead-letter queue and alert the product team, so they can investigate the issue and replay the event once it is resolved.
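Managed event buses usually provide retries and dead-letter queues as configuration, so you would not normally write this yourself. Purely to show the idea, here is a hand-rolled sketch of exponential back-off with a dead-letter list. The `deliver` function, its parameters, and the handlers are hypothetical; the `sleep` argument is injectable so the example can record delays instead of actually waiting.

```python
import time
from typing import Callable, List

dead_letter_queue: List[dict] = []


def deliver(
    event: dict,
    handler: Callable[[dict], None],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the handler up to max_attempts times, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception:
            if attempt == max_attempts:
                break
            sleep(base_delay * 2 ** (attempt - 1))
    # Retries exhausted: park the event so it can be inspected and replayed.
    dead_letter_queue.append(event)
    return False


attempts = {"count": 0}


def flaky_notify_seller(event: dict) -> None:
    # Fails twice, then recovers.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("seller service unavailable")


def broken_notify_seller(event: dict) -> None:
    raise RuntimeError("seller service unavailable")


flaky_delays: List[float] = []
flaky_ok = deliver({"order_id": 1}, flaky_notify_seller, sleep=flaky_delays.append)

broken_delays: List[float] = []
broken_ok = deliver({"order_id": 2}, broken_notify_seller, sleep=broken_delays.append)
```

The first event is delivered on the third attempt. The second exhausts its retries and ends up in `dead_letter_queue`, where it can be replayed by calling `deliver` again once the seller service is healthy.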

How can event-driven systems improve scalability?

Event-driven systems make it easier to scale your applications and user base without degrading performance.

You can easily add event consumers to perform other side effects or analysis without having to change the client-facing services. Each event consumer can also emit any number of other events if you've got a new use case to add.

As your user base grows, an event-based system sees a much lower impact on user-facing performance because a lot of the heavy lifting is done asynchronously in the background while the user continues their journey. Growth might mean that your event consumers can't process events as quickly as they arrive, but that's generally OK, as the events will simply be queued up and processed eventually.
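The queueing behaviour can be illustrated with Python's standard `queue` and `threading` modules. This is a single-process sketch under simplifying assumptions: the in-memory queue stands in for a durable message queue, the `worker` function is a hypothetical consumer, and `None` is used as a stop signal.

```python
import queue
import threading
from typing import List

events = queue.Queue()
processed: List[int] = []


def worker() -> None:
    # Consume events one at a time until the stop signal (None) arrives.
    while True:
        event = events.get()
        if event is None:
            events.task_done()
            break
        processed.append(event["order_id"])
        events.task_done()


# A burst of orders arrives before the consumer has started.
for order_id in range(100):
    events.put({"order_id": order_id})
backlog = events.qsize()

consumer = threading.Thread(target=worker)
consumer.start()
events.join()     # wait until the backlog has been worked through
events.put(None)  # tell the consumer to stop
consumer.join()
```

Publishing the 100 events never waited on the consumer, and the backlog was still processed in full. If the backlog grows too large, you could start more consumer threads (or, in a real system, more consumer instances) reading from the same queue.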

Event-based tools

Here are some tools and services you can use to build event-based systems.

In GCP the cloud-native managed services are:

In AWS the cloud-native managed services are:

SNS (Simple Notification Service)
SQS (Simple Queue Service)

Probably the most popular non-cloud managed alternatives are:

Summary

It doesn’t have to be one or the other when it comes to event-based architectures versus request-response. Use the right mechanism for the job. If you need to call another service and you need its response, then use a request-response API call. If you can happily fire-and-forget, then use an event-based system.

Event-based systems are not only more performant and resilient, but they’re also much easier to debug and recover when things go wrong.

Written by Rob Caiger

Co-founder, Engineering Lead

21 Dec 2022