Serverless Diary: How to Design Fail-Fast Architectures using Circuit Breaker
Introduction
In my previous blogs, I have discussed various integration scenarios with different AWS services and 3rd-party applications, which may or may not be hosted on the AWS public cloud. One common theme across all serverless designs is that every interaction between services, whether internal or external, is a remote call. Remote calls are prone to slow responses, timeouts, and failures.
The focus of this blog is an approach to handling the various responses these remote calls can produce, and how we can apply the circuit breaker pattern in a serverless world with AWS Lambdas.
I. The Challenge
The COVID-19 pandemic is impacting everyone’s life. Let’s consider a scenario that the majority of us can relate to: “As a citizen, I want to book a COVID test nearest to my home.”
Step 1: Consider a user on the website who wants to book an appointment for a COVID test and is therefore checking the availability of the nearest walk-in test center he can go to. He makes a request to view the list of the nearest test centers showing availability.
Step 2: The request goes via the public API Gateway to the appropriate lambda responsible for executing the requested business logic.
Step 3 & 4: The lambda queries data store (Memcache or DynamoDb) to check test centers with available capacity that can allow appointment booking. Once the lambda retrieves the available test center it sorts and filters the test centers as per travel time (time required to travel from user location to test center location) by using a 3rd party geolocation service as highlighted in Step 5
Step 6: The lambda sends back a well-constructed, sorted response listing the nearest available test centers where the user can book an appointment.
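To make the flow above concrete, here is a minimal sketch of the lambda behind Steps 3–6. The helper modules (`capacity_store`, `geo_client`), their functions, and the field names are hypothetical placeholders for illustration, not a real API:

```python
import json

# Hypothetical helpers for illustration only; these modules are assumptions.
from capacity_store import get_available_centers  # Steps 3 & 4: query the data store
from geo_client import get_travel_times           # Step 5: 3rd-party geolocation call


def lambda_handler(event, context):
    """Return test centers with capacity, sorted by travel time from the user."""
    user_location = json.loads(event["body"])["location"]

    # Steps 3 & 4: fetch centers that still have booking capacity.
    centers = get_available_centers()

    # Step 5: enrich each center with the travel time from the user's location.
    travel_times = get_travel_times(user_location, [c["location"] for c in centers])
    for center, minutes in zip(centers, travel_times):
        center["travel_minutes"] = minutes

    # Step 6: respond with the list sorted by travel time, nearest first.
    centers.sort(key=lambda c: c["travel_minutes"])
    return {"statusCode": 200, "body": json.dumps(centers)}
```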
Let’s summarise the challenges and risks the ‘As Is’ design poses, focusing on Step 5:
- What happens when the call to the 3rd party times out due to problems at the service provider’s end?
– Bad user experience. The user receives a delayed response, only to see a technical/time-out error.
– Wasted computing power, and hence cost. Given we are working with a serverless architecture, the platform compensates for delayed responses by spinning up more lambdas than usual to serve concurrent requests, resulting in more cost with no benefit, as the end result is still a timeout.
- What happens when the 3rd party is working at maximum capacity and can’t serve all requests?
– HTTP 429 responses. The standard approach to handling 429s in a calling application is to apply backpressure through strategies such as exponential backoff. This stops you from bombarding your 3rd-party/critical service with a constant load. But this approach alone isn’t acceptable for critical 24/7 applications. In such cases, you want a fallback strategy that allows your critical application to keep functioning, even if less optimally. For our example, we can quickly fall back to a less optimal distance calculation based on geo-coordinates (instead of travel time) using the Haversine formula, as sketched after this list. The user can still book a test; the only sub-optimal point is that distances from home won’t be as realistic as actual travel times.
- How do you track and monitor the SLAs you are responsible for, and differentiate them from 3rd-party outages?
– Fail fast. Logging and fail-fast techniques help you respond quickly, improving the user experience and capturing metrics when the 3rd-party service provider is experiencing an outage. Coupled with the previous point, this helps ensure your critical services maintain high availability.
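To ground the fallback idea, here is a minimal sketch of both strategies from the list above: an exponential-backoff retry helper and a Haversine-based ranking that replaces real travel time with straight-line distance. The function names and the `lat`/`lon` field names are assumptions for illustration:

```python
import random
import time
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def with_backoff(fn, retries=3, base_delay=0.2):
    """Retry `fn` with exponential backoff plus jitter, e.g. after an HTTP 429."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: let the caller fall back
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in kilometres."""
    d_lat = radians(lat2 - lat1)
    d_lon = radians(lon2 - lon1)
    a = sin(d_lat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def sort_centers_fallback(user, centers):
    """Fallback ranking when the geolocation service is unavailable:
    order by straight-line distance instead of real travel time."""
    return sorted(
        centers,
        key=lambda c: haversine_km(user["lat"], user["lon"], c["lat"], c["lon"]),
    )
```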
II. The Solution
We can use the ‘Circuit Breaker’ pattern to address the challenges and issues highlighted in the previous section.
The above implementation extends the ‘As Is’ design to wrap the protected function call in a circuit breaker object (refer to Step 5), which monitors for failures. Once failures reach a certain threshold, the circuit breaker trips and the library calls the fallback function instead. The logic assumes exponential backoff is applied before trying the protected function again.
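A minimal sketch of that wrapping logic is below. The in-memory `state` dict stands in for the externally persisted circuit state discussed next, and the threshold and cooldown values are illustrative assumptions:

```python
import time

FAILURE_THRESHOLD = 5   # assumption: trip the breaker after 5 consecutive failures
COOLDOWN_SECONDS = 30   # assumption: how long the breaker stays OPEN before a trial call


def call_with_breaker(state, protected_fn, fallback_fn):
    """Wrap a protected remote call in circuit breaker logic."""
    if state["status"] == "OPEN":
        if int(time.time()) - state["opened_at"] < COOLDOWN_SECONDS:
            # Circuit is still tripped: skip the remote call and fail fast.
            return fallback_fn()
        state["status"] = "HALF_OPEN"  # cooldown elapsed: allow one trial call

    try:
        result = protected_fn()
    except Exception:
        state["failures"] += 1
        if state["status"] == "HALF_OPEN" or state["failures"] >= FAILURE_THRESHOLD:
            state["status"] = "OPEN"
            state["opened_at"] = int(time.time())
        return fallback_fn()

    # Success: close the circuit and reset the failure count.
    state["status"] = "CLOSED"
    state["failures"] = 0
    return result
```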
Let’s look at Step 5 in more detail. Given we are using lambda functions, maintaining state in memory won’t fit our use case, as it might for the majority of monolithic applications. The circuit state needs to be persisted externally to the lambda functions. This state should be accessible across a distributed network architecture and provide strong consistency. ElastiCache (Redis) or DynamoDB with DAX can be used to maintain the circuit breaker state with minimal impact on performance, given both stores offer very low latencies.
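As a sketch of the persistence side, the snippet below keeps the circuit state in a DynamoDB table. The table name `circuit-breaker-state` and its key schema are assumptions for illustration:

```python
import boto3

# Assumed table: partition key "service" (string); the name is hypothetical.
table = boto3.resource("dynamodb").Table("circuit-breaker-state")


def load_state(service_name):
    """Read the shared circuit state, defaulting to CLOSED if no record exists yet."""
    item = table.get_item(Key={"service": service_name}).get("Item")
    return item or {
        "service": service_name,
        "status": "CLOSED",
        "failures": 0,
        "opened_at": 0,
    }


def save_state(state):
    """Persist the updated state so every concurrent lambda sees the same circuit."""
    table.put_item(Item=state)
```

In a production version you would likely use conditional writes (or Redis transactions) so that concurrent lambdas don’t overwrite each other’s state transitions.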
The following self-explanatory flow chart expands on the Step 5 implementation logic. It demonstrates the management of the standard circuit breaker states: CLOSED, OPEN, and HALF_OPEN.
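Tying the sketches together, Step 5 could then look roughly like this; `sort_by_travel_time` stands in for the real 3rd-party travel-time call and is hypothetical:

```python
def rank_centers(user, centers):
    """Step 5 with the breaker applied: travel-time ranking, Haversine fallback."""
    state = load_state("geolocation-service")
    try:
        return call_with_breaker(
            state,
            lambda: sort_by_travel_time(user, centers),    # protected 3rd-party call
            lambda: sort_centers_fallback(user, centers),  # Haversine fallback
        )
    finally:
        save_state(state)  # persist any OPEN/HALF_OPEN/CLOSED transition
```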
III. 3 Key Takeaways
- The Circuit Breaker pattern is applicable to both internal and external integrations. Hence, it makes sense to package it as a reusable library or a Lambda layer.
- Ensure all your egress calls fail fast in case of a network/timeout issue. For example, set a 3-second timeout on egress calls (or an apt number as per the service provider’s SLA) so that threads are not unnecessarily occupied during downtime or performance issues at the service provider, as shown in the sketch after this list.
- Where possible, have fallback functionality in place for critical applications. This increases the availability of your applications even when your service provider is experiencing downtime. If that’s not possible, have a mechanism that gracefully offers degraded functionality rather than impacting the entire website or application.
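As a sketch of the second takeaway, here is what a fail-fast egress call might look like with Python’s `requests` library; the function name and the 3-second budget are assumptions:

```python
import requests


def get_travel_times_fast_fail(url, payload):
    """Cap how long we wait on the provider so lambdas aren't held up (and billed)
    while the 3rd party is down or slow."""
    try:
        response = requests.post(url, json=payload, timeout=3)  # 3-second ceiling
        response.raise_for_status()
        return response.json()
    except requests.exceptions.Timeout as exc:
        # Fail fast and log: this is where the circuit breaker records a failure.
        raise TimeoutError("geolocation provider exceeded the 3s budget") from exc
```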