Serverless Diary: What You Need To Know About Message Replay Strategy

Anuj Kothiyal
5 min read · Oct 24, 2021
Original Photo by Brett Jordan from Pexels

I. Introduction

Avoiding a ‘Single Point of Failure’ in your architecture is a well-established best practice. A good design assumes that every point in the architecture is prone to failure. This concept is equally important in the modern serverless era, where microservices architectures have more independent moving parts (separate processes, networks, physical components, etc.) than traditional n-tier architectures did.
The focus of this blog is a common yet essential concept and component in message-driven or event-based architecture: the Dead Letter Queue (DLQ).

II. About DLQ

The DLQ is the queue to which messages are sent when they cannot be routed to their correct destination. Sometimes referred to as an undelivered-message queue, it is a holding queue for messages that cannot be delivered to their destination queues, for example because the destination queue does not exist or is unreachable, the payload is invalid, or permissions are missing. It allows you to set aside and isolate unprocessed messages to determine why processing failed. This is particularly important for the operations team: depending on the cause of the failure, the messages may need to be replayed or deleted. For microservices and serverless-style architectures, using a queue as a DLQ is a well-known pattern. I wanted to share some ideas, using AWS as the reference public cloud provider, on how to minimize the manual intervention required to decide whether or not to replay these failed messages. In the world of AWS, the DLQ role is played by AWS SQS.
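To make the mechanics concrete, here is a minimal boto3 sketch of pairing a standard SQS queue with a DLQ via a redrive policy. The queue names and the maxReceiveCount of 3 are illustrative assumptions, not values from the architecture discussed later.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue names for illustration
dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# After 3 failed receives, SQS moves the message to the DLQ automatically
sqs.create_queue(
    QueueName="orders-queue",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "3"}
        )
    },
)
```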

III. The Challenge

The standard replay strategy of retrying a message x number of times before it is moved to a DLQ doesn’t necessarily cover all the use cases. What if a message needs to be retried after 6 hours, 12 hours, or more? Good architecture considers all possibilities. For a cloud-native solution, a few of the factors and use cases to consider and design for:

  1. The public cloud provider has a regional outage for a given service that lasts longer than your retry window, for example issues with the AWS Lambda or SQS service.
  2. The downstream system you are publishing messages to is unavailable over a weekend (assume no out-of-hours support over the weekend).
  3. For near real-time services such as AWS DynamoDB streams (and Kinesis data streams), it is advisable to keep the retry count small, because streams capture a time-ordered sequence of item-level modifications. As a consequence, if one message within a shard is blocked, so are the messages that follow, until the problematic message is delivered or expires.

IV. The Solution

Let’s look at two common ways in which a message can end up in an SQS queue acting as a DLQ.

Figure 1

a. Failed to Process DynamoDB Streams Successfully

Processing and publishing messages from DynamoDB Streams in near real time is a well-known serverless pattern (refer to steps 1 and 2 of figure 1). In this scenario, we must configure an on-failure destination for the Router lambda, pointing to an SQS queue for failed messages, along with a maximum retry count for failed records. This ensures that messages in the same DynamoDB shard are not blocked by a poison message or because a downstream component or system is unavailable. A sketch of this configuration follows.
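As an illustration, the boto3 sketch below creates the stream event source mapping with a small retry count and an on-failure destination pointing at dlq-db-queue. The ARNs, function name, batch size, and retry/age values are placeholder assumptions.

```python
import boto3

lambda_client = boto3.client("lambda")

# Placeholder ARNs; in practice these come from your stack outputs
STREAM_ARN = "arn:aws:dynamodb:eu-west-1:111122223333:table/orders/stream/2021-10-24T00:00:00.000"
DLQ_ARN = "arn:aws:sqs:eu-west-1:111122223333:dlq-db-queue"

lambda_client.create_event_source_mapping(
    EventSourceArn=STREAM_ARN,
    FunctionName="router-lambda",  # hypothetical function name
    StartingPosition="LATEST",
    BatchSize=10,
    # Keep retries small so a poison record does not block its shard for long
    MaximumRetryAttempts=3,
    MaximumRecordAgeInSeconds=3600,
    BisectBatchOnFunctionError=True,
    # Failed batches are described (by metadata, not payload) to this SQS queue
    DestinationConfig={"OnFailure": {"Destination": DLQ_ARN}},
)
```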
Let’s assume a few messages end up in the SQS queue dlq-db-queue because the Router lambda was unable to process or deliver them downstream after retrying “X” times. In such a scenario, one useful approach is to have a CloudWatch event (refer to step 6), either scheduled (say every 4 hours) or on demand. This triggers the Router lambda to read messages from dlq-db-queue, use the metadata they carry to retrieve the original records from the DynamoDB stream, and re-run the processing logic. Set the message retention on the queue appropriately (<= 24 hours), so that after a few scheduled runs (every 4 hours, say), when we are sure delivery isn’t possible, messages are deleted from the queue automatically. Alternatively, the CloudWatch event can pass “action=delete” for an on-demand manual run, so that custom logic in the lambda takes that as an instruction to delete the messages. The sketch below illustrates this replay logic.
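A rough sketch of that replay handler follows, assuming the on-failure destination payload carries the stream position metadata (DDBStreamBatchInfo) and that the CloudWatch event passes an “action” field as described above. The queue URL and the process_records hook are hypothetical.

```python
import json
import boto3

sqs = boto3.client("sqs")
streams = boto3.client("dynamodbstreams")

DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/111122223333/dlq-db-queue"  # hypothetical


def handler(event, context):
    """Invoked by the scheduled/on-demand CloudWatch event, e.g. {"action": "replay"}."""
    action = event.get("action", "replay")

    while True:
        resp = sqs.receive_message(
            QueueUrl=DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=2
        )
        messages = resp.get("Messages", [])
        if not messages:
            break

        for msg in messages:
            if action == "delete":
                sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
                continue

            # The on-failure destination stores the failed batch's stream position,
            # not the records themselves, so re-read them from the stream.
            info = json.loads(msg["Body"])["DDBStreamBatchInfo"]
            iterator = streams.get_shard_iterator(
                StreamArn=info["streamArn"],
                ShardId=info["shardId"],
                ShardIteratorType="AT_SEQUENCE_NUMBER",
                SequenceNumber=info["startSequenceNumber"],
            )["ShardIterator"]
            records = streams.get_records(ShardIterator=iterator)["Records"]

            process_records(records)  # hypothetical hook into the Router lambda's logic

            # Only remove the message from the DLQ once the replay succeeded
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])


def process_records(records):
    ...  # the existing routing/publishing logic would be shared or invoked here
```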

b. Failed to Process an SQS Queue Message Successfully

Yet another common pattern for processing SQS messages is demonstrated by step 3, which receives a message from the ‘Queue’ and attempts to process and publish it to the downstream systems (Applications A and B). If there is an error during processing or publishing, then after the configured number of retries on the SQS queue, the message automatically moves to the corresponding DLQ (dlq-queue). The replay strategy here is similar to the one discussed in the previous section: a CloudWatch event (scheduled or on-demand) passes a JSON payload to the Queue-receiver lambda, which reads the messages from dlq-queue and puts them back on the main queue, re-triggering the usual flow as if each were a new message. This approach is more straightforward than the replay mechanism discussed for failed DynamoDB stream messages; a sketch follows.
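A minimal sketch of that Queue-receiver replay path, under the same assumptions about the “action” payload and with hypothetical queue URLs, could look like this:

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URLs for illustration
DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/111122223333/dlq-queue"
MAIN_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/111122223333/queue"


def handler(event, context):
    """Drains dlq-queue back onto the main queue, or deletes messages if asked."""
    action = event.get("action", "replay")

    while True:
        resp = sqs.receive_message(
            QueueUrl=DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=2
        )
        messages = resp.get("Messages", [])
        if not messages:
            break

        for msg in messages:
            if action == "replay":
                # Re-publishing the original body re-triggers the normal flow
                sqs.send_message(QueueUrl=MAIN_QUEUE_URL, MessageBody=msg["Body"])
            # In both cases the message is removed from the DLQ
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
```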

V. 3 Key Takeaways

  1. The lambda receiving events/messages from DynamoDB Streams should have minimal responsibility: mostly conditional forwarding logic to a queue, which is more reliable and provides more benefits. A poison message in a stream can block all other messages in the same shard, so do configure a retry count that isn’t very high but still gives enough time to ride out temporary glitches in downstream systems.
  2. The message retention of a DynamoDB stream is 24 hours, so the CloudWatch-based replay strategy above is good for 24 hours only. The retention of the corresponding SQS queue acting as the DLQ should therefore also be kept at 1 day, so old messages are deleted in time to avoid unnecessary retries. If your use case requires a retry window longer than a day, I would advise extending the pattern I shared: reconstruct the message from the stream and put it on another queue carrying the actual message payload (not the original DynamoDB stream record with its metadata), so that messages can be held and retried for a longer period. The trade-off is an additional queue and more custom lambda logic to write and maintain.
  3. Use one of two strategies to delete messages from the DLQ: either set an appropriate message retention period (the default is 4 days at the time of writing), or, as discussed in this blog, design the lambda to handle both “replay” and “delete” action requests. This avoids creating and running one-off operations scripts whenever the queue needs clearing; tested, repeatable lambda logic to clear the queue is safer than an ad-hoc script. A minimal example of setting the retention period is shown after this list.
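For the first strategy, setting the DLQ’s retention to one day is a one-line queue attribute change; the sketch below uses boto3 with a hypothetical queue URL.

```python
import boto3

sqs = boto3.client("sqs")

# 86400 seconds = 1 day, matching the 24-hour DynamoDB stream retention;
# the queue URL is a placeholder
sqs.set_queue_attributes(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/111122223333/dlq-db-queue",
    Attributes={"MessageRetentionPeriod": "86400"},
)
```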



