“Without you logging, I will be lost in this distributed universe
- love: serverless architecture”
Serverless architectures are the current trend, pushing to abstract application design and development away from infrastructure concerns. I have touched upon the numerous benefits of serverless designs in my previous posts. But as with all good things in life, nothing comes without its own set of challenges, and logging in the serverless world is one such complex challenge. This article provides guidelines and an efficient approach to distributed logging for serverless architectures.
3 Reasons - Why logging
- Well-designed serverless architectures tend to use a combination of synchronous, asynchronous, and event-driven patterns. When these systems operate at scale, you have hundreds of events firing. In such scenarios, identifying and logging the right amount of information from individual components is essential to building meaningful metrics and monitoring the health and performance of the system as a whole. For example, let’s assume one did an excellent job and designed a perfect architecture where individual components (say, AWS Lambda functions) providing critical functionality also have a fallback mechanism triggered via a circuit breaker pattern. One surely wants to be notified of how long the fallback mechanism was in action and how soon the system recovered. I am sharing this example to highlight that meaningful logging is essential even when the system appears to be working fine: in reality, some form of graceful degradation has happened, and it won’t necessarily be noticeable for hours or days. Good logging (with alerting) lets us learn about this behavior in near real-time.
- Many enterprises have adopted a multi-cloud strategy, which allows them to overcome the limits of redundancy, scale, cost, and features of any single cloud provider. Running in multiple clouds requires centralized visibility and control, and a distributed logging design using serverless components across the various public clouds is fundamental to achieving that centralized view.
- Logs are one of the 3 pillars of observability (the other two being metrics and traces). Good logging design is key to making your systems more observable.
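The circuit-breaker scenario above can be sketched in a few lines. This is a minimal illustration, not production code: the logger and component names are hypothetical, and a real system would emit a matching event on recovery so the fallback duration can be computed downstream by a metric filter or alert.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")  # hypothetical service name


def record_fallback_event(component: str, state: str) -> str:
    """Emit a structured log line when the circuit breaker changes state.

    A downstream metric filter can alert on state == "fallback_active"
    and measure recovery time from the timestamps.
    """
    entry = json.dumps({
        "component": component,
        "event": "circuit_breaker",
        "state": state,          # e.g. "fallback_active" or "recovered"
        "timestamp": time.time(),
    })
    logger.warning(entry)
    return entry
```

Because the entry is JSON, the same line feeds both alerting (near real-time notification that degradation started) and later analysis of how long it lasted.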
3 tiers - Log Management Infrastructure Architecture
A good log management infrastructure architecture typically comprises the following three tiers:
- Log Generation
This tier contains the serverless components that generate the logs, for example AWS Lambda, API Gateway, CloudTrail, etc.
- Log Analysis and Storage
This tier contains the serverless components responsible for receiving log data, performing transformation (if required), and forwarding it in near real-time to a storage service. Example: Amazon CloudWatch, Kinesis Data Firehose, and an S3 bucket.
- Log Monitoring
This tier provides visualization tools to monitor and review log data and the results of automated analysis. For this tier, many enterprises prefer 3rd-party tools like Alert Logic, Splunk, Kibana, etc. The ELK stack is a good alternative to Splunk if you are using AWS as your only cloud provider.
3 guidelines - What to log
While there is no one-size-fits-all solution, as a guideline, log:
- Any Failures.
These typically include application and system errors (syntax and runtime errors, connectivity and performance issues), input/output validation failures (protocol violations, invalid parameters), and authentication and authorization failures.
- Selected Successful Events
You want to be careful here, as this is the area where you can log more than what is required. Depending upon the business and security requirements, one may want to log data like authentication successes.
- Statutory or regulatory activities
These must be identified and be proportionate to the business and security risks and threats, for example access to a restricted system or functionality.
3’s in action
Now that we understand the “Why” and “What” of logging, let’s move on to the “How” of designing a logging strategy.
- Several AWS services can publish logs directly to CloudWatch. For others, AWS CloudTrail collates and stores Application Programming Interface (API) data from AWS services within its scope and then forwards that API data to AWS CloudWatch, together with other metrics.
- AWS CloudWatch streams the log data into Kinesis Data Firehose. Metrics are also derived from the CloudWatch data.
- Kinesis Data Firehose then pushes the data into an S3 bucket. For old data retained only for compliance purposes, it’s recommended to use an S3 lifecycle policy to change the storage class to Glacier.
If you are using a 3rd-party product like Splunk, a Lambda function configured with Firehose can wrap each record as a Splunk HEC event in JSON format and push it to Splunk.
- 3rd-party products can now access the AWS data via an IAM role: for example, Grafana can access CloudWatch metrics, and Alert Logic can access the data in S3.
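The Glacier transition mentioned in the pipeline above can be expressed as an S3 lifecycle configuration. Below is a minimal sketch that builds such a configuration in Python; the key prefix and day counts are illustrative assumptions, and in a real deployment you would pass the result to boto3’s `put_bucket_lifecycle_configuration`.

```python
def glacier_lifecycle_rule(days_to_glacier: int = 90,
                           expire_after_days: int = 2555) -> dict:
    """Build an S3 lifecycle configuration that archives old log objects.

    Objects under the (hypothetical) "logs/" prefix move to the GLACIER
    storage class after `days_to_glacier` days and are deleted after
    `expire_after_days` days (~7 years here, a common retention period).
    """
    return {
        "Rules": [{
            "ID": "archive-old-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},  # illustrative key prefix
            "Transitions": [{
                "Days": days_to_glacier,
                "StorageClass": "GLACIER",
            }],
            "Expiration": {"Days": expire_after_days},
        }]
    }
```

Keeping the rule as data like this makes it easy to review the retention policy alongside the rest of your infrastructure code.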
The above is just one of several ways of making AWS logs and metrics available for monitoring. As stated in my earlier blog on integration, choosing an appropriate integration style should ideally be driven by the existing solution landscape.
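For the Splunk path described above, the Lambda function attached to Firehose follows the Kinesis Data Firehose data-transformation contract: it receives base64-encoded records and must return each one with a `recordId`, a `result`, and re-encoded `data`. A minimal sketch (the `sourcetype` value is a placeholder, and a production function would also handle malformed records):

```python
import base64
import json
import time


def handler(event, context):
    """Firehose transformation Lambda: wrap each record as a Splunk HEC event.

    The input/output field names follow the Firehose data-transformation
    contract; Firehose then delivers the transformed records to the
    Splunk HTTP Event Collector endpoint.
    """
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8")
        hec_event = json.dumps({
            "time": time.time(),
            "sourcetype": "aws:lambda",  # placeholder sourcetype
            "event": raw,
        })
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(hec_event.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```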
- Use Correlation Ids
Correlation IDs tag every log message with the relevant context and make messages easier to find later on. As the system scales to hundreds of Lambdas, this becomes quite critical. It’s also a good idea to include the X-Ray trace ID in every log message; one can use it to quickly load the relevant trace in the X-Ray console. This approach even provides tracing across multiple clouds, as you have a correlation ID to relate an entire thread of messages across multiple cloud components.
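A minimal sketch of tagging log entries this way follows. The field names are my own; the one real detail is that Lambda exposes the current X-Ray trace ID through the `_X_AMZN_TRACE_ID` environment variable. In a real system the correlation ID is taken from the incoming event and propagated to downstream calls; here we generate one only if it is missing.

```python
import os
import uuid


def build_log_entry(message: str, correlation_id: str = None) -> dict:
    """Attach a correlation ID (and the X-Ray trace ID, if present) to a log entry.

    Pass the correlation ID from the incoming request/event; a fresh UUID
    is generated only for the first component in the chain.
    """
    return {
        "message": message,
        "correlation_id": correlation_id or str(uuid.uuid4()),
        # Lambda sets this environment variable when X-Ray tracing is active.
        "xray_trace_id": os.environ.get("_X_AMZN_TRACE_ID", "unavailable"),
    }
```

Every component in the call chain reuses the same `correlation_id`, so one filter query pulls up the entire thread of messages regardless of which cloud or service emitted them.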
- Write Informative Structured Logs
Write your logs as structured JSON from the start. CloudWatch Logs (and other 3rd-party tools) understands and parses JSON, which gives you a lot of power to filter and analyze the logs.
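A minimal way to do this with Python’s standard logging module is a custom formatter that renders each record as one JSON line (the field names here are illustrative; add whatever context your filters need):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line so CloudWatch Logs
    (and most 3rd-party tools) can parse and filter on its fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


# Attach the formatter to a handler as usual:
stream_handler = logging.StreamHandler()
stream_handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")  # hypothetical logger name
logger.addHandler(stream_handler)
```

With this in place, a CloudWatch Logs filter like `{ $.level = "ERROR" }` works immediately, with no regex parsing of free-form text.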
- Be mindful of excessive logging
Serverless architectures are prone to creeping costs around logging and storage, so don’t blindly log everything. It all adds up pretty quickly when your production load increases, suddenly or over time. For example, data ingestion into CloudWatch costs $0.50 per GB, so a busy system generating 10GB of logs per day is already looking at a monthly cost of $150 from data ingestion alone. It’s also a good idea to archive old logs to cheaper storage like S3 Glacier Deep Archive to reduce costs.
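The arithmetic behind that estimate is easy to sanity-check. The $0.50/GB figure is the price quoted above; check your region’s current pricing before budgeting.

```python
def monthly_ingestion_cost(gb_per_day: float, price_per_gb: float = 0.50,
                           days: int = 30) -> float:
    """Back-of-the-envelope monthly CloudWatch data-ingestion cost."""
    return gb_per_day * price_per_gb * days


# 10 GB/day at $0.50/GB over a 30-day month
print(monthly_ingestion_cost(10))  # prints 150.0
```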