Kubernetes the Right Way: Observability for Your Microservices
This article talks about observing a large distributed application deployed in a Kubernetes Cluster so that you can find issues faster.
Having proper observability gives you a window into the internal state of your application. Historically, even with monolithic applications, this was somewhat challenging to accomplish. The cause was not a lack of tools and libraries, but rather a lack of knowledge or, sometimes, a failure to prioritize it during development. With large microservices deployments, the problem has gotten worse, because you need correlated data across microservices to find problems quickly.
I am compiling this article based on my knowledge and experience in being a developer for a large microservices deployment as well as a team leader of the end user Observability features of the SaaS application itself. I will be focusing on the important aspects and giving tips to make the system more reliable as well.
Why Do We Need Observability?
Observability of a system is the ability to obtain data representing its internal state. Generally, there are three main types of observability data: metrics, traces, and logs. Each of the three adds a different dimension to the data being collected, and all of them are useful in any investigation into the system.
With a microservices architecture, this gets more complicated, because even a single action may involve four, five, or more services. When an action fails, the cause may be a failure in any one of these services, in several of them, or even in all of them. As the number of microservices in your system grows, finding the root cause of an issue becomes increasingly complex.
The good news is that most Observability vendors, such as New Relic, Datadog, and Dynatrace, already provide some combination of these capabilities in a single unified view. Nevertheless, let’s go in depth into the best practices themselves so that you know what to look for when you are choosing one of these vendors or implementing it yourself using open-source software.
Data Measurement
Collect Kubernetes Statuses and Events
One important aspect that is sometimes overlooked is capturing the Kubernetes pod and container statuses and the Kubernetes events. When you get notified of an issue, the Kubernetes cluster itself shows you only its current state. However, when you are trying to find the root cause of an issue, the historical data from the time the issue occurred is what you need. This is why it is important to capture this data at regular intervals.
Kubernetes pod and container statuses can tell you when pods were unavailable, or point to moments such as when a container was killed for reaching its memory limit. Kubernetes event objects serve a similar purpose: they capture event data generated by the various Kubernetes controllers. If you have run a kubectl describe command, you will have noticed these events at the bottom of the output. These events, however, get deleted after a while (generally after one hour). Storing both gives you a historical view of what occurred in the past.
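As a minimal sketch of what storing events could look like, the snippet below flattens an event list (for example, the JSON printed by `kubectl get events -o json`) into simple records that can be written to any store before the cluster discards the originals. The function name `flatten_events` and the sample data are hypothetical; the field names follow the Kubernetes core/v1 Event object.

```python
def flatten_events(event_list):
    """Turn a Kubernetes EventList (already parsed into a dict) into flat records."""
    records = []
    for item in event_list.get("items", []):
        involved = item.get("involvedObject", {})
        records.append({
            "timestamp": item.get("lastTimestamp"),
            "namespace": item.get("metadata", {}).get("namespace"),
            "kind": involved.get("kind"),
            "name": involved.get("name"),
            "type": item.get("type"),        # "Normal" or "Warning"
            "reason": item.get("reason"),
            "message": item.get("message"),
            "count": item.get("count", 1),   # how often the event repeated
        })
    return records


# Hypothetical sample, shaped like the output of `kubectl get events -o json`.
SAMPLE_EVENTS = {
    "items": [
        {
            "metadata": {"namespace": "shop"},
            "involvedObject": {"kind": "Pod", "name": "checkout-7d9f"},
            "type": "Warning",
            "reason": "BackOff",
            "message": "Back-off restarting failed container",
            "lastTimestamp": "2024-01-01T10:15:00Z",
            "count": 4,
        }
    ]
}
```

Running this on a schedule shorter than the event retention period, and appending the records to your storage of choice, is one way to keep the history.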
Collect CPU and Memory Usage, Requests, and Limits
When you are debugging an issue related to an OOM restart or a latency spike, you need to analyze both the used and the allocated memory and CPU. This requires the CPU and memory usage as well as the requests and limits you have set.
Of course, one could argue that the requests and limits can be found by reading the Kubernetes manifests. While this is true in most cases, you may still want to publish this data if it is subject to change, either by yourself or automatically by a Vertical Pod Autoscaler. Either way, even if you collect only the resource usage, you need to record it against the container ID so that you can examine the resource usage of each container separately.
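To compare usage against requests and limits, the quantity strings from the manifests first have to be turned into numbers. The following is a rough sketch under the assumption that only the common suffixes appear (`m` for CPU; `Ki`, `Mi`, `Gi`, `Ti`, `k`, `M`, `G`, `T` for memory). The helper names are hypothetical, and the full Kubernetes quantity grammar has more cases than are handled here.

```python
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_cpu(quantity):
    """Return CPU cores as a float: '500m' is half a core, '2' is two cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity):
    """Return bytes as an int for the common memory suffixes."""
    for suffixes in (_BINARY, _DECIMAL):
        for suffix, factor in suffixes.items():
            if quantity.endswith(suffix):
                return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # a plain number is already in bytes


def memory_utilization(usage_bytes, limit):
    """Fraction of the memory limit in use; values near 1.0 suggest OOM risk."""
    return usage_bytes / parse_memory(limit)
```

For example, a container using 64 MiB against a `128Mi` limit yields `memory_utilization(64 * 2**20, "128Mi")`, which is half of the limit. Storing that value per container ID makes it easy to see which replica was close to being killed.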
Add Container ID to Application Metrics
As a developer, you need to add Observability instrumentation to your applications. Frameworks such as OpenTelemetry even have automatic instrumentation capabilities, which expose some common metrics as well as metrics about method calls in popular libraries. Whichever way you choose to instrument, you will be able to add some contextual information.
When your applications are running in a Kubernetes cluster, it is important to include the container ID in this contextual information. Alternatively, you could add the pod name if you have single-container pods or if you have another attribute to differentiate the containers within a given pod. This allows you to identify which replica of a deployment a particular metric originated from. This is especially useful when you are trying to debug an intermittent issue that happens rarely and only in some of the replicas.
Moreover, this allows you to correlate container-level system metrics, such as resource utilization and pod statuses, with the application metrics time series. This ultimately helps in building one big picture with all the application and system metrics in one place, which you can view aggregated at any level or at the container level. This is especially useful when an issue has occurred in a single pod.
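The idea can be sketched without any particular metrics library. The example below assumes the pod name and container name are available to the process as the environment variables `POD_NAME` and `CONTAINER_NAME` (hypothetical names, for instance injected through the pod spec) and attaches them to every sample of a small, illustrative counter. `LabeledCounter` is not a real library class; an actual setup would use the label or attribute mechanism of whichever instrumentation library you chose.

```python
import os
from collections import Counter


def container_labels(environ=os.environ):
    """Read the identity of this replica from (assumed) environment variables."""
    return {
        "pod": environ.get("POD_NAME", "unknown"),
        "container": environ.get("CONTAINER_NAME", "unknown"),
    }


class LabeledCounter:
    """Illustrative counter that stamps every sample with fixed base labels."""

    def __init__(self, name, base_labels):
        self.name = name
        self.base_labels = dict(base_labels)
        self._values = Counter()

    def inc(self, amount=1, **labels):
        merged = {**self.base_labels, **labels}
        # A sorted tuple makes the label set usable as a dictionary key.
        self._values[tuple(sorted(merged.items()))] += amount

    def samples(self):
        return [(self.name, dict(key), value) for key, value in self._values.items()]
```

Because the pod and container labels are part of every series, the same query can be grouped by replica when hunting for an issue that affects only one pod, or summed across replicas for the overall view.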
Add Trace Context to Logs
The metrics we talked about above are generally aggregated by different attributes as well as by time buckets. Logs and traces, however, are generated for a specific request (even though traces might be sampled, as discussed in the section on propagating trace context below). This gives us a unique opportunity to correlate traces and logs, and we gain more insight when they work together.
At the same time, adding a trace ID (which is part of the trace context) to logs allows us to tell one request from another and to filter the logs for a single request at any given time. If you are sampling traces, it is important to add a trace context even to the requests that are not sampled, so that you can still identify the logs for each request separately even when no trace data is present for it.
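One way to do this with only the Python standard library is a logging filter that copies the current trace ID onto every log record. This is a sketch, not a replacement for a tracing library: the names `TraceIdFilter`, `start_request`, and `build_logger` are hypothetical, and the trace ID is generated here for every request, sampled or not, which mirrors the advice above.

```python
import contextvars
import io
import logging
import secrets

# Holds the trace ID of the request currently being handled.
_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through the handler."""

    def filter(self, record):
        record.trace_id = _trace_id.get()
        return True


def start_request(incoming_trace_id=None):
    """Use the caller's trace ID if present, otherwise create one (even if unsampled)."""
    trace_id = incoming_trace_id or secrets.token_hex(16)
    _trace_id.set(trace_id)
    return trace_id


def build_logger(stream):
    logger = logging.getLogger("trace_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    logger.handlers = [handler]
    return logger
```

After `start_request()` is called at the beginning of a request, every line written through the logger carries `trace_id=...`, so filtering the central log store by that value returns the logs of exactly one request.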
Propagate Trace Context Over the Wire
Another very important aspect of a distributed or microservices architecture is to have proper tracing that is coordinated across all your services involved with a request. To achieve this, you need to propagate the trace context and ensure the following in your tracing configuration.
- All services should use the received tracing contexts and keep the link going by passing the tracing data to all outgoing calls. This tracing context contains a trace ID and it helps in ensuring that all the data can be viewed together.
- The starting service at the edge needs to make the sampling decision itself and should discard any sampling decision received in the trace context. This is necessary to avoid an attack in which a malicious user sends trace contexts that force every request to be recorded, overloading your Observability infrastructure.
- Whatever the starting service decides, to record a request or not, all the other services need to follow. Otherwise, you will see completely disjoint pieces of traces all around your system.
All of this combined will give you traces spanning across the application end to end.
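The three rules above can be sketched with a small amount of code. The example assumes the trace context travels in a header shaped like the W3C Trace Context `traceparent` value (`00-<trace id>-<span id>-<flags>`, where the lowest flag bit means "sampled"); the article does not prescribe a format, so treat that choice, and the function names, as assumptions. In practice a tracing library would normally handle this for you.

```python
import random
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if missing or malformed."""
    match = _TRACEPARENT.match(header or "")
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def format_traceparent(trace_id, span_id, sampled):
    return "00-{}-{}-{}".format(trace_id, span_id, "01" if sampled else "00")


def edge_traceparent(incoming_header, sample_rate, rng=random.random):
    """Edge service: decide sampling locally and ignore the client-supplied flag."""
    parsed = parse_traceparent(incoming_header)
    trace_id = parsed[0] if parsed else secrets.token_hex(16)
    sampled = rng() < sample_rate  # the incoming sampled flag is deliberately unused
    return format_traceparent(trace_id, secrets.token_hex(8), sampled)


def downstream_traceparent(incoming_header):
    """Internal service: keep the trace ID and sampling decision, add its own span ID."""
    parsed = parse_traceparent(incoming_header)
    if parsed is None:
        raise ValueError("missing or malformed trace context from upstream service")
    trace_id, _parent_span_id, sampled = parsed
    return format_traceparent(trace_id, secrets.token_hex(8), sampled)
```

Here the edge keeps a well-formed incoming trace ID but overrides the sampling flag; starting a fresh trace at the edge instead would be an equally reasonable design. Every internal service calls `downstream_traceparent` on the header it received and sends the result with each outgoing call.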
Use Structured Logging in a Machine-Readable Consistent Format
Another important aspect to consider when you are working on a large microservices deployment is having a consistent, machine-readable logging format. The general pattern encouraged in Kubernetes is to write all logs to stdout and stderr so that they are recorded on each Kubernetes node in the corresponding logging directories. These logs are generally collected by a logs agent (such as Fluentd), parsed, and sent to a centralized log collection system to be analyzed as required.
As you can understand, without a consistent and machine-readable format, you may end up with a large number of different rules and queries to parse and analyze your logs, increasing the overhead on your team. Most loggers allow you to change the format and to attach context as key-value pairs. Use this from the start and enforce it across all microservices to avoid problems when you scale up.
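As an illustration, the following sketch emits one JSON object per log line using only the Python standard library. The field names (`ts`, `level`, `logger`, `message`) and the convention of passing key-value context through `extra={"context": {...}}` are assumptions made for this example; what matters is that every service agrees on the same ones.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with a fixed set of core fields."""

    def format(self, record):
        core = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        context = getattr(record, "context", {})
        # Core fields win if a context key happens to collide with them.
        return json.dumps({**context, **core}, sort_keys=True)


def build_json_logger(name, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as `log.info("order created", extra={"context": {"order_id": "o-123"}})` then produces a line that a log agent can parse with a single JSON rule, regardless of which service wrote it.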
Data Collection
Collect to a Centralized Storage/Service
After recording the Observability data, it needs to be collected and stored in a centralized storage of your choice. You would need to define how you would view them. Many Observability providers have their own views and dashboards with metrics, traces, logs, flame graphs, and many more, which you can sometimes modify as well. Based on what you and your teams are comfortable with, you can choose any one of them, as long as they are in a centralized/correlated environment.
Some providers may not have the capability to correlate between Observability data types (for example, logs to traces). However, you should at least be able to search across all the microservices and have a single view of your data. Then you can view the big picture without moving between several windows, where you may fail to see the patterns that are crucial to finding the insights you are looking for.
Buffer at Cluster Level if You Have Multiple K8s Clusters
While not everyone will operate at this scale, it is not that rare to have several K8s clusters running in a large-scale microservices deployment. Even without that, if you have a large number of pods within your cluster, it is better to have a buffering agent in the middle that buffers, and possibly aggregates, the data before sending it to the final storage.
This can reduce the load on the final storage and, at the same time, allows you to do some post-processing before the data moves to the final storage. The OpenTelemetry project offers an agent for this, the OpenTelemetry Collector, which can also receive and send data in multiple formats and would be ideal for this use case.
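To make the buffering and aggregation idea concrete, here is a toy sketch of what such an agent does internally. It is not the OpenTelemetry Collector and does not use its API; the class name, the `sink` callable, and the thresholds are all hypothetical. It sums counter increments per series and forwards one batch when either a point count or an age limit is reached.

```python
import time
from collections import defaultdict


class BufferingAggregator:
    """Buffer counter increments and forward one summed sample per series."""

    def __init__(self, sink, max_points=1000, max_age_seconds=10.0, clock=time.monotonic):
        self._sink = sink                # callable that receives a list of samples
        self._max_points = max_points
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer = defaultdict(float)
        self._points = 0
        self._opened_at = clock()

    def add(self, name, labels, value):
        self._buffer[(name, tuple(sorted(labels.items())))] += value
        self._points += 1
        too_many = self._points >= self._max_points
        too_old = self._clock() - self._opened_at >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._buffer:
            batch = [
                {"name": name, "labels": dict(labels), "value": value}
                for (name, labels), value in self._buffer.items()
            ]
            self._sink(batch)
        self._buffer.clear()
        self._points = 0
        self._opened_at = self._clock()
```

With this shape, a thousand increments of the same series arriving from many pods reach the final storage as a single write, which is the load reduction described above.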
With all these and a proper visualization and alerting system in place, you will be able to handle the issues in your system well. While handling a production incident successfully is out of the scope of this article, using these techniques and having Observability on your system will help you identify problems faster and solve them (I recently wrote an article on handling production incidents, which goes into this topic in depth).
At the same time, Observability can be one of the biggest costs in your production deployment. So try to have a balance in what you store, while ensuring that you have everything you need (If you are interested, read my article on Reducing the Cloud Observability Expenses and Performance Impact).
Hope you were able to learn something new that will help in improving your deployments. If you enjoyed and learned something from this article, keep an eye out for my next article on Kubernetes the Right Way.
If you liked this article and would love to learn more about best practices on implementing, deploying, and maintaining applications on Kubernetes, read my article series Kubernetes the Right Way.