Simplifying DevOps Monitoring with OpenTelemetry


As part of Solutions Review’s Premium Content Series – a collection of columns written by industry experts in mature software categories – Dotan Horovits provides an in-depth look at OpenTelemetry and how it can help simplify DevOps monitoring.


“Microservices” is the new standard for building products today. The approach brings many benefits in speeding up development, but it also adds complexity to monitoring. These systems are often polyglot, using multiple programming languages, each with its own way of being monitored. Additionally, today’s systems make extensive use of third-party tools and frameworks, which speed up development and let the engineering team focus on the business differentiators. These third parties can be open source projects, proprietary tools, or cloud services. And while we didn’t write these frameworks, we still need to monitor them effectively to achieve end-to-end observability across our system.

Observability is the paradigm that allows us to monitor and understand our systems. The formal definition, borrowed from control theory, describes it as “a measure of how well internal states of a system can be inferred from knowledge of its external outputs.”

In this article, I want to take a look at the different types of observability data we need to collect and how the collection process can be simplified using the open source project OpenTelemetry, a fascinating project from the Cloud Native Computing Foundation.

OpenTelemetry: DevOps monitoring simplified.

Observability signal types

Each system emits different telemetry signals that help us understand its internal state. The classic signals are logs, metrics, and traces, often referred to as “the three pillars of observability.” Logs and metrics have been around for decades, while distributed tracing is relatively younger but gaining momentum. Other signal types, such as continuous profiling, are emerging, but these are much earlier in the adoption cycle. Also, the observability data comes from a variety of data sources, both in the application and in the infrastructure. A typical system may consist of a Python frontend, a Java backend, some databases, and a messaging service like Kafka (or its cloud service equivalents), each of which emits its own logs, metrics, and other telemetry that has to be ingested.

Collecting such heterogeneous data in a consistent manner has been a challenge for many years. Proprietary vendor solutions have not been able to keep up with the ever-expanding ecosystem of third-party tools and advances in programming languages. Additionally, each tool and vendor had its own proprietary APIs and SDKs to instrument the application code, proprietary agents to collect and process the data, and a proprietary protocol to transmit the telemetry to the analytics backend. This effectively created data silos that prevented full observability. We had to take a different approach and treat observability as a data analytics problem.

Unified data collection with OpenTelemetry

The open source community that brought us Kubernetes and the cloud native ecosystem has also provided open source tools and standards to monitor them. An important project under the Cloud Native Computing Foundation (CNCF) is OpenTelemetry, an observability framework that helps generate and collect telemetry data from cloud-native software.

Let’s take a closer look at what the OpenTelemetry framework offers us. For each programming language, OpenTelemetry provides a single API and SDK (i.e. client library) to instrument the application. It also provides a unified Collector that can collect telemetry data from multiple sources, whether application or infrastructure components, over different protocols. The Collector then processes the telemetry data and exports it to any backend observability analytics tool or downstream system via multiple protocols. It is important to note that the OpenTelemetry project itself does not provide an analytics backend. Last but not least, OpenTelemetry offers a unified protocol, OTLP, for transferring logs, metrics, and trace data.

OpenTelemetry is not just another open source project. In fact, it is the most active project in the entire CNCF after Kubernetes. All the monitoring and observability vendors, as well as the major cloud providers, have started to join OpenTelemetry, even at the cost of retiring their lucrative proprietary agents. This is a positive sign that OpenTelemetry is on the way to becoming the new de facto standard for telemetry data collection. Gartner’s most recent Hype Cycle for Emerging Technologies (2022) even listed OpenTelemetry as an Innovation Trigger and estimates that it will reach the plateau of the Hype Cycle within 2-5 years.

OpenTelemetry is a relatively young project, but it is already generally available for use with distributed tracing telemetry data and is in the release candidate stage (the step before general availability) for use with metrics data. The least mature signal is logs, which is not expected to reach general availability before 2023. The OpenTelemetry roadmap also looks at additional signals beyond the “three pillars,” with continuous profiling being the first.

Getting started with OpenTelemetry

OpenTelemetry is not a single monolithic project, but a collection of projects run by multiple working groups. This can make starting with OpenTelemetry quite confusing. Knowing your tech stack is important when getting started with OpenTelemetry.

Start with these four questions, which will guide you to the components relevant to your system:

Which programming languages?

What programming languages do you use? This will determine which OpenTelemetry APIs and SDKs are relevant to you, and possibly also which auto-instrumentation agents. Take it a step further and determine the programming frameworks you use with each language to see what integrations exist for them.

Which signal types and protocols?

Next, determine which observability signals you want to collect, which in turn determines the relevant Collector receivers you would use. Start by finding out which of the signal types (traces, metrics, and logs) are of interest to you. Then see which telemetry protocols you need to support. This is especially important in brownfield projects, where existing and potentially obsolete components already emit telemetry in specific protocols that you must accommodate.
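For illustration, a Collector configuration declares one receiver per supported protocol. The snippet below is a sketch: the receiver names and the OTLP default ports follow the Collector's documented conventions, but you should verify them against the Collector distribution you deploy.

```yaml
# Sketch of a Collector "receivers" section covering several telemetry
# protocols. Endpoints shown are conventional defaults; adjust as needed.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:                       # legacy tracing protocol support
    protocols:
      thrift_http:
  prometheus:                   # scrape existing Prometheus endpoints
    config:
      scrape_configs:
        - job_name: "app-metrics"
          static_configs:
            - targets: ["localhost:8888"]
```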

Which infrastructure components?

Next, list the sources of the signals, i.e. the components you are monitoring. Many infrastructure components have their own formats, and dedicated receivers exist to accommodate them. This applies to open source tools such as Kubernetes, Kafka, MySQL, and HTTPD; it also applies to cloud services like AWS X-Ray and GCP Pub/Sub, and even to existing telemetry collectors like CollectD or StatsD.
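As a sketch of what such component-specific receivers look like, the snippet below uses receivers from the Collector's "contrib" distribution. The field values (endpoints, credentials, intervals) are illustrative assumptions; check each receiver's documentation for the exact options it supports.

```yaml
# Sketch: receivers for specific infrastructure components (from the
# Collector "contrib" distribution). Values are illustrative placeholders.
receivers:
  mysql:
    endpoint: localhost:3306
    username: otel
    password: ${env:MYSQL_PASSWORD}   # read secret from the environment
    collection_interval: 30s
  kafkametrics:
    brokers: [localhost:9092]
    scrapers: [brokers, topics, consumers]
  statsd:                             # ingest from an existing StatsD setup
    endpoint: 0.0.0.0:8125
```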

Which backend analytics tools?

Lastly, you need to define which tool stack you want to use to run analytics on your telemetry data. It can be an open source tool, a proprietary tool, or a cloud service; it could also be another downstream system that receives and processes the data. This will help you determine the relevant Collector exporters you would use.
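Tying the answers together, exporters are wired to receivers through the Collector's service pipelines. The sketch below assumes an otlp receiver is defined elsewhere in the same file; the backend endpoint is a hypothetical placeholder to be replaced with your analytics tool's ingestion URL.

```yaml
# Sketch: exporting telemetry to a backend via pipelines.
exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com   # placeholder URL

processors:
  batch: {}        # batch telemetry before export to reduce overhead

service:
  pipelines:
    traces:
      receivers: [otlp]       # assumes an otlp receiver is defined above
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Because exporters are just another pluggable section, switching analytics backends later means changing this one block rather than re-instrumenting the application.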

Final note

OpenTelemetry is a young but promising project that aims to become the de facto standard. It also carries the promise of unified observability on the data generation and collection side. Having this project under the wing of the Cloud Native Computing Foundation, alongside Kubernetes, Prometheus, and other leading projects in this space, will further facilitate collaboration to ensure compatibility across the stack.

Dotan Horovits