Adventures in Monitoring and Observability in Clouds

This is part one of a two-part blog series. In the next post, we will describe how we do observability and monitoring at Public Cloud Managed Services for our customers as part of our Managed Services offering.

By Chetan Goswami, DevOps Engineer

10.02.2022

The Prelude

Our customers depend more and more on distributed architectures to consume application services. Infrastructure and applications have steadily diversified in terms of where they are hosted and how they communicate. This trend has accelerated in the last few years, thanks to the great strides made in the ease of adopting clouds. On one end of the spectrum, customers choose the location and category of their resources based on prices, features and compliance requirements within the same public cloud vendor; on the other end, they turn towards multi-cloud or hybrid cloud architectures to drive their digital transformation, enhance agility, switch from CapEx to OpEx, etc.

These trends have prompted advances in both observability and monitoring. But what exactly is monitoring, and what is observability? Yes, they are very different, and it annoys me when I hear or read that monitoring is being replaced by observability. It is not; in fact, observability enhances and amplifies monitoring. Let me explain a bit.

The Contradistinction

One of the best explanations about monitoring and observability I’ve read was provided by Morgan Willis, a Senior Cloud Technologist at AWS.

“Monitoring is the act of collecting data. What types of data we collect, what we do with the data, and if that data is readily analyzed or available is a different story. This is where observability comes into play. Observability is not a verb, it’s not something you do. Instead, observability is more of a property of a system.”

According to this explanation, tools such as CloudWatch, Azure Monitor, X-Ray and App Insights can be classified as monitoring or tracing tools. They enable us to collect logs and metrics about our system and send alerts about errors and incidents. Monitoring is therefore the active part: collecting the data that helps us assess the health of our system and how its different components work together. Once we establish monitoring that continuously collects logs, system outputs, metrics, and traces, our system becomes observable.

My understanding of what monitoring means has evolved continuously during my career; currently I like to think of monitoring as the data ingestion part of ETL (extract, transform, load). That is, you gather data from multiple sources (logs, traces, metrics) and put it into a data store. Once all this data is available, a skilled analyst can gain insights from it and build beautiful dashboards that tell the story the data conveys. That’s the observability part: gaining insights from collected data. Observability platforms such as Elasticsearch with Kibana play the role of the skilled analyst. They provide you with visualizations and insights about the health of your system.
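
To make the ETL analogy a little more concrete, here is a minimal Python sketch. It assumes a plain in-memory list as the data store, and the names `TelemetryRecord`, `ingest` and `error_rate_by_source` are invented for this illustration; they do not belong to any real monitoring tool.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TelemetryRecord:
    """One ingested item; the field names are assumptions for this sketch."""
    kind: str      # "log", "metric" or "trace"
    source: str    # the service that emitted the record
    payload: dict = field(default_factory=dict)


def ingest(store: list, records) -> None:
    """Monitoring step: gather records from many sources into one store."""
    for record in records:
        store.append(record)


def error_rate_by_source(store: list) -> dict:
    """Analysis step: derive an insight (share of error logs per source)."""
    totals, errors = Counter(), Counter()
    for record in store:
        if record.kind != "log":
            continue
        totals[record.source] += 1
        if record.payload.get("level") == "error":
            errors[record.source] += 1
    return {source: errors[source] / totals[source] for source in totals}


store = []
ingest(store, [
    TelemetryRecord("log", "checkout", {"level": "error", "msg": "timeout"}),
    TelemetryRecord("log", "checkout", {"level": "info", "msg": "ok"}),
    TelemetryRecord("metric", "checkout", {"cpu": 0.71}),
    TelemetryRecord("log", "search", {"level": "info", "msg": "ok"}),
])
print(error_rate_by_source(store))  # {'checkout': 0.5, 'search': 0.0}
```

The `ingest` function stands in for monitoring, while `error_rate_by_source` stands in for the analyst who turns the stored data into an insight.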

The anatomy of Observability

Observability, which originated from control theory, measures how well you can understand a system’s internal states from its external outputs. Observability uses instrumentation to provide insights that aid monitoring. In other words, monitoring is what you do after a system is observable. Without some level of observability, monitoring is impossible.

Figure: This is telemetry, not observability.

Very often, I notice the tendency to confuse observability with telemetry. Or, at least, with loosely integrated UIs built on top of telemetry silos as shown in the diagram above. In this formulation, observability is somehow explained as the mere coexistence of a metrics tool, a logging tool, and a tracing tool.

In short: do not mistake the coexistence of metrics, tracing/APM and logging for “Observability.”

Like most bad ideas that gain some momentum, there is a grain of truth here: in particular, that the traces, metrics, and logs all have a place in the solution. But they are not the product; they are just the raw data. The telemetry.

Figure: The anatomy of observability.

The first layer is telemetry; we cannot have observability without the raw telemetry data. Depending on the infrastructure and available resources, there are many tools to choose from for gathering the raw telemetry. The primary focus in the first layer should always be to gain access to high-quality telemetry.

The second layer is storage; not just how we store the data, but for how long too. When considering data stores for observability, finding the right balance can be a predicament. Fundamentally, if we want to handle high-throughput data efficiently (for example, accounting for 100% of all messages passed in a scaled-out app, or even taking high-fidelity infra measurements like CPU load or memory use per container), we must record statistics to a time-series database. Otherwise, we waste too much on the transfer and storage of individual events. And while some might suggest sampling the events, sampling can miss low-frequency data hidden within the high-frequency firehose altogether. This situation calls for a dedicated Time Series DB (TSDB): a data store designed specifically for the storage, indexing, and querying of time-series statistics like these.
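
The trade-off between recording statistics and sampling raw events can be sketched in a few lines of Python. This is a toy illustration on synthetic data, not a real TSDB: the event shape, the single rare error and the helper names `aggregate_per_second` and `sample_every_nth` are all assumptions made for the example.

```python
from collections import defaultdict

# Synthetic firehose: 10 seconds x 1000 requests, with a single rare error.
events = [
    {"ts": second, "status": "error" if (second, i) == (7, 123) else "ok"}
    for second in range(10)
    for i in range(1000)
]


def aggregate_per_second(events):
    """Record statistics: one counter per (second, status) instead of raw events."""
    series = defaultdict(int)
    for event in events:
        series[(event["ts"], event["status"])] += 1
    return dict(series)


def sample_every_nth(events, n):
    """Keep 1 in n raw events (deterministic here so the result is repeatable)."""
    return events[::n]


series = aggregate_per_second(events)
sampled = sample_every_nth(events, 100)

errors_in_series = sum(
    count for (_, status), count in series.items() if status == "error"
)
errors_in_sample = sum(1 for event in sampled if event["status"] == "error")

print(len(events), "raw events ->", len(series), "data points")
print("errors seen by aggregation:", errors_in_series)
print("errors seen by 1-in-100 sampling:", errors_in_sample)
```

In this run, the aggregated counters shrink 10,000 events to 11 data points and still show the one error, while the 1-in-100 sample contains no error at all.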

And yet! If we want to handle high-cardinality data (for example per-customer tags, unique IDs for ephemeral infrastructure, or URL fragments), a TSDB is an unmitigated disaster. With the explosion of tag cardinality comes an explosion of unique time series, and with it an explosion of cost. And so, there must be a Transaction DB as well; traditionally this was a logging database, although it’s wiser to build around a distributed-tracing-native Transaction DB (more on this later) that can kill two birds (logs and traces) with one stone.
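
A rough back-of-the-envelope calculation shows why cardinality hurts. The sketch below assumes the worst case, in which every combination of tag values actually occurs, so the result is an upper bound; the tag names and counts are made up for illustration.

```python
from math import prod


def series_count(tag_cardinalities: dict) -> int:
    """Upper bound on unique time series for one metric:
    the product of the number of distinct values per tag."""
    return prod(tag_cardinalities.values())


# Hypothetical tag sets for a single metric.
low = {"region": 3, "service": 20, "status_code": 5}
high = {**low, "customer_id": 10_000}

print(series_count(low))   # 300
print(series_count(high))  # 3000000
```

Adding one per-customer tag multiplies the number of potential series by the number of customers, and this is exactly the kind of data that fits a Transaction DB better.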

Still, finding state-of-the-art Transaction and Time Series Databases is necessary but not sufficient. To make the actual “Observability” piece seamless, the data layer needs to be integrated and cross-referenced as well, preferably deeply.
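
One way to picture such cross-referencing is to let each aggregated data point carry a few example trace IDs that key into the transaction store. The Python sketch below uses two in-memory dictionaries as stand-ins for the two databases; the metric name, trace IDs and the helper `traces_behind_spike` are hypothetical and only illustrate the idea.

```python
# Hypothetical in-memory stand-ins for the two data stores.
time_series = {
    # (metric, minute) -> aggregated value plus a few example trace ids
    ("checkout_latency_p99_ms", 0): {"value": 180, "trace_ids": ["t-01"]},
    ("checkout_latency_p99_ms", 1): {"value": 2400, "trace_ids": ["t-17", "t-18"]},
}
transactions = {
    "t-01": {"customer": "acme", "spans": ["gateway", "checkout"]},
    "t-17": {"customer": "globex", "spans": ["gateway", "checkout", "payment-db"]},
    "t-18": {"customer": "globex", "spans": ["gateway", "checkout", "payment-db"]},
}


def traces_behind_spike(metric: str, threshold: float) -> list:
    """Follow the cross-reference from anomalous data points to full transactions."""
    found = []
    for (name, _minute), point in sorted(time_series.items()):
        if name == metric and point["value"] > threshold:
            found.extend(transactions[tid] for tid in point["trace_ids"])
    return found


for trace in traces_behind_spike("checkout_latency_p99_ms", threshold=1000):
    print(trace["customer"], "->", trace["spans"][-1])
```

Here the latency spike in the time series leads straight to the two slow transactions, including the high-cardinality detail (the customer) that the time series itself does not store.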

The challenges above can make observability difficult, and at times it may feel elusive. This brings us to the third layer, the actual benefits. In the product management realm, they would simply be called the business outcomes, and they are an essential part of the value proposition canvas when selling observability and monitoring to our customers.

At the end of the day, telemetry, whether in motion or at rest, is not intrinsically valuable. It’s only the workflows and applications built on top that can be valuable. Yet in the conventional presentation of “Observability as Metrics, Logs and Traces,” we don’t even know what problem we’re solving! Much less how we’re solving it.

When it comes to modern, distributed software applications, there are two overarching problems worth solving with Observability:

Understanding Health: Connecting the well-being of a subsystem back to the goals of the overarching application and business via thoughtful monitoring.
Understanding Change: Accelerating planned changes while mitigating the effects of unplanned changes.

As a final word: monitoring and observability go hand in hand. One does not replace the other; together they enable and enhance defined business outcomes.

For more information, visit Swisscom Public Cloud Services. You can also reach out to our experts to help get your cloud solutions off the ground.
