Monitoring and Observability in Action

This is part two of a two-part blog series; the first part explains how monitoring and observability go hand in hand. In this post we describe how we do observability and monitoring at Public Cloud Managed Services (PCMS) for our customers as part of our Managed Services offering.

By Chetan Goswami, DevOps Engineer

21 February 2022

How we are doing observability and monitoring at PCMS

Right off the bat, I want to say that the landscape of our customers' infrastructure and applications is continuously evolving. As the number and diversity of our customers are growing, so is the product-market fit. Our challenges include onboarding new customers, which almost always translates to figuring out how we will monitor new cloud services and applications, at times in cloud-native and at times in hybrid environments.

A picture speaks a thousand words: the diagram above details the observability and monitoring landscape at PCMS for Azure (we also have AWS customers, but that's a blog post for another day soon).

We have the first layer, telemetry, in the top right-hand side. This includes the data sources of our public cloud customers, such as the different types of logs of PaaS and SaaS cloud resources, e.g. activity logs or sign-in logs gathered using Event Hubs, and their metrics gathered from Azure Monitor. In addition, the IaaS resources send their metrics and logs directly to our Elastic stack using Beats agents. We also grab the administrative and activity logs of the O365 and M365 cloud services to provide observability and monitoring solutions for the cloud workplace domain. Last but not least is the upcoming trend of hybrid scenarios, where our customers connect their multiple clouds and on-premises infrastructure. Here we gather logs and metrics of certain resources, at times directly and at times via the public cloud route; the latter is the case when customers use public cloud services like Azure Arc and AWS Outposts.

The second layer is data storage. Before we go into that, though, it is important to understand data ingestion. We have two data ingestion endpoints, which can be broken down simply into pull and push modes. In pull mode, our Metricbeat and Filebeat agents run on Kubernetes infrastructure and pull metrics, logs, and traces from the public clouds. This primarily concerns PaaS and SaaS resources, the exception being the activity logs, which apply to all types of resources. In push mode, the second ingestion endpoint is our Elastic ingest nodes, which reside directly in our Elastic stack. The agents running on IaaS resources, or sometimes on compute-based cloud appliances, send their metrics, logs, and traces directly to our Elastic stack. Once the data is ingested, it is stored in Lucene indices in the backend. The retention of data depends on the data type and the business use case. It nonetheless remains configurable per dataset for customers with specific retention requirements.
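The routing and retention rules in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names, the default retention values, and the reading that activity logs are always pulled are hypothetical and not taken from the PCMS setup.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class IngestMode(Enum):
    PULL = "pull"  # Beats agents on Kubernetes fetch the data from the public cloud
    PUSH = "push"  # agents on the resource send the data to the Elastic ingest nodes


@dataclass(frozen=True)
class Source:
    name: str           # dataset name, e.g. "vm-01-syslog" (illustrative)
    resource_type: str  # "IaaS", "PaaS" or "SaaS"
    data_type: str      # "metrics", "logs", "traces" or "activity_logs"


def ingest_mode(source: Source) -> IngestMode:
    """Pick the ingestion path for a source, following the text above."""
    # Assumption: activity logs are pulled for every resource type.
    if source.data_type == "activity_logs":
        return IngestMode.PULL
    # IaaS resources push through their own agents.
    if source.resource_type == "IaaS":
        return IngestMode.PUSH
    # PaaS and SaaS resources are pulled.
    return IngestMode.PULL


# Hypothetical defaults in days; the article does not state real values.
DEFAULT_RETENTION_DAYS = {"metrics": 30, "logs": 90, "traces": 14, "activity_logs": 365}


def retention_days(source: Source, overrides: Optional[Dict[str, int]] = None) -> int:
    """Default retention by data type, overridable per dataset name."""
    overrides = overrides or {}
    return overrides.get(source.name, DEFAULT_RETENTION_DAYS[source.data_type])
```

For example, `retention_days(Source("vm-01-syslog", "IaaS", "logs"), {"vm-01-syslog": 400})` returns the customer-specific value of 400 days instead of the assumed default for logs.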

The third and final layer is the actual benefits; this is where the magic happens. The first part is monitoring the health of individual components, which is derived from monitoring profiles. A typical monitoring profile contains more than one of the following components:

Alerts based on metrics, logs, and traces
Machine Learning jobs, anomaly detection jobs based on datasets
SIEM, security events based on incoming dataset
Dashboards and visualizations, based on the incoming datasets
Saved Searches, concerning the incoming datasets
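A monitoring profile as listed above could be modelled as a small data structure. The sketch below is an assumption about how such a profile might look; the field names and the example profile are hypothetical and do not describe the actual PCMS implementation.

```python
from dataclasses import dataclass, field
from typing import List

# The five component kinds listed above, as attribute names.
COMPONENT_KINDS = ("alerts", "ml_jobs", "siem_rules", "dashboards", "saved_searches")


@dataclass
class MonitoringProfile:
    """Groups the monitoring components attached to one type of resource."""
    resource_type: str
    alerts: List[str] = field(default_factory=list)
    ml_jobs: List[str] = field(default_factory=list)
    siem_rules: List[str] = field(default_factory=list)
    dashboards: List[str] = field(default_factory=list)
    saved_searches: List[str] = field(default_factory=list)

    def component_kinds(self) -> List[str]:
        """Names of the component kinds this profile actually uses."""
        return [kind for kind in COMPONENT_KINDS if getattr(self, kind)]


# Hypothetical example: a virtual machine profile with one alert and one dashboard.
vm_profile = MonitoringProfile(
    resource_type="virtual_machine",
    alerts=["cpu_above_90_percent"],
    dashboards=["vm_overview"],
)
```

Here `vm_profile.component_kinds()` lists the two kinds in use, which matches the statement that a typical profile combines more than one component.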

Another benefit of our observability and monitoring stack is enabling the operations teams to understand changes. With the ability to directly query the underlying datasets, composed of all sorts of logs and metrics, the operations teams can answer the question “What caused the change?” The typical causes can be summed up as:

Service deployment
Configuration push
Workload change
A broken cloud dependency

Or “What is the impact of that change?” This can be summed up as:

Customer experience
Service health and performance (aka SLOs)
Resource consumption and costs
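The first of these questions, “What caused the change?”, often comes down to looking for change events shortly before a symptom appeared. The following is a minimal sketch of that idea over in-memory records; the event layout, the kind names, and the 30-minute lookback window are assumptions for illustration, and a real setup would run an equivalent query against the stored datasets.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical labels for the four kinds of change listed above.
CHANGE_KINDS = {
    "service_deployment",
    "configuration_push",
    "workload_change",
    "cloud_dependency",
}


def candidate_causes(
    events: List[Dict],
    symptom_time: datetime,
    lookback: timedelta = timedelta(minutes=30),
) -> List[Dict]:
    """Return change events inside the lookback window before the symptom, newest first."""
    window_start = symptom_time - lookback
    hits = [
        event
        for event in events
        if event["kind"] in CHANGE_KINDS
        and window_start <= event["time"] <= symptom_time
    ]
    return sorted(hits, key=lambda event: event["time"], reverse=True)
```

Sorting newest first reflects the heuristic that the most recent change is usually the first one worth checking; it is a starting point for investigation, not proof of causation.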

To repeat the final word from the first part of this blog series: monitoring and observability go hand in hand. One does not replace the other, but together they enable and enhance defined business outcomes.

This is part two of a two-part blog; the first part, which explains how monitoring and observability go hand in hand, is available here.
