The Relativity of Wrong: Navigating relationships across monitored entities

I have described the Cisco FSO Platform as an extensible, developer-friendly platform that can ingest all kinds of telemetry and correlate that data into meaningful insight.

But what does that really mean? Some readers told me it's an abstract concept, and they don't see how it relates to their daily job in IT Operations.

Let's define telemetry first: it is all the data that you can get from a running system, such as a Formula 1 car on the race track (speed, consumption, temperature, remaining fuel, etc.) or your IT systems, which include applications, infrastructure, cloud, network, etc. In the IT case, data comes in the form of Metrics (any number you can measure), Events (something that happened at an instant in time), Logs (information written by a system somewhere) and Traces (a description of the execution of a process).
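To make the four signal types a bit more concrete, here is a minimal sketch of how they could be modelled as plain data structures. The class and field names are my own assumptions for illustration, not the data model of the Cisco FSO Platform or of any specific standard.

```python
from dataclasses import dataclass, field
from typing import List


# All class and field names below are hypothetical, chosen for illustration only.

@dataclass
class Metric:
    """Any number you can measure, sampled at a point in time."""
    name: str
    value: float
    timestamp: float  # seconds since the epoch


@dataclass
class Event:
    """Something that happened at an instant in time."""
    name: str
    timestamp: float


@dataclass
class Log:
    """Information written by a system somewhere."""
    message: str
    timestamp: float


@dataclass
class Span:
    """One step in the execution of a process."""
    operation: str
    start: float
    end: float


@dataclass
class Trace:
    """Description of the execution of a process, as a collection of spans."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    @property
    def duration(self) -> float:
        # Time between the earliest start and the latest end of any span.
        if not self.spans:
            return 0.0
        return max(s.end for s in self.spans) - min(s.start for s in self.spans)


# Example: a trace made of two overlapping steps.
checkout = Trace("t1", [Span("checkout", 0.0, 0.8), Span("db query", 0.1, 0.5)])
print(checkout.duration)
```

The point of the sketch is only that each signal type carries a different shape of information: a number, an instant, a text record, or a timed sequence of steps.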

This is the origin of the acronym MELT, which you see written on the walls these days. Everyone is excited by Observability, which is the ability to infer the internal state of a system by looking at its external signals (e.g. by collecting MELT). Generally, Observability is realised within a domain: a consistent set of assets of the same type (technologies, devices, or business processes). Examples: network monitoring, application performance monitoring (APM), etc.

The fun comes when you're able to correlate MELT to investigate the root cause of an issue, to find opportunities for optimising either performance or cost, or to demonstrate to business stakeholders that all the business KPIs are OK thanks to the good job done by the IT Operations folks :-)

It gets even better when you're able to correlate MELT across different domains, to extend observability end-to-end. Then the entire business architecture is under control. You can navigate all the relationships that link the entities that are relevant in your monitoring, and see whether any of them is affecting the global outcome (faults, bottlenecks, etc.).

Example: LinkedIn

One illuminating example of this type of navigation is the parallel with the LinkedIn website, where you explore your network of contacts to find a specific person, or information about their professional role, their company, or their activity.

Every IT professional I know has a profile on LinkedIn, and each of them generates information: they post articles or photos, they react to others' posts (by reposting, recommending or liking them), they advertise events, they update their profile (all of this can be compared to generating MELT). In addition, everyone is connected to other people, so that you have 1st degree (direct) connections but also 2nd degree connections that you inherit from the 1st degree ones.
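The idea of 1st and 2nd degree connections can be sketched as a breadth-first search over a small graph. This is only an illustration: the people and the `connection_degrees` function are hypothetical, and it says nothing about how LinkedIn actually computes degrees.

```python
from collections import deque
from typing import Dict, Set


def connection_degrees(
    network: Dict[str, Set[str]], start: str, max_degree: int = 2
) -> Dict[str, int]:
    """Breadth-first search: map each person within max_degree hops to their degree."""
    degrees = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if degrees[person] == max_degree:
            continue  # do not explore beyond the requested degree
        for contact in network.get(person, set()):
            if contact not in degrees:
                degrees[contact] = degrees[person] + 1
                queue.append(contact)
    del degrees[start]  # the starting person is not a connection of themselves
    return degrees


# Hypothetical network of contacts (connections are listed in both directions).
network = {
    "me": {"alice", "bob"},
    "alice": {"me", "carol"},
    "bob": {"me", "carol", "dave"},
    "carol": {"alice", "bob", "erin"},
    "dave": {"bob"},
    "erin": {"carol"},
}

degrees = connection_degrees(network, "me")
print(sorted(degrees.items()))
```

Here alice and bob are 1st degree connections, carol and dave are 2nd degree, and erin (3rd degree) is left out because the search stops at two hops.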

Click on the video below to see a graphical representation of the navigation across a network of connections on LinkedIn, and the flow of information generated by each of the people in the network.

Now you can imagine a similar network of logical connections among the entities that you monitor with the Full Stack Observability platform. You can explore how they are related to each other, and how each one affects the behaviour and the outcome of the others.

In a typical IT scenario, the entities might be the navigation of a user in the software application that supports a digital service (a Business Transaction), a service, the Kubernetes cluster where the service is running, a K8s node, the server running the node (which might be a VM in the cloud), the network segment that connects to the cloud, the cost of cloud resources, and the carbon footprint generated by the infrastructure.

Correlation

All the relationships among the monitored entities are explicitly shown in the user interface, and you can move your focus to another object and inspect it, accessing its current health state, its history, and all the Metrics, Events, Logs and Traces it has generated. This makes it extremely easy to understand whether an issue detected in one of the entities propagates to others, affecting the way they work.

The Health Rules that you define for one entity can also include the evaluation of related entities, so that you roll up warnings and awareness to the top level based on what the supporting entities are doing.
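The roll-up idea can be illustrated with a small sketch. It assumes a simple "worst state wins" policy over everything that supports an entity, directly or indirectly; that policy, the `Health` levels, the entity names and the `rolled_up_health` function are all my assumptions for illustration, not the way Health Rules are actually defined in the platform.

```python
from enum import IntEnum
from typing import Dict, List


class Health(IntEnum):
    # Hypothetical health levels, ordered from best to worst.
    NORMAL = 0
    WARNING = 1
    CRITICAL = 2


def rolled_up_health(
    entity: str,
    own_health: Dict[str, Health],
    supports: Dict[str, List[str]],
) -> Health:
    """Return the worst health found in the entity and in everything supporting it."""
    seen = set()
    stack = [entity]
    worst = Health.NORMAL
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and shared dependencies
        seen.add(current)
        worst = max(worst, own_health.get(current, Health.NORMAL))
        stack.extend(supports.get(current, []))
    return worst


# Hypothetical chain: business transaction -> service -> cluster -> nodes.
supports = {
    "checkout": ["payment-service"],
    "payment-service": ["k8s-cluster"],
    "k8s-cluster": ["node-1", "node-2"],
}
own_health = {"node-2": Health.WARNING}  # everything else is NORMAL

print(rolled_up_health("checkout", own_health, supports).name)
```

In this example the business transaction itself is healthy, but the warning on one of the nodes surfaces at the top level because the node supports the cluster, which supports the service behind the transaction.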

In this screenshot I've highlighted the list of relationships in the panel on the left side, with a green dashed line. That list continues, so scrolling down you would also see Workloads, Pods, Containers, Hosts, Configurations, Persistent Volume Claims, Ingresses, Load Balancers and Teams (yes, the organisational teams that are responsible for this cluster). The number on each entity type shows how many objects of that type are related to the one (the K8s cluster) that is currently in focus in the central pane.

Though we have information about all the entities in the system, all the objects that are not directly related to the entity in focus are automatically hidden from the list, to remove what we call the "background noise". Showing only what really matters increases focus, and makes the investigation easier. You can click, let's say, on the two Business Transactions (luckily, in this example both are in a green health state) to see what business processes would be impacted by a problem occurring in this K8s cluster.
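The behaviour of the relationship list (count the directly related objects per type, hide everything else) can be sketched in a few lines. The `Entity` class, the `related_counts` function and the sample data are hypothetical and only mirror what the screenshot shows; this is not the platform's API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    # Hypothetical minimal entity: a type label and a name.
    type: str
    name: str


def related_counts(focus: Entity, relationships: List[Tuple[Entity, Entity]]) -> Counter:
    """Count, per entity type, the objects directly related to the entity in focus."""
    counts: Counter = Counter()
    for a, b in relationships:
        if a == focus:
            counts[b.type] += 1
        elif b == focus:
            counts[a.type] += 1
        # Relationships that do not involve the focus are "background noise".
    return counts


cluster = Entity("K8s Cluster", "prod")
node_1 = Entity("Node", "node-1")
relationships = [
    (Entity("Business Transaction", "checkout"), cluster),
    (Entity("Business Transaction", "login"), cluster),
    (cluster, node_1),
    (cluster, Entity("Node", "node-2")),
    (cluster, Entity("Node", "node-3")),
    (node_1, Entity("Pod", "pod-a")),  # not directly related to the cluster: hidden
]

print(dict(related_counts(cluster, relationships)))
```

With the cluster in focus, the sketch reports two Business Transactions and three Nodes, while the pod stays hidden in this toy data set because it is only linked to a node. Moving the focus to `node_1` would instead show the cluster and the pod.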

Of course, scrolling down we would see in the central panel all the information available about this cluster, including all the MELT it has generated in the time interval under investigation (see the options below).

What I have described in this post are just the basic capabilities of the Cisco FSO Platform. You can find the full details in the official documentation.

In upcoming posts, I'll explain the most relevant use cases and the impact that Full Stack Observability can have on your business.

July 14, 2023