The Relativity of Wrong

July 28, 2023

Why Application Security is important (and complementary to perimeter security)

Outstanding application security is foundational to a brand's reputation, creating and building trust and loyalty with users. But vulnerabilities can occur anytime, anywhere (in your code, in commercial applications, in libraries you've integrated and in remote APIs that you invoke), making it difficult and time-consuming to prioritize responses.

<Suggestion for people in a rush> If you only have 5 minutes, just scroll down and look at the amazing recorded demo: it explains everything better than the post itself </Suggestion for people in a rush>

Avoiding costly delays that can result in continuing damage to revenue and brand reputation means organizations must have clear visibility into each new vulnerability and the insights needed to prioritize remediation based on their business impact.

The traditional security model, based only on protecting the perimeter with firewalls and IPS, is no longer sufficient. You need to protect the full stack, including all the software tiers.

Business Risk Observability

Speed and coordination are paramount when dealing with application security risks.

Bad actors can take advantage of gaps and delays between siloed security and application teams, resulting in costly and damaging consequences. Traditional vulnerability and threat scanning solutions lack the shared business context needed to rapidly assess risks and align teams based on potential business impact. To triage and align teams as fast as possible, teams need to know where vulnerabilities and threats impact their applications, how likely a risk is to be exploited, and how much business risk each issue presents.

One fundamental use case in Full-Stack Observability is business risk observability, supported by new security intelligence capabilities that bring business context into application security. The new business risk scoring gives security and application teams greater threat visibility and intelligent business risk prioritization, so that they can respond instantly to revenue-impacting security risks and reduce the organization's overall risk profile.

New Cisco Secure Application features and functionalities include business transaction mapping to understand how and where an attack may occur; threat intelligence feeds from Cisco Talos, Kenna, and Panoptica; and business risk scoring.

Business Transaction Mapping

New business transaction mapping locates how and where an attack may occur within common application workflows like ‘login, checkout, or complete payment’ so that ITOps and SecOps professionals can instantly understand the potential impact to your application and your bottom line.

Threat Intelligence Feeds

New threat intelligence feeds from Cisco Talos, Kenna, and Panoptica provide valuable risk scores from multiple sources to assess the likelihood of threat exploits.

Business Risk Scoring (for Security Risk Prioritization)

New business risk scoring combines threat and vulnerability intelligence, business impact and runtime behavior to identify the most pressing risks, avoiding delays and speeding up the response across teams.
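
To make the idea of combining these signals concrete, here is a purely illustrative Python sketch. It is not Cisco's scoring algorithm (which this post does not describe): the field names, the 0..1 scales, the runtime factor and the multiplication are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """A vulnerability seen on a service (all fields are hypothetical)."""
    name: str
    exploit_likelihood: float   # 0..1, e.g. derived from threat intelligence feeds
    business_impact: float      # 0..1, importance of the affected business transaction
    reachable_at_runtime: bool  # is the vulnerable code actually loaded and exposed?


def risk_score(f: Finding) -> float:
    """Combine the three signals into a 0..100 score (assumed formula)."""
    runtime_factor = 1.0 if f.reachable_at_runtime else 0.2
    return round(100 * f.exploit_likelihood * f.business_impact * runtime_factor, 1)


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Return the findings with the most pressing risks first."""
    return sorted(findings, key=risk_score, reverse=True)


findings = [
    Finding("lib-A deserialization flaw", 0.9, 0.3, True),
    Finding("lib-B injection in checkout", 0.6, 1.0, True),
    Finding("lib-C flaw in unused module", 0.9, 1.0, False),
]
for f in prioritize(findings):
    print(f"{risk_score(f):5.1f}  {f.name}")
```

In this made-up data set the checkout issue comes first even though its exploit likelihood is lower, because it touches a more important business transaction and is reachable at runtime.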

Video Demonstration of the Business Risk Observability use case

See a complete, explanatory demonstration of how a risk index associated with your business transactions lets you discover and remediate vulnerabilities with a proper priority assessment:

https://video.cisco.com/detail/video/6321988561112

July 14, 2023

Navigating relationships across monitored entities

I have described the Cisco FSO Platform as an extensible, developer-friendly platform that can ingest all kinds of telemetry and is able to correlate those data into meaningful insights.

But... what does it really mean? Some readers told me it's an abstract concept: they don't get how it relates to their daily job in IT Operations.

Let's define telemetry first: it is all the data that you can get from a running system, like a Formula 1 car running on the race track (speed, consumption, temperature, remaining fuel, etc.), or from your IT systems, which include applications, infrastructure, cloud, network, etc. In this case, data come in the form of Metrics (any number you can measure), Events (something that happened at an instant in time), Logs (information written by a system somewhere) and Traces (description of the execution of a process).
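
A minimal Python sketch of the four kinds of signal may help to fix the idea. The class fields and the sample values (a hypothetical car "car-44") are assumptions chosen for illustration; they are not the OpenTelemetry data model or any product schema.

```python
from dataclasses import dataclass


@dataclass
class Metric:          # any number you can measure
    entity: str
    name: str
    value: float
    timestamp: float


@dataclass
class Event:           # something that happened at an instant in time
    entity: str
    description: str
    timestamp: float


@dataclass
class Log:             # information written by a system somewhere
    entity: str
    message: str
    timestamp: float


@dataclass
class Trace:           # description of the execution of a process
    entity: str
    operation: str
    start: float
    duration_ms: float


telemetry = [
    Metric("car-44", "speed_kmh", 312.0, timestamp=100.0),
    Event("car-44", "pit stop started", timestamp=160.0),
    Log("car-44", "tyre temperature sensor recalibrated", timestamp=161.5),
    Trace("car-44", "pit stop", start=160.0, duration_ms=2400.0),
]

# Group the signals by kind; the initials of the kinds spell the acronym.
by_kind = {}
for item in telemetry:
    by_kind.setdefault(type(item).__name__, []).append(item)
print(list(by_kind))
```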

This is the origin of the acronym MELT, which you see written on the walls these days. Everyone is excited by Observability, that is, the ability to infer the internal state of a system by looking at its external signals (e.g. by collecting MELT). Generally, Observability is realised within a domain: a consistent set of assets of the same type (technologies, devices, or business processes). Examples: network monitoring, application performance monitoring (APM), etc.

The fun comes when you're able to correlate MELT to investigate the root cause of an issue, or to find spots for optimising either performance or cost, or to demonstrate to business stakeholders that all the business KPIs are OK thanks to the good job done by the IT Operations folks :-)

Even better when you're able to correlate MELT across different domains, to extend observability end-to-end. The entire business architecture is under control. You can navigate all the relationships that link the entities that are relevant in your monitoring, and see if any of those is affecting the global outcome (faults, bottlenecks, etc.).

Example: LinkedIn

One illuminating example of this type of navigation is the parallel with the LinkedIn website, where you explore your network of contacts to find a specific person, or information about their professional role, their company, their activity.

Every IT professional I know has a profile on LinkedIn, and each of them generates information: they post articles or photos, they react to others' posts (either repost, or suggest/like them), they advertise events, they update their profile (this can be compared to generating MELT). In addition, everyone is connected to other people, so that you have 1st degree (direct) connections but also 2nd degree connections that you inherit from the 1st degree ones.

Click on the video below to see a graphical representation of the navigation across a network of connections on LinkedIn, and the flow of information generated by each of the people in the network.

Now you can imagine a similar network of logical connections among entities that you monitor with the Full Stack Observability platform. You can explore how they are related to each other, and how every one affects the behaviour and the outcome of the others.

In a typical IT scenario, the entities might be the navigation of a user in the software application that supports a digital service (a Business Transaction), a service, the Kubernetes cluster where the service is running, a K8s node, the server running the node (that might be a VM in the cloud), the network segment to connect to the cloud, the cost of cloud resources, the carbon footprint generated by the infrastructure.

Correlation

All the relationships among the monitored entities are explicitly shown in the user interface, and you can move your focus to another object and inspect it, accessing the current health state, its history, and all the Metrics, Events, Logs and Traces it has generated. This makes it extremely easy to understand whether an issue detected in one of the entities propagates to others, affecting the way they work.

The Health Rules that you define for one entity can also include the evaluation of related entities, so that warnings roll up to the top level based on the state of the supporting entities.
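
The roll-up idea can be sketched in a few lines of Python. This is only a conceptual illustration, not the platform's Health Rule engine: the `Entity` class, the three health levels and the "worst state wins" rule are assumptions, and the sketch expects the dependency graph to have no cycles.

```python
from dataclasses import dataclass, field

# Assumed ordering of health states, from best to worst.
SEVERITY = {"normal": 0, "warning": 1, "critical": 2}


@dataclass
class Entity:
    name: str
    own_health: str = "normal"
    supported_by: list["Entity"] = field(default_factory=list)


def rolled_up_health(entity: Entity) -> str:
    """Worst health among the entity itself and everything supporting it."""
    worst = entity.own_health
    for dep in entity.supported_by:
        candidate = rolled_up_health(dep)
        if SEVERITY[candidate] > SEVERITY[worst]:
            worst = candidate
    return worst


node = Entity("k8s-node-1", own_health="warning")
cluster = Entity("k8s-cluster", supported_by=[node])
service = Entity("payment-service", supported_by=[cluster])
checkout = Entity("Checkout (Business Transaction)", supported_by=[service])

print(rolled_up_health(checkout))
```

Here the warning raised by a single node surfaces at the Business Transaction level, even though the transaction itself reports no problem of its own.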

In this screenshot I've highlighted the list of relationships in the panel on the left side, with a green dashed line. That list continues, so scrolling down you would also see Workloads, Pods, Containers, Hosts, Configurations, Persistent Volume Claims, Ingresses, Load Balancers and Teams (yes, the organisational teams that are responsible for this cluster). The number on each entity type shows how many objects of that type are related to the one (the K8s cluster) that is currently in focus in the central pane.

Though we have information about all the entities in the system, all the objects that are not in direct relationship with the entity in focus are automatically hidden in the list, to remove what we call the "background noise". Showing only what really matters increases focus, and makes the investigation easier. You can click, let's say, on the two Business Transactions (luckily in this example both are in green health state) to see what business processes would be impacted by a problem occurring in this K8s cluster.
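
As a rough sketch of that behaviour, the Python below builds the relationship list for the entity in focus: it counts directly related entities per type and never shows anything unrelated. The entity names, types and relationships are invented for the example and do not come from the screenshot.

```python
from collections import Counter

# Hypothetical entities: name -> type
entity_type = {
    "cluster-1": "K8s Cluster",
    "checkout": "Business Transaction",
    "login": "Business Transaction",
    "cart-svc": "Service",
    "node-a": "K8s Node",
    "node-b": "K8s Node",
    "cluster-2": "K8s Cluster",
    "node-z": "K8s Node",
}

# Undirected relationships between entities
relationships = [
    ("cluster-1", "checkout"),
    ("cluster-1", "login"),
    ("cluster-1", "cart-svc"),
    ("cluster-1", "node-a"),
    ("cluster-1", "node-b"),
    ("cluster-2", "node-z"),
]


def related_to(focus: str) -> set[str]:
    """Entities in direct relationship with the one in focus."""
    related = set()
    for a, b in relationships:
        if a == focus:
            related.add(b)
        elif b == focus:
            related.add(a)
    return related


def relationship_panel(focus: str) -> Counter:
    """Count related entities per type; unrelated ones never appear."""
    return Counter(entity_type[name] for name in related_to(focus))


panel = relationship_panel("cluster-1")
for type_name, count in sorted(panel.items()):
    print(f"{type_name}: {count}")
```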

Of course, scrolling down we would see in the central panel all the information available about this cluster, including all the MELT it has generated in the time interval under investigation (see the options below).

What I have described in this post are just the basic capabilities of the Cisco FSO Platform. You can find the full details in the official documentation.

In the next posts, I'll explain the most relevant use cases and the impact that Full Stack Observability can have on your business.

July 8, 2023

FSO Platform: see everything, correlate everything

The Cisco Full Stack Observability Platform

Cisco has been the first vendor to offer an end-to-end observability solution, based on complementary products that are integrated with each other. The use cases described in my previous post are served by a combination of AppDynamics and ThousandEyes, with information fed by first-class security systems such as Talos, Kenna and Panoptica (more in the next posts).

Even if another vendor had such extensive coverage (and none has), their products would not be integrated out of the box. The native integration enhances the power of each product (Application Ops also see the network, Network Ops also see the applications, Security Ops see everything, everybody gets the business context) and saves a lot of the time and effort that a custom integration would require.

But we think this is not enough.

Some companies are already very advanced in their journey to Observability. They have already adopted advanced solutions from APM vendors (including Cisco and competitors), network monitoring and cloud services monitoring. Some have built sophisticated home grown systems for Observability and AIOps.

They might find that the predefined view of the world that is implemented in traditional APM solutions is not enough. Entities like an Application, a Service, a Business Transaction and their relationships might not be sufficient to describe their business domain, or a technical domain that is more complicated than common architectures. They would like to extend the domain model, but they can't because the solution has not been designed for extensibility.

Extensibility of the Observability solution

What they are looking for is the possibility to extend their visibility, and the possibility to correlate collected information to describe what's relevant for them.

Here comes the Cisco FSO Platform.

The Cisco FSO Platform is an open, extensible, API driven platform that empowers a new observability ecosystem for organizations. It is a unified platform built on OpenTelemetry (an open source project by CNCF) and anchored on metrics, events, logs and traces (MELT), enabling extensibility from queries to data models with a composable UI framework.

Cisco FSO Platform is a developer friendly environment to build your own view of the world.

You can tailor the Full Stack Observability to your business domain, or to your technical domain, defining the entities that are relevant for your stakeholders and the relationships that tie them. From business processes to every asset included in your architecture: applications, infrastructure, cloud, network, IoT and business data sources.

This creates a series of connections that you can navigate to fully control what's going on, as you do on LinkedIn when exploring a network of people and the information they generate (see next post for an example). All of this is based on telemetry that you can collect from virtually everything: Metrics, Events, Logs and Traces. A new open standard, OpenTelemetry (supported by vendors and by the open source community), defines the way data are collected and ingested. These data feed the domain model, and you can later use them to investigate the root cause of any issue, to report on the health state of the business, or to look for opportunities to improve efficiency.

The Cisco FSO platform is a differentiated solution that brings data together from multiple domains such as application, networking, infrastructure, security, cloud and business sources. Users can get correlated insights that reduce time to resolve issues and optimize experiences, while Partners, ISVs, and software developers can now build meaningful FSO applications enabling new use cases.

So are there alternative solutions for Full Stack Observability?

In their evolution from traditional monitoring, organizations go through some maturity steps. It's not a revolution in one day.

Some start by replacing individual tools with more complete solutions that unify the visualization of metrics collected from different technical domains. Others start correlating those data with business metrics and KPIs. Then they extend observability to - really - the full stack.

For all of them, the solution that I started describing in my previous post provides excellent value. The seven use cases I've mentioned are completely supported by the Cisco FSO solution based on the integration of AppDynamics, ThousandEyes and the security ecosystem. It's well integrated and offers the various operations teams deep visibility as well as shared business context.

Some organizations are already in a more advanced state. They have already implemented Full Stack Observability, either by adopting the Cisco solution or a competing one, or by growing an AIOps system in house. But they feel that they need more, because their business domain (or parts of their technical domain) is not completely covered by the solution they have.

Thanks to the Cisco FSO Platform, which is extensible and developer-friendly, they can build the needed extensions themselves (or have a look at the Cisco FSO App Exchange). This powerful engine, which backs all the Cisco FSO products, will allow those organizations to ingest telemetry from virtually every asset and to show correlated data based on their desired view of the world.

So in the end we have two parallel motions that don't necessarily conflict. The adoption of one or the other depends on your current observability maturity level and your specific need for tailored dashboards.

In the next post I will draw a parallel between the navigation across your LinkedIn network of contacts and the navigation through connected entities in the FSO Platform to search for the root cause of an issue by exploring the Metrics, Events, Logs and Traces associated with each entity.

Subsequently, I will describe fundamental use cases like Business Risk Observability.

June 29, 2023

Full Stack Observability use cases

Business Use Cases

Full Stack Observability is all about collecting any possible data from the applications running your digital services (i.e. business KPIs) and from the infrastructure and cloud resources supporting them (i.e. the telemetry), potentially including IoT, robots or any other device involved in the process.

And then correlating those data to create an actionable insight, so that you have full control of your business processes end-to-end and you do better than your competitors (faster, more reliable, more appealing processes and services).

The FSO value proposition is not only related to technology (the infrastructure that you can monitor and the metrics you can read). It is a business value proposition, because observability has an immediate impact on the business outcomes.

Associating business processes, and the digital services supporting them, with the health state of the infrastructure gives the Operations teams an immediate and objective measure of the value - or the troubles - that IT provides to their internal clients, which are the lines of business (LOB). And LOB managers can enjoy dedicated dashboards that show how the business is doing, highlighting all the key performance indicators (KPIs) that are relevant for each persona in the organization.

If there is any slowdown in the business, they see it instantly and can possibly relate it to a technical problem, or maybe to the release of a new version of a software application, or to the launch of a new marketing campaign. The outcome of any action and of any incident is connected to the business with... no latency. The same visibility is also useful when the business shows a better performance than the day before. You can relate outcomes to actions and events.

So, before speaking about the technology that supports Full Stack Observability, let's discuss the use cases and their impact.

We can group the use cases into three categories: Observe, Secure and Optimize (all referring to your end-to-end business architecture).

In the Observe category, we have 4 fundamental use cases:

- Hybrid application monitoring

This refers to every application running on Virtual Machines, in any combination of your Data Center and Public Clouds, or on bare metal servers.

You can relate the business KPI (users served, processes completed, amount of money, etc.) to the health state of the software applications and the infrastructure. You can identify the root cause of any problem and relate it to the business transactions (= user navigation for a specific process) that are affected.

- Cloud native application monitoring

Same as the previous use case, but for applications designed according to cloud native patterns (e.g. microservices architecture) that run on Kubernetes or OpenShift, regardless of whether they run on premises, in the cloud, or in a hybrid scenario. Traditional APM solutions were not so strong on this use case, because they were designed for older architectures.

- Customer digital experience monitoring

Here the focus is on the experience from the end user's perspective, which is affected by the performance of both the applications and the infrastructure, but also - and mostly - by the network. Network problems can affect the response time and the reliability of the service because the end user needs to reach the end point where the application runs (generally a web server), the front end needs to communicate with application components distributed everywhere, and these may invoke remote APIs exposed by a business partner (e.g. a payment gateway or any B2B service).

- Application dependency monitoring

In this use case you want to assure the performance of managed and unmanaged (third-party) application services and APIs, including performance over Internet and cloud networks to reach those services. Visibility of network performance and availability, including both public networks and yours, is critical to resolve issues and to push service providers to respect the SLA of the contract.

In the Secure category, we can discuss the Business Risk Observability use case:

- Application security

Reduce business risk by actively identifying and blocking against vulnerabilities found in application runtimes in production. Associate vulnerabilities with the likelihood that they are exploited in your specific context, so that you can prioritize the suggested remediation actions based on the business impact (shown by the association of vulnerabilities with Business Transactions).

In the Optimize category, we have the following use cases:

- Hybrid cost optimization

Lower costs by only paying for what you need in public cloud and by safely increasing utilization of on-premises assets.

- Application resource optimization

Improve and assure application performance by taking the guesswork out of resource allocation for workloads on-premises and in the public cloud.

Observability and network intelligence coming together

The use cases listed above go beyond the scope of traditional APM (Application Performance Monitoring) solutions because they require extending visibility to every segment of the network. This picture shows an example of possible issues that can affect the end user experience, and that need to be isolated and remediated to make sure the user is happy.

That is generally difficult, and requires a number of subject matter experts in different domains, and a number of tools. Very few vendors can offer all the complementary solutions that give you visibility on all aspects of the problem. And, of course, they are not integrated (vertical, siloed monitoring).

Data-driven bi-directional integration

The Full Stack Observability solution from Cisco, instead, covers all the angles and - in addition - it does so in an integrated fashion. The APM tool (AppDynamics) and the Network Monitoring tool (ThousandEyes) are integrated bidirectionally through their APIs (out of the box, no custom integration is required).

The visibility provided by one tool is greatly enhanced by data coming from the other tool, which are correlated automatically and shown in the same console.

So, if you're investigating a business transaction, you don't see just the performance of the software stack and its distributed topology, but also the latency, packet loss, jitter and other network metrics in the same context (exactly in the network segments that impact the traffic for that single business transaction, at that instant in time).

Similarly, if you're looking at a network, you immediately know which applications and business transactions would be affected if it fails or slows down. And automated tests to monitor the networks and the end points can be generated automatically from the application topology that the APM tool has discovered.

Exciting times are coming: Operations teams can expect their life to be much easier when they start adopting a Full Stack Observability approach. More details in the next posts...

June 14, 2023

Changing the focus of this blog: now... Observability

My previous post about Infrastructure as Code concludes the exploration of Data Center and Cloud solutions and the methodologies that are related to automation and optimization of IT processes.

I've been working in this area for 15 years, after spending the previous 15 in software development.
It's been an amazing adventure and I really enjoyed learning new things, exploring and challenging some limits - and sharing the experience with you.

Now I start focusing on a new adventure... possibly for the next 15 years 😜

I have taken on a new professional role: technical lead for Full Stack Observability, EMEA sales, at Cisco AppDynamics. From now on, I will tell you stories about my experience with the ultimate evolution of monitoring: it's all about collecting telemetry from every single component of your business architecture, including digital services (= distributed software applications), computing infrastructure, network, cloud resources, IoT, etc.

It's not just putting all those data together, but correlating them to create an insight. Transforming raw data into information about the health state of your processes, matching business KPI with the state of the infrastructure that supports the services.

To visualize that information and to navigate it, you can subscribe to (or create your own) different domain models, that are views of the world built specifically for each stakeholder: from lines of business to applications managers, from SRE to network operations and security teams...

A domain model is made of entities and their relationships, where entities represent what is relevant in your (business or technical) domain. They might be the usual entities in an APM domain (applications, services, business transactions...) or infrastructure entities (servers, VMs, clusters, K8s nodes, etc.).
You can also model entities in a business domain (e.g. trains, stations, passengers, tickets, etc.).

Unlike Application Performance Monitoring (APM), where solutions like AppDynamics and its competitors excel at drilling down into the application architecture and its topology, with Full Stack Observability you really have full end-to-end control and a context shared among all the teams that collaborate in building, running and operating the business ecosystem.

New standards like OpenTelemetry make it easy to convey Metrics, Events, Logs and Traces (MELT) to a single platform from every part of the system, possibly including robots in manufacturing, GPS tracking from your supply chain, etc.

All these data will be filtered according to the domain model and those that are relevant will feed the databases containing information about the domain entities and their relationships, that are used to populate the dashboard.

Those data will be matched with any other source of information that is relevant in your business domain (CRM, sales, weather forecast, logistics...) so that you can analyse and forecast the health of the business and relate it to the technologies and the processes behind it. You can immediately remediate any problem because you detect the root cause easily, and even be proactive in preventing problems before they occur (or before they are perceived by end users). At the same time, you are able to spot opportunities for optimising the performance and the cost efficiency of the system.

To see the official messaging from Cisco about Full Stack Observability, check this page describing the FSO Platform.

Stay tuned, interesting news is coming...

June 9, 2023

Infrastructure as Code: making it easy with Nexus as Code

In previous posts I've described the advantage provided by managing the infrastructure the same way developers manage the application code.

Infrastructure as Code means using the same toolset (version control systems, pipeline orchestrators, automated provisioning) and same processes for building, integrating, testing and releasing the system that are used in the release cycle of a software application. This approach has a positive impact on speed, reliability and security end to end.

Together with Ansible, Terraform is one of the most used tools in the automated provisioning space, and many organizations use it when they adopt Infrastructure as Code. The availability of plugins (Terraform Providers) for almost every possible target (physical and virtual servers, network and storage, cloud services, etc.) makes it a common platform for automation: a "de facto" standard.

Like many other technology vendors, Cisco offers Terraform Providers wrapping the APIs of its products, especially for Data Center and Cloud technologies. The Nexus family of switches, which includes the ACI fabric architecture, is no exception. You can provision and manage the ACI fabric easily with Terraform (as well as with Ansible), and many examples and reusable assets are available at DevNet.

Generally, Terraform Providers surface the object model of the target system so that resources and their relationships can be managed easily in a configuration plan, representing the desired state of the system. You need to understand how that particular system works and, in some cases, to manage the relationships among the identifiers of managed objects explicitly.

This is an example of creating a tenant in ACI, and a VRF contained in it:

Some engineers find this object model, and the use of HCL (HashiCorp Configuration Language), easy and comfortable. Others, maybe due to limited experience, would prefer an easier syntax and a simpler object model.

For this reason Cisco has created a module called Nexus as Code, which sits on top of the standard ACI provider for Terraform, hiding the perceived complexity and offering a simplified object model. The objects that are contained in each other are simply nested and represented in a way that's very close to the conceptual representation of the logical architecture (represented by the following picture).

Nexus as Code can be seen as an (optional) component in the Terraform solution to automate ACI and other network controllers from Cisco.

Using a configuration language as simple as YAML, nesting is represented with indentation in Nexus as Code. This example corresponds to the HCL snippet above:

This format is particularly suitable for copy/paste operations, that make it easy to clone and modify a template so it is ready for a new project.

If you start from the example above, by simply copying one line you can have one more VRF created and contained in the same tenant. Definitely simpler than doing the same in an HCL file, and encouraging for a network engineer using Terraform for the first time.

All you need is a folder to store one or more YAML files defining the desired state of the ACI fabric, and the installation of the Terraform binary file (free download from here). After that, you will just use the following two commands:

terraform init (that makes sure that the needed providers are installed and, if necessary, downloads them automatically)

terraform apply (that reads the input, evaluates the changes required to align the state of the target fabric with the desired state, then calls the API of the ACI controller)

When you confirm the apply, you will see the log of the execution and finally a message telling you that the job is done.
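
The "evaluate the changes required" step can be pictured as a comparison between the desired state and the current one. The Python below is only a conceptual sketch of that comparison, not how Terraform is implemented; the resource keys and attributes are made up for the example.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Return the changes needed to move `actual` to `desired`."""
    return {
        "create": sorted(k for k in desired if k not in actual),
        "update": sorted(k for k in desired if k in actual and desired[k] != actual[k]),
        "delete": sorted(k for k in actual if k not in desired),
    }


# Hypothetical resource keys: what we want versus what exists today.
desired = {
    "tenant/demo": {"description": "demo tenant"},
    "tenant/demo/vrf/vrf1": {},
    "tenant/demo/vrf/vrf2": {},
}
actual = {
    "tenant/demo": {"description": "old text"},
    "tenant/demo/vrf/vrf1": {},
    "tenant/demo/vrf/old": {},
}

changes = plan(desired, actual)
print(changes)
```

When the two states already match, the three lists are empty and there is nothing to apply, which is why running the same configuration twice is harmless.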

I believe that Nexus as Code is a powerful tool that may help engineers approach the IaC (Infrastructure as Code) methodology more easily, without the stress of learning new complex technologies and tools.

Being based on standard, open-source tools, it does not introduce any lock-in with Cisco technologies.
It simply translates easy-to-manipulate YAML files, which describe your desired state, into plain Terraform plans that are executed automatically.
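
To give a feeling for what such a translation involves, this Python sketch flattens a nested description into individual resources with an explicit reference to their parent. It is an illustration of the concept only: the keys (`name`, `vrfs`) are hypothetical, not the Nexus as Code schema, and the real module does this work in Terraform rather than in Python.

```python
def flatten(tenants: list[dict]) -> list[dict]:
    """Turn nested tenant/VRF data into flat records with explicit parents."""
    resources = []
    for tenant in tenants:
        resources.append({"type": "tenant", "name": tenant["name"], "parent": None})
        for vrf in tenant.get("vrfs", []):
            resources.append({"type": "vrf", "name": vrf["name"], "parent": tenant["name"]})
    return resources


# What a YAML parser would produce for a nested description (hypothetical keys).
desired_state = [
    {"name": "demo", "vrfs": [{"name": "vrf1"}, {"name": "vrf2"}]},
]

for resource in flatten(desired_state):
    print(resource)
```

Adding one more entry under `vrfs` yields one more flat resource, with the parent reference filled in for you.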

So you can start adopting the same tools and same processes that developers use in building, integrating, testing and releasing the system, obtaining the same benefit in terms of speed, consistency, security and self-documentation.

Don't be shy, start today to experiment and see how easy it is 😜

July 4, 2022

Infrastructure as Code: tools and processes

In a previous post we have seen that Infrastructure as Code is a way of managing the infrastructure and the cloud resources, consisting of a set of processes and best practices.

But there is also a need for a set of tools, and this post will offer an overview of the most used tools in the industry. Most of them are free, open source tools, matched by a vendor-supported version that requires a license or a subscription. There are also SaaS versions that offer a free tier.

I'm describing the following tools in this post:

Version Control Systems
Automation tools (Ansible, Terraform)
Accessory tools for "scaffolding" (Vault, etc).

But, before we explore the tools, just a few more words about the process.

Programmable infrastructure

The simple fact that the infrastructure is programmable via the API it exposes does not mean that anyone can come and change its configuration and/or state.

We don't want anarchy, and even less do we want programmers to do whatever they like, bypassing the owners of the target technology domain. The administrators, who are also the SMEs (subject matter experts), have the responsibility to ensure the reliability, performance and security of the system and cannot afford to let a naive developer compromise it.

So what we mean by treating the infrastructure as code is applying the same processes, and the same tools, as we do with the source code of the applications. The infrastructure provisioning and configuration should follow the same process that we implement for the applications: write the code, version the code, test it statically for quality and security, deploy it automatically, test it dynamically (functional, performance, reliability and security tests), then deploy it in the production environment. Generally, this happens within a CI/CD pipeline with a good level of automation (but the same sequence could also be executed manually).

Now that we have agreed on the basic principles, let's have a look at the tools.

Version Control Systems

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

The purpose of version control is to allow software teams to track changes to the code, while enhancing communication and collaboration between team members. Version control facilitates a continuous, simple way to develop software.

Since we want to manage the infrastructure as developers do with the software applications, we use the same organization for the files describing the desired state of the infrastructure (remember, infrastructure includes physical, virtual and cloud resources).

A central repository is the single source of truth. Local working copies can be used to evolve the code, to create new versions and test them. After validation, a new version is committed to the VCS (version control system). The most used tools are GitHub and GitLab, but there is a large choice available.

Pipeline orchestrators

When a new version is created, a number of activities must take place: they can be executed manually or, better, automated to increase speed and reduce vulnerability to human errors.

If any task or test fails, a notification is sent to the right stakeholder to solve the problem and the pipeline aborts. A new pipeline cycle will start after the issue has been fixed. You might build a single pipeline for the infrastructure and for the application deployment, or more often separate them into distinct processes: depending on the work organization and the availability of resources, there is no strict need to rebuild the target environment every time a new application version is released.
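The abort-and-notify behaviour can be sketched in a few lines of Python. This is only an illustration, not how any specific orchestrator works: the stage names, the `run_pipeline` function and the `notify` callback are all hypothetical, and each stage is reduced to a function that returns True on success.

```python
from typing import Callable, List, Tuple

# A stage is a (name, task) pair; the task returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage], notify: Callable[[str], None]) -> List[str]:
    """Run the stages in order; on the first failure, notify and abort."""
    completed = []
    for name, task in stages:
        if not task():
            notify(f"Stage '{name}' failed; pipeline aborted")
            break  # later stages are never executed
        completed.append(name)
    return completed


# Hypothetical stages: the dynamic tests simulate a failure.
messages: List[str] = []
stages = [
    ("static checks", lambda: True),
    ("deploy to test", lambda: True),
    ("dynamic tests", lambda: False),
    ("deploy to production", lambda: True),
]
completed = run_pipeline(stages, messages.append)
print(completed)
print(messages)
```

In a real setup the notification would go to a chat channel or a ticketing system rather than a list, and the stages would call the actual tools.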

Orchestrators for CI/CD pipelines can be open-source or commercial products or can be engines incorporated in the version control system.

The following picture shows an example of a pipeline:

Automation tools (Ansible, Terraform)

Those are not the only tools available for automation, but they are by far the most used.

They access the target system (infrastructure, cloud and components in the software stack) remotely, with no need for a local agent.

Generally the target APIs are wrapped in plugins for the automation engine (called Ansible modules and Terraform providers) that are built and supported either by the vendor of the target technology or by the open source community.

Ansible was born for managing servers, so its approach is more oriented towards configuration management. Terraform excels at provisioning resources, and leads you to concepts like immutable infrastructure (see below).

Both tools are great and let you define the desired state of the system, making sure that the current state matches the desired state. If it does not, changes are executed automatically, where necessary by destroying configuration items and recreating them as they need to be. Ansible, on the other hand, tolerates changing the configuration of existing resources in place, since it is more procedural than declarative.
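The idea of comparing the current state with the desired state can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the actual algorithm of Terraform or Ansible: resources are plain dictionaries, and the `plan` and `apply` functions and the resource names are hypothetical.

```python
def plan(current: dict, desired: dict) -> dict:
    """Compute the changes needed to move from the current to the desired state."""
    to_create = {k: v for k, v in desired.items() if k not in current}
    to_update = {k: v for k, v in desired.items()
                 if k in current and current[k] != v}
    to_delete = sorted(k for k in current if k not in desired)
    return {"create": to_create, "update": to_update, "delete": to_delete}


def apply(current: dict, changes: dict) -> dict:
    """Return a new state with the planned changes applied."""
    result = dict(current)
    for name in changes["delete"]:
        del result[name]
    result.update(changes["create"])
    result.update(changes["update"])
    return result


# Hypothetical resources, keyed by name.
current = {"vm-web": {"cpu": 2}, "vm-old": {"cpu": 1}}
desired = {"vm-web": {"cpu": 4}, "vm-db": {"cpu": 8}}

changes = plan(current, desired)
new_state = apply(current, changes)
print(changes)
print(plan(new_state, desired))  # nothing left to do once the states match
```

Running the plan a second time produces no changes, which is the property (idempotence) that makes a declarative description safe to apply repeatedly.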

Immutable Infrastructure

Immutable infrastructure is an approach to managing services and software deployments wherein components are replaced rather than changed. They are effectively redeployed each time any change occurs.

Traditionally, an application or service update requires that a component is changed in production, while the complete service or application remains operational. Immutable infrastructure instead relies on instantiating "golden images", where components are assembled on computing resources to form the service or application.

Once the service or application is instantiated, its components are set - thus, the service or application is immutable, unable to change. When a change is made to one or more components of a service or application, a new golden image is assembled, tested, validated and made available for use. Then the old instance is discontinued, to free the computing resources within the environment for other tasks.
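The "replace rather than change" principle can be mimicked in Python with a frozen dataclass. This is just an analogy under stated assumptions: the `Instance` class, the `roll_out` function and the image names are hypothetical, and a real rollout would involve building and validating the image with the actual tools.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Instance:
    image: str  # version of the golden image the instance was built from


def roll_out(running: list, new_image: str) -> list:
    """Build a fresh instance from new_image for each running instance."""
    return [Instance(image=new_image) for _ in running]


fleet = [Instance("app-v1"), Instance("app-v1")]

# Changing a running instance in place is not allowed.
try:
    fleet[0].image = "app-v2"
    mutated = True
except FrozenInstanceError:
    mutated = False

# A change means new instances; the old ones are then discontinued.
new_fleet = roll_out(fleet, "app-v2")
print(mutated, [i.image for i in new_fleet])
```

The old fleet is left untouched until it is discarded, so rolling back simply means keeping the previous instances or redeploying the previous image.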

You can find a very good description and reusable examples at these two websites:

Immutable Infrastructure with Ansible, Packer and Terraform on Azure - https://devopsand.beer/2022/03/26/immutable-Infrastructure-with-ansible-packer-and-terraform-on-azure/

Immutable Infrastructure Using Packer, Ansible, and Terraform - https://medium.com/paul-zhao-projects/immutable-infrastructure-using-packer-ansible-and-terraform-a275aa6e9ff7

Accessory tools for "scaffolding"

The automation you can build with these tools is amazing, and it saves you time and trouble (and sometimes money as a consequence). As a single individual, or as part of a small team, you are much more productive thanks to the reuse of scripts and blueprints, less troubleshooting, and faster provisioning.

When the size of the operations team, or of the organization made of different teams, grows beyond a handful of people, some coordination issues start to become visible:

If many people use the same scripts (playbooks, configuration plans, etc.) those resources need to be accessible in a central repository (generally a VCS) and you need to enforce RBAC (role based access control) to protect them.
Credentials to access the target systems cannot be stored within the code in the VCS, so you need to store them separately and pass them in as variables.
If a change is pushed to the environment, people need to be notified (even more if a pipeline fails and someone has to fix it).
Bespoke IaC pipelines can stretch across personal machines or shared VMs, creating a management nightmare.
Terraform state files contain sensitive information, which requires special handling and access control.
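As an illustration of keeping credentials out of the code and passing them in as variables, here is a minimal Python sketch that reads them from environment variables. The variable names `TARGET_USERNAME` and `TARGET_PASSWORD` and the `load_credentials` function are assumptions made for this example; in practice the values would be injected by the pipeline or fetched from a secrets store.

```python
import os


def load_credentials(env=os.environ) -> dict:
    """Read the credentials from the environment; fail early if any is missing."""
    required = ("TARGET_USERNAME", "TARGET_PASSWORD")  # hypothetical names
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError("Missing credential variables: " + ", ".join(missing))
    return {"username": env["TARGET_USERNAME"], "password": env["TARGET_PASSWORD"]}


# Example with an explicit mapping instead of the real environment.
creds = load_credentials({"TARGET_USERNAME": "automation",
                          "TARGET_PASSWORD": "example-only"})
print(creds["username"])
```

The script in the repository then contains only the names of the variables, never their values.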

So you start defining processes to work in an orderly manner, and adopting accessory tools to store the secrets (one example is Vault, to store credentials in a safe, centralized place). The governance work and the tools you start accumulating are called scaffolding, and they can rapidly become such a burden that they outweigh the advantages you've got from adopting the Infrastructure as Code approach (this happens only at a large scale and if you don't have experienced staff).

A solution for this problem is offered by the enterprise versions of the tools (both Ansible and Terraform), which are also available as a SaaS option. The paid versions - which are also supported by the vendors - offer everything you need for collaboration in large teams and spare you from investing in the creation of the whole operational framework.

I'm not saying that you absolutely need those versions, but consider that the miracles an engineer can perform with the free, single-binary, local setup of the automation tools are less likely to be seen at the scale of a larger team, when the IaC best practices are broadly adopted. There will be an inflection point where the benefits provided by the enterprise edition justify the cost of the solution.
